Multi-byte characters and 'sed'

Nigel_Garvey · December 31, 2017, 8:26pm

Every now and then, it becomes apparent that sed treats the UTF-8 text it’s handling simply as a sequence of bytes and has no conception of multi-byte characters at all. The ‘y’ command, for example, is supposed to replace every instance of a character listed between its first two delimiters with the character at the same position in the list between its second and third delimiters, like this:

do shell script "echo 'aaxyvaxz' | sed 'y/xyz/rdk/'" -- Replace all "x"s with "r"s, "y"s with "d"s, and "z"s with "k"s.
--> "aardvark"

This works when all the characters are single bytes, but with a mixture of single- and multi-byte characters, sed may substitute individual bytes from different characters …

do shell script "echo 'garçon' | sed 'y/cç/çc/'"
--> "garßcon"

… or complain that the number of characters to be replaced doesn’t match the number of characters to replace them:

do shell script "echo 'Leoś Janáček' | sed 'y/áčś/acs/'"
--> error "sed: 1: \"y/áčś/acs/\": transform strings are not the same length"
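
The length complaint makes sense once the two lists are counted in bytes rather than characters. Here is a minimal shell sketch of that check. The byte_count helper is a made-up name for illustration, and the sketch assumes the text is stored as precomposed UTF-8:

```shell
# Hypothetical helper: count the bytes in a string, whatever the locale.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count 'áčś'   # three characters, two bytes each in UTF-8
byte_count 'acs'   # three characters, one byte each

# Show the individual bytes of "ç" in hexadecimal.
printf '%s' 'ç' | od -An -tx1
```

Under those assumptions the two counts come out as 6 and 3, which is the mismatch sed reports, and od shows "ç" as the two bytes c3 a7.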

It turns out that this isn’t sed’s fault, but a consequence of the fact that the shell’s default language/locale for the interpretation of characters is C. If the working locale’s changed to one which uses UTF-8, sed’s then perfectly able to handle multi-byte characters as programmed:

do shell script "echo 'Leoś Janáček' | LC_CTYPE='en_GB' sed 'y/áčś/acs/'"
--> "Leos Janacek"

I’ve used British English here as the locale for character type, but any of the locales returned by the following script work just as well:

do shell script "locale -a | egrep '^[^_]+_[^.]+([.]UTF-8)?$'"
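
Since the set of installed locales differs from machine to machine, a script meant to run elsewhere could look one up rather than hard-coding en_GB. This is a hedged sketch only: pick_utf8_locale is a hypothetical helper, it assumes that a name ending in ".UTF-8" or ".utf8" identifies a UTF-8 locale, and it falls back to C when nothing suitable is listed:

```shell
# Hypothetical helper: print the first installed locale whose name ends in
# ".UTF-8" (or ".utf8"), or "C" if none is found or locale -a is unavailable.
pick_utf8_locale() {
    found=$(locale -a 2>/dev/null | grep -i -E '[.]utf-?8$' | head -n 1 || true)
    printf '%s\n' "${found:-C}"
}

pick_utf8_locale
```

It could then be used as LC_CTYPE="$(pick_utf8_locale)" in front of sed. If it falls back to C, sed will of course behave as in the earlier byte-wise examples.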

There’s also a choice of which environment variable to set: LC_ALL, LC_CTYPE, or LANG. Setting any of them to your preferred locale will work. But in the unlikely event of two or more of them being set in the same shell script, and to different values, LC_ALL takes priority over LC_CTYPE, which in turn takes priority over LANG.
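
The order of precedence can be mimicked in plain shell. This is only a sketch of the lookup for the character-type category, not how sed itself is implemented. effective_ctype is a hypothetical helper, and it assumes that an empty variable counts as unset and that the final fallback is C:

```shell
# Hypothetical helper: report which locale name would govern character
# handling, checking LC_ALL first, then LC_CTYPE, then LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

effective_ctype
```

With LC_ALL set, the other two are ignored. With only LC_CTYPE and LANG set, LC_CTYPE wins.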

I think that from now on, I’ll explicitly set a locale whenever I use sed. Handling UTF-8 text as such has to be a good idea, even if it doesn’t usually make any difference. The same goes for grep, whose ‘man’ spiel also notes the influence of the environment variables. awk returns the length of “garçon” as “7”, but setting the shell’s environment variables doesn’t seem to affect this. I don’t know awk well enough to know if or how it can be changed.

I couldn’t resist finishing with a celebratory flourish:

use AppleScript version "2.5"
use scripting additions

do shell script "echo 'x-93yß<⌘wß⌘-r7ßßq' | LC_ALL='en_GB' sed 'y?<379-x⌘qß?Np!paHe" & character id 127863 & "Y ?'"

bmose · January 1, 2018, 6:08pm

And same to you, Nigel!

Very helpful information when using sed with multi-byte characters.