Thursday, October 18, 2018

#1 2017-12-31 02:26:49 pm

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 4692

Multi-byte characters and 'sed'

Every now and then, it becomes apparent that sed treats the UTF-8 text it's handling simply as a sequence of bytes and that it actually has no conception of multi-byte characters at all. The 'y' command, for example, is supposed to replace every instance of a character listed between its first two delimiters with the same-numbered character in the list between its second and third delimiters, like this:

Applescript:

do shell script "echo 'aaxyvaxz' | sed 'y/xyz/rdk/'" -- Replace all "x"s with "r"s, "y"s with "d"s, and "z"s with "k"s.
--> "aardvark"

This works when all the characters are single bytes, but with a mixture of single- and multi-byte characters, sed may substitute individual bytes from different characters …

Applescript:

do shell script "echo 'garçon' | sed 'y/cç/çc/'"
--> "garßcon"

… or complain that the number of characters to be replaced doesn't match the number of characters to replace them:

Applescript:

do shell script "echo 'Leoś Janáček' | sed 'y/áčś/acs/'"
--> error "sed: 1: \"y/áčś/acs/\": transform strings are not the same length"
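Both results are what you'd expect if the 'y' lists are paired up byte by byte instead of character by character. The Python sketch below only imitates that behaviour; it isn't sed's actual code, and the function name bytewise_y is made up for the illustration. The Mac Roman decoding near the end is an assumption about how the stray byte comes to be displayed as "ß".

```python
# Rough imitation of a byte-wise 'y' command (illustration only, not sed's real code).

def bytewise_y(text: str, source: str, dest: str) -> bytes:
    """Transliterate the UTF-8 bytes of text, pairing source and dest byte by byte."""
    src, dst = source.encode("utf-8"), dest.encode("utf-8")
    if len(src) != len(dst):
        # The situation sed reports as "transform strings are not the same length".
        raise ValueError(f"{len(src)} bytes cannot be paired with {len(dst)} bytes")
    return text.encode("utf-8").translate(bytes.maketrans(src, dst))


# "c" is one byte (63) and "ç" is two (C3 A7), so the pairs are 63->C3, C3->A7, A7->63.
mangled = bytewise_y("garçon", "cç", "çc")
print(mangled.hex(" "))             # 67 61 72 a7 63 6f 6e
# Assumption: the result is no longer valid UTF-8 and gets shown as Mac Roman instead.
print(mangled.decode("mac_roman"))  # garßcon

# "áčś" is six bytes while "acs" is three, so a byte-wise pairing is impossible.
print(len("áčś".encode("utf-8")), len("acs".encode("utf-8")))  # 6 3
```

In the first case the two lists happen to be three bytes each, so the substitution goes ahead and splits "ç" in two. In the second the byte counts differ and the pairing fails before it starts.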

It turns out that this isn't sed's fault, but a consequence of the fact that the shell's default language/locale for the interpretation of characters is C. If the working locale's changed to one which uses UTF-8, sed's then perfectly able to handle multiple-byte characters as programmed:

Applescript:

do shell script "echo 'Leoś Janáček' | LC_CTYPE='en_GB' sed 'y/áčś/acs/'"
--> "Leos Janacek"
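For comparison, here's a minimal Python sketch of the same transliteration done on whole characters, which is the behaviour sed shows once it has a UTF-8 locale. The function name charwise_y is hypothetical and the code says nothing about how sed is implemented.

```python
# Character-wise counterpart of sed's 'y' (illustration only).

def charwise_y(text: str, source: str, dest: str) -> str:
    """Replace each character found in source with the one at the same position in dest."""
    # str.maketrans raises ValueError if the two lists differ in length.
    return text.translate(str.maketrans(source, dest))


print(charwise_y("Leoś Janáček", "áčś", "acs"))  # Leos Janacek
print(charwise_y("garçon", "cç", "çc"))          # garcon
```

Here each list is three characters long however many bytes the characters occupy, so the lengths match and "ç" is swapped as a unit.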

I've used British English here as the locale for character type, but any of the locales returned by the following script work just as well:

Applescript:

do shell script "locale -a | egrep '^[^_]+_[^.]+(.UTF-8)?$'"
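To see what that pattern lets through, here's a small Python sketch applying the same expression (with the dot before "UTF-8" escaped) to a made-up sample. The names in the list are only assumed to resemble what `locale -a` prints; the real list varies from system to system.

```python
import re

# Same filter as the egrep pattern, with the dot before "UTF-8" escaped.
PATTERN = re.compile(r"^[^_]+_[^.]+(\.UTF-8)?$")

# Made-up sample of locale names (an assumption, not real `locale -a` output).
sample = ["C", "POSIX", "en_GB", "en_GB.ISO8859-1", "en_GB.UTF-8", "fr_FR.UTF-8"]

print([name for name in sample if PATTERN.search(name)])
# ['en_GB', 'en_GB.UTF-8', 'fr_FR.UTF-8']
```

So the filter keeps language_TERRITORY names that are either bare or carry a ".UTF-8" suffix, and drops "C", "POSIX" and anything naming another encoding.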

There's also a choice of which environment variable to set: LC_ALL, LC_CTYPE, or LANG. Setting any of them to your preferred locale will work. But in the unlikely event of two or more of them being set in the same shell script, and to different values, LC_ALL takes priority over LC_CTYPE, which in turn takes priority over LANG.
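The order of priority can be written out as a short Python sketch. The function name effective_ctype is invented for the illustration, and the "C" fallback when nothing is set is an assumption based on the default described above.

```python
# Sketch of the lookup order for the character-type locale (illustration only).

def effective_ctype(env: dict) -> str:
    """Return the locale name that governs character handling for this environment."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # assumed fallback when none of the three is set


print(effective_ctype({"LANG": "en_GB.UTF-8"}))                       # en_GB.UTF-8
print(effective_ctype({"LANG": "en_GB.UTF-8", "LC_CTYPE": "fr_FR"}))  # fr_FR
print(effective_ctype({"LC_ALL": "C", "LC_CTYPE": "fr_FR"}))          # C
print(effective_ctype({}))                                            # C
```

The third call shows the one case that could bite: an LC_ALL of "C" set elsewhere in the same command would override a UTF-8 LC_CTYPE.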

I think that from now on, I'll explicitly set a locale whenever I use sed. Handling UTF-8 text as such has to be a good idea, even if it doesn't usually make any difference. The same goes for grep, whose 'man' spiel also notes the influence of the environment variables. awk returns the length of "garçon" as "7", but setting the shell's environment variables doesn't seem to affect this. I don't know awk well enough to know if or how it can be changed.
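The "7" matches the number of bytes in the UTF-8 form of the word, not the number of characters. The Python lines below only show where that figure plausibly comes from; they say nothing about how awk might be persuaded to count characters.

```python
word = "garçon"

print(len(word))                  # 6 characters
print(len(word.encode("utf-8")))  # 7 bytes, since "ç" takes two
```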

I couldn't resist finishing with a celebratory flourish. :)

Applescript:

use AppleScript version "2.5"
use scripting additions

do shell script "echo 'x-93yß<⌘wß⌘-r7ßßq' | LC_ALL='en_GB' sed 'y?<379-x⌘qß?Np!paHe" & character id 127863 & "Y ?'"

Last edited by Nigel Garvey (2017-12-31 03:01:37 pm)


NG


#2 2018-01-01 12:08:11 pm

bmose
Member
From: Massachusetts
Registered: 2006-01-03
Posts: 281

Re: Multi-byte characters and 'sed'

And same to you, Nigel!

Very helpful information when using sed with multi-byte characters.
