Writing unicode characters in plain text successfully?

It’s my first time logging variables into plain text files.
The point is to create a new text file if one doesn’t exist and, if it does, to append to the bottom of it. I’m using this neat tip I found to have it happen in the background:

set TheFileName to "Just testing" as text
set ContentTesting to "─│┃╯┣┗┏ ▒ ■ ╾ ▸ testing" as text
set the textFile to (("Users:YOURNAME:Desktop:") & (TheFileName & ".txt")) as text
open for access file the textFile with write permission
write (ContentTesting & return) to file the textFile starting at eof
close access file the textFile

The problem is that plain text records with many different types of information can get quite difficult to read, so I’d like the script to write using box-drawing characters. If I open TextEdit, type some of those characters along with ordinary text, then hit “Make Plain Text”, the letters lose their styling while the box-drawing characters survive the de-formatting and stay as they are. (Font: Menlo Regular 11)
If I copy the desired box-drawing characters into Script Editor, the script compiles without problem and the characters keep their form. But with the script above, the box characters end up as question marks in the text file. Writing them as Unicode text didn’t make a difference. I also thought of saving the special characters as template text files on disk so the script could read from them and quote, but I still end up with question marks… Is this impossible?

Hi.

“Plain text” means that the text isn’t styled, i.e. it’s no longer RTF. It’s still Unicode. To write it to file, you have to write it ‘as Unicode text’ or, preferably, ‘as «class utf8»’. Most text applications will be able to recognise the latter when they read the file. Certainly TextEdit will.

set TheFileName to "Just testing" as text
set ContentTesting to "─│┃╯┣┗┏ ▒ ■ ╾ ▸ testing"
set the textFile to (path to desktop as text) & (TheFileName & ".txt")
-- keep the reference returned by 'open for access' and reuse it
set accessRef to (open for access file the textFile with write permission)
try
	-- write as UTF-8 so the box-drawing characters survive
	write (ContentTesting & return) to accessRef as «class utf8» starting at eof
end try
-- the 'try' wrapper ensures the file still gets closed if the write errors
close access accessRef
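And if the script itself ever needs to read the file back, the same class code works on the read side; a one-liner, reusing the textFile variable from above:

set fileContents to read file the textFile as «class utf8»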

Oh smooth, that works! Thank you!
For some reason, “as Unicode text” indeed didn’t work for me, but «class utf8» does, and it’s a trick I didn’t know of until now.

‘as Unicode text’ can be problematic in the ‘write’ command, as it causes the text to be written to the file in big-endian UTF-16 form without a byte-order mark. This is for historical reasons dating back to when Macs had big-endian processors. Not all applications recognise it or know how to deal with it.

‘as «class utf8»’, on the other hand, causes the text to be written in UTF-8 form, which is what applications tend to expect in text files nowadays anyway. For some reason, the «class utf8» class code has never been given its own keyword.
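A quick way to see the difference is to write a single accented character each way and read the raw bytes back. A sketch, assuming a scratch file on the desktop (the file name is just an example):

set probeFile to (path to desktop as text) & "encoding probe.txt"

set fRef to (open for access file probeFile with write permission)
set eof fRef to 0
write "é" to fRef as Unicode text
close access fRef
read file probeFile as data
--> «data rdat00E9» (big-endian UTF-16, no BOM)

set fRef to (open for access file probeFile with write permission)
set eof fRef to 0
write "é" to fRef as «class utf8»
close access fRef
read file probeFile as data
--> «data rdatC3A9» (UTF-8)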

Caused by Apple’s own recommended read/write APIs :wink:

It’s true it doesn’t write a BOM, but given that the OP is asking about “writing to the bottom” of files, a BOM may not help – indeed, you probably don’t want one anywhere but at the beginning. You can get a BOM using «class ut16», so assuming you’re writing the whole file, you could use «class ut16» for the first write, and Unicode text for the rest. Fiddly perhaps, but workable.
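A sketch of that idea, reusing accessRef and ContentTesting from the script above:

if ((get eof accessRef) is 0) then
	-- first write: «class ut16» includes a BOM
	write (ContentTesting & return) to accessRef as «class ut16»
else
	-- later writes: big-endian UTF-16
	write (ContentTesting & return) to accessRef as Unicode text starting at eof
end if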

The bigger issue with appending text is consistency. Appending UTF-8 often “works”, in English at least, even when the original was written in some other encoding, simply because the ASCII range is common to many encodings. But it can also introduce problems if the original text is less basic.
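To illustrate (a sketch; a plain ‘write’ with no ‘as’ parameter falls back to the system’s primary encoding, typically MacRoman on a Western system):

set mixFile to (path to desktop as text) & "mixed probe.txt"
set fRef to (open for access file mixFile with write permission)
set eof fRef to 0
write "cafe" & return to fRef -- pure ASCII: the bytes are identical in MacRoman and UTF-8
write "café" & return to fRef as «class utf8» starting at eof
close access fRef
-- reading this back as UTF-8 is fine, but if the first line had been "café"
-- written with a plain 'write', its one-byte é wouldn't be valid UTF-8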

Anyway, I reckon you need a good reason to use anything other than UTF-8 these days.

The problem with that is that writes ‘as «class ut16»’ have the native endianity (?) of the writing computer, which means little-endian on Intel Macs. The first line would come out right when the file was read back, but the rest, written in big-endian form, would have the “Chinese characters” effect. It would be better to write every line ‘as Unicode text’, with a big-endian BOM at the beginning of the file:

-- if the file's empty, start it with a big-endian BOM (U+FEFF as raw data)
if ((get eof accessRef) is 0) then write «data rdatFEFF» to accessRef
-- then append each line as big-endian UTF-16
write (ContentTesting & return) to accessRef as Unicode text starting at eof

Ah, I missed that. All the more reason to stick with UTF-8.

Endianness :slight_smile: LE and BE are terms from Gulliver’s Travels by the way.

True. For streams, like disk or network, UTF-8 is the better choice, while UTF-16 and UTF-32 are much better for processing. So inside a program you use UTF-16 or UTF-32, and you use UTF-8 when the text leaves the program. In fact, write ‘as «class utf8»’ does exactly that, even if the scripter isn’t aware of it.

p.s. If you write a program that is the only user of the text file (it isn’t meant to be read by the user or by other programs), it’s safe and faster to dump the wchar_t memory to disk rather than re-encode it on every write.

And I’ve learnt that from a Dutchman! :lol:

Actually, I’ve always used “endianness” in the past, but this time “endianity” appealed to my English speaker’s sense of how to form a qualitative noun from an adjective ending in “-an”. “Endianness” suggests the fact of being endian, or the degree to which something is endian, more than it does the type or quality of, er, endianity possessed. I see both forms are used in the latter sense out there on the Web. Both are pretty horrible. :wink:

:slight_smile: I have to come clean here. I read the endian.h and OSByteOrder.h headers, which I needed for AppleScript Toolbox because AppleEvent fourCharCodes (32-bit integers) still have the old PPC byte order. In the comments you’ll read “Endianess”, “Endian’ness”, “Endian-ness” and “Endianness”. For my documentation I had to look up on the web which spelling was correct, because it was confusing :rolleyes:

Are you sure it’s going to be faster? I remember when Apple introduced binary .plist files, the argument was that the extra time spent encoding was swamped by the savings in I/O time because of the smaller file size. At least for unaccented text in Roman-alphabet languages, UTF-8 is going to be considerably smaller than wchar_t.
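A rough check at the AppleScript level (UTF-16 here rather than wchar_t, which on macOS is wider still at 32 bits; the file name is just an example):

set sample to "plain roman text" -- 16 ASCII characters
set sizeFile to (path to desktop as text) & "size probe.txt"

set fRef to (open for access file sizeFile with write permission)
set eof fRef to 0
write sample to fRef as Unicode text
set utf16Size to (get eof fRef) --> 32 bytes
close access fRef

set fRef to (open for access file sizeFile with write permission)
set eof fRef to 0
write sample to fRef as «class utf8»
set utf8Size to (get eof fRef) --> 16 bytes
close access fRef

{utf16Size, utf8Size} --> {32, 16}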

I have to answer this with two different subjects:

XML’s processing and data overhead is much bigger than any UTF-16 vs. UTF-8 difference:
The reason for the binary plist is that it has less overhead. The XML file is not only filled with unnecessary white space; it has to be parsed and validated before it can be transformed into opaque types (classes). It also leaves a large memory footprint. In general, XML-based files carry a lot of overhead in markup characters. Take the following example, in the spirit of the one in the man page (the key names here are illustrative):
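<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
	<key>first</key>
	<string>john</string>
	<key>second</key>
	<string>Kyra</string>
</dict>
</plist>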

An example like that contains around 128 characters, while the values, john and Kyra, contain only 8; the string is some 16 times bigger purely because of markup. With binary files you can read data right into C structs (classes); the binary plist does contain headers with the offsets of all keys and values, but those are all just digits. So in the end the XML file is not only bigger in size, it also requires a lot more processing than the binary one.

CFString can have an 8-bit backing store
So why are all the strings stored as UTF-8 in a binary plist? It makes sense when you look at what the documentation says about the CFString backing store.

As the documentation says, a CFString can have an 8-bit character backing store, but a 16-bit one as well. That means that when it’s written to a file, it’s uncertain which internal encoding is in use, so no matter which file encoding you choose, you’ll end up re-encoding anyway. That being so, choosing the encoding with the smallest size is the obvious choice.

Conclusion:
Binary is chosen because reading and writing are faster; XML is slow. UTF-8 strings are chosen within the binary format because you cannot simply dump the CFString storage buffer into a file.

I’m trusting my memory, which isn’t always a wise thing to do, but I’m pretty sure it was acknowledged at the time that processing binary property lists took longer – the payback was the quicker I/O from the smaller size.

It’s both I/O speed and compactness: the I/O rate stays the same, so smaller data takes less time to read or write. It’s a win-win: the I/O is faster and the file is smaller in size.

Every character of an XML file requires processing; with binary, only the structure and type information is processed, not the actual data itself. A read is done in three steps: first read the data type, then the data size, and finally the data. The data can be read as a whole and doesn’t need interpreting character by character the way XML does; most of the binary file is simply copied without interpretation.

For the record: I’m not saying you’re wrong, but the current Apple documentation says otherwise; how it was in the past I can’t say :slight_smile: