That works great… thanks Shane. I have mostly used the shell to create text files. If I wanted to append the text to the next line of the file each time, is this handler structured properly?
Write to the end of the file letting the write command open/close the file for you (assuming it already exists):
write formattedRecordText to file csvPosixPath as «class ut16» starting at eof
Open the file for access and just write, letting the file-marker be moved automatically, and close when finished.
set fileRef to (open for access csvPosixPath with write permission)
-- move file-marker to end
read fileRef as «class ut16» for 0
-- then just do writes
repeat
	-- whatever to get formattedRecordText
	write formattedRecordText to fileRef as «class ut16»
end repeat
close access fileRef
The latter is more efficient in a tight loop, whereas the former is more convenient if you are writing less often.
It does look like AS might be writing BOMs with each write either way, though. I don’t know how Excel will cope with that. If worse comes to worst, you can write as utf8, then at the end read and re-write as utf16 in one go.
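Something like this, say (a rough sketch, reusing the csvPosixPath and formattedRecordText names from above):
-- During the run, append each record as UTF-8 (no per-write BOM to worry about):
write formattedRecordText to csvPosixPath as «class utf8» starting at eof
-- At the very end, read the whole file back and re-write it as UTF-16 in one go:
set fileRef to (open for access csvPosixPath with write permission)
try
	set wholeText to read fileRef as «class utf8»
	set eof fileRef to 0 -- truncate before re-writing
	write wholeText to fileRef as «class ut16»
end try
close access fileRef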
Interesting! ‘as «class ut16»’ writes in machine-native format with a BOM! (‘as Unicode text’ writes in big-endian format without one.) Learned something new.
Or write ‘as Unicode text’ after having made sure there’s a big-endian BOM (‘«data rdatFEFF»’) at the beginning of the file. Presumably, if Excel (or something in Mac OS) knows not to count a little-endian BOM at the beginning as part of the text, it would know how to handle a big-endian one.
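Something along these lines, perhaps (just a sketch, again assuming the csvPosixPath and formattedRecordText variables from earlier):
set fileRef to (open for access csvPosixPath with write permission)
try
	if (get eof fileRef) is 0 then write «data rdatFEFF» to fileRef -- big-endian UTF-16 BOM, only while the file is still empty
	write formattedRecordText to fileRef as Unicode text starting at eof
end try
close access fileRef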
This actually sets the file-marker to the beginning of the file, which isn’t necessary immediately after ‘open for access’. It’s also not really necessary to specify a read mode if you’re only reading for zero bytes.
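So if the aim is simply to append, it may be easier to skip moving the file-marker altogether and put ‘starting at eof’ on the writes themselves, something like this sketch of Shane’s loop above:
set fileRef to (open for access csvPosixPath with write permission)
repeat
	-- whatever to get formattedRecordText
	write formattedRecordText to fileRef as «class ut16» starting at eof
end repeat
close access fileRef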
Just for the record: the lack of UTF-8 support is a known bug in Microsoft Office and hasn’t been fixed for the last decade. The funny thing is that you would expect the same bug in Microsoft Word, because it uses the same encoding panel and probably shares application data; fortunately, the bug has been fixed in Word 2011 (Unicode version 6.1).
That’s one of the reasons why I only have Excel on my machine for when clients send me xlm files. All the other handling is done by a version of open office on my machine, which not only supports a wider variety of encodings but also handles larger csv files much better than Microsoft’s sheet editor. Microsoft’s Excel is a good and solid piece of software, as long as you don’t run it on a Mac.
@Shane: Isn’t using raw event codes in AppleScript a little bit against your own rules, like with ISO datetimes?
Yes, it used to be that doing the first write as ut16 and following up with as Unicode text was the way to write UTF16.
Normally I’d agree, but .csv, and Excel’s support thereof, has an odd reputation. Part of the problem is that the only real .csv spec does say ASCII. I gather using other delimiters can make it behave better.
The difference is that the use of ut16 was originally suggested by an AppleScript engineer, at the same time that he suggested using as utf8 for utf8. Who am I to argue with that sort of authority?
(And yes, I think it’s rather sad that there isn’t terminology for both, but especially utf8, after all this time.)
I would like to create the file if it doesn’t exist; that’s why I used open for access … ?
If file is written to every 30-60 seconds, should I keep it open?
I have read several posts about the need for proper error handling with open for access. What is the best way to accomplish this with proper error handlers?
Thanks to everyone’s help, the script is creating properly formatted csv files encoded as UTF16. When I open the file in Excel, all characters display as expected. HOWEVER, Excel no longer honors the csv format and puts all of the data in the first column of each row.
I then copied and pasted the UTF16 file into a new text file and saved it as UTF8. This time, Excel places the data in the proper fields, HOWEVER it displays the characters incorrectly again.
A file can be opened many times by many different processes for read access, but it can only be opened once, system wide, with write access. Keeping a file open for write access and closing it each time both have their pros and cons. In general, binary files are opened with write access and kept open until you close the associated document on your screen. ASCII files are opened with read access and closed immediately afterwards; this means the file is also opened when the application wants to save data and then closed again. Because CSV is an ASCII-style file type, I would say close it every time. The system doesn’t have any problem with opening and closing a file so frequently; there are files in your system, like pipes, that are opened and closed several times per second.
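If you go the close-every-time route, one way to package it is a small handler that opens, writes and closes on each call (a sketch; appendRecord is just a made-up name):
on appendRecord(recordText, csvPosixPath)
	set fileRef to (open for access csvPosixPath with write permission)
	try
		write recordText to fileRef as «class ut16» starting at eof
	on error errMsg number errNum
		close access fileRef -- make sure the access isn’t left open
		error errMsg number errNum
	end try
	close access fileRef
end appendRecord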
Can you create a tab-delimited text file instead? As far as I know, the problem is that Excel expects a single-byte “;”, while the delimiter you’re actually writing is now a two-byte value. This bug hasn’t been confirmed, but a lot of import tools have this kind of bug. Maybe it works when you manually import the CSV file into field A1.
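If you do try tab-delimited, each line is just a matter of swapping the delimiter when you coerce the field list to text, e.g. (a sketch; fieldList stands for whatever values make up one record):
set fieldList to {"Name", "Café", "42"} -- example record
set savedTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to tab
set formattedRecordText to (fieldList as text) & linefeed
set AppleScript's text item delimiters to savedTIDs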
When I create the file, I write space as utf8 to add the UTF8 BOM. I then write the rest of the data as UTF16.
My final question is, what is the least invasive way of writing UTF8 to “initialize” the file after it is created? What is the best way of adding the UTF8 BOM to the file?
tell application "System Events" to set csvFileExists to (exists file csvPosixPath)
if not csvFileExists then
set fileRef to (open for access csvPosixPath with write permission)
write space to fileRef as «class utf8»
close access csvPosixPath
end if
set csvData to linefeed & csvData
write csvData to csvPosixPath as «class ut16» starting at eof
set fileRef to (open for access csvPosixPath with write permission) -- Creates the file too if it doesn't already exist
try -- Everything that happens while the access is open should be in a 'try' statement to ensure that the access isn't left open after an error.
	write «data rdatEFBBBF» to fileRef -- UTF-8 Byte-Order Mark (three bytes).
end try
close access fileRef
The purpose of a UTF-8 BOM at the beginning of the file is to tell the application reading it explicitly that the rest of the data is UTF-8 Unicode text. There’s no point in having a UTF-8 BOM and then writing everything else as UTF-16.
Actually, in this case there is (until you correct me ;)). With a UTF8 csv, Excel recognizes the delimited fields but doesn’t display certain characters correctly. With a UT16 csv, the characters display correctly, but the delimited fields aren’t recognized and all of the data ends up in column A. However, by prepending a UTF8 BOM to the csv file and then writing the rest in UT16, both the characters and the delimited fields are recognized.
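For anyone skimming later, the combination that ended up working here boils down to something like this (a sketch pieced together from the posts above):
-- Create the file and give it a UTF-8 BOM only while it's still empty...
set fileRef to (open for access csvPosixPath with write permission)
try
	if (get eof fileRef) is 0 then write «data rdatEFBBBF» to fileRef -- UTF-8 BOM (three bytes)
end try
close access fileRef
-- ...then append each record as UTF-16.
write linefeed & csvData to csvPosixPath as «class ut16» starting at eof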
I can’t argue with success (although I hope you try some bigger files), but the need to do such an illogical thing is yet another reason to consider using tabbed text rather than csv. The idea honestly makes me shudder…