That works great… thanks Shane. I have mostly used the shell to create text files. If I wanted to append the text to the next line of the file each time, is this handler structured properly?
Write to the end of the file letting the write command open/close the file for you (assuming it already exists):
write formattedRecordText to file csvPosixPath as «class ut16» starting at eof
Open the file for access and just write, letting the file-marker be moved automatically, and close when finished.
set fileRef to (open for access csvPosixPath with write permission)
-- move file-marker to end
read fileRef as «class ut16» for 0
-- then just do writes
repeat
	-- whatever to get formattedRecordText
	write formattedRecordText to fileRef as «class ut16»
end repeat
close access fileRef
The latter is more efficient in a tight loop, whereas the former is more convenient if you are writing less often.
It does look like AS might be writing BOMs with each write either way, though. I don’t know how Excel will cope with that. If worse comes to worst, you can write as utf8, then at the end read and re-write as utf16 in one go.
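Something like this, say (a rough sketch, reusing the csvPosixPath and formattedRecordText names from above):
-- During the run, append each record as UTF-8 (no per-write BOM to worry about):
write formattedRecordText to csvPosixPath as «class utf8» starting at eof
-- At the very end, read the whole file back and re-write it as UTF-16 in one go:
set fileRef to (open for access csvPosixPath with write permission)
try
	set wholeText to read fileRef as «class utf8»
	set eof fileRef to 0 -- truncate before re-writing
	write wholeText to fileRef as «class ut16»
end try
close access fileRef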
Interesting! ‘as «class ut16»’ writes in machine-native format with a BOM! (‘as Unicode text’ writes in big-endian format without one.) Learned something new.
Or write ‘as Unicode text’ after having made sure there’s a big-endian BOM (‘«data rdatFEFF»’) at the beginning of the file. Presumably, if Excel (or something in Mac OS) knows not to count a little-endian BOM at the beginning as part of the text, it would know how to handle a big-endian one.
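Something along these lines, perhaps (just a sketch, again assuming the csvPosixPath and formattedRecordText variables from earlier):
set fileRef to (open for access csvPosixPath with write permission)
try
	if (get eof fileRef) is 0 then write «data rdatFEFF» to fileRef -- big-endian UTF-16 BOM, only while the file is still empty
	write formattedRecordText to fileRef as Unicode text starting at eof
end try
close access fileRef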
This actually sets the file-marker to the beginning of the file, which isn’t necessary immediately after ‘open for access’. It’s also not really necessary to specify a read mode if you’re only reading for zero bytes.
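So if the aim is simply to append, it may be easier to skip moving the file-marker altogether and put ‘starting at eof’ on the writes themselves, something like this sketch of Shane’s loop above:
set fileRef to (open for access csvPosixPath with write permission)
repeat
	-- whatever to get formattedRecordText
	write formattedRecordText to fileRef as «class ut16» starting at eof
end repeat
close access fileRef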
Just for the record: the lack of UTF-8 support is a known bug in Microsoft Office and hasn’t been fixed for the last decade. The funny thing is that you would expect the same bug in Microsoft Word, because it uses the same encoding panel and probably shares application data; fortunately, the bug has been fixed in Word 2011 (Unicode version 6.1).
That’s one of the reasons why I only have Excel on my machine for when clients send me xlm files. All the other handling is done by a version of open office on my machine, which not only supports a wider variety of encodings but also handles larger csv files much better than Microsoft’s sheet editor. Microsoft’s Excel is a good and solid piece of software, as long as you don’t run it on a Mac.
@Shane: Isn’t using raw event codes in AppleScript a little bit against your own rules, like with ISO datetimes?
Yes, it used to be that doing the first write as ut16 and following up with as Unicode text was the way to write UTF16.
Normally I’d agree, but .csv, and Excel’s support thereof, has an odd reputation. Part of the problem is that the only real .csv spec does say ASCII. I gather using other delimiters can make it behave better.
The difference is that the use of ut16 was originally suggested by an AppleScript engineer, at the same time that he suggested using as utf8 for utf8. Who am I to argue with that sort of authority?
(And yes, I think it’s rather sad that there isn’t terminology for both, but especially utf8, after all this time.)
I would like to create the file if it doesn’t exist; that’s why I used open for access … ?
If file is written to every 30-60 seconds, should I keep it open?
I have read several posts about the need for proper error handling with open for access. What is the best way to accomplish this with proper error handlers?
Thanks to everyone’s help, the script is creating properly formatted csv files encoded as UTF16. When I open the file in Excel, all characters display as expected. HOWEVER, Excel no longer honors the csv format and puts all of the data in the first column of each row.
I then copied and pasted the UTF16 file into a new text file and saved it as UTF8. This time, Excel places the data in the proper fields, HOWEVER it displays the characters incorrectly again.
A file can be opened many times by many different processes for read access, but it can only be opened once, system wide, with write access. Keeping a file open for write access and closing it each time both have their pros and cons. In general, binary files are opened with write access and kept open until you close the associated document on your screen. ASCII files are opened with read access and closed immediately afterwards; this means the file is also opened when the application wants to save data and then closed again. Because CSV is an ASCII-style file type, I would say close it every time. The system doesn’t have any problem with opening and closing a file so frequently; there are files in your system, like pipes, that are opened and closed several times per second.
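If you go the close-every-time route, one way to package it is a small handler that opens, writes and closes on each call (a sketch; appendRecord is just a made-up name):
on appendRecord(recordText, csvPosixPath)
	set fileRef to (open for access csvPosixPath with write permission)
	try
		write recordText to fileRef as «class ut16» starting at eof
	on error errMsg number errNum
		close access fileRef -- make sure the access isn’t left open
		error errMsg number errNum
	end try
	close access fileRef
end appendRecord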
Can you create a tab-delimited text file instead? As far as I know, the problem is that Excel expects a single-byte “;”, while the delimiter you’re actually writing is now a two-byte value. This bug hasn’t been confirmed, but a lot of import tools have this kind of bug. Maybe it works when you manually import the CSV file into field A1.
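If you do try tab-delimited, each line is just a matter of swapping the delimiter when you coerce the field list to text, e.g. (a sketch; fieldList stands for whatever values make up one record):
set fieldList to {"Name", "Café", "42"} -- example record
set savedTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to tab
set formattedRecordText to (fieldList as text) & linefeed
set AppleScript's text item delimiters to savedTIDs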
When I create the file, I write space as utf8 to add the UTF8 BOM. I then write the rest of the data as UTF16.
My final question is, what is the least invasive way of writing UTF8 to “initialize” the file after it is created? What is the best way of adding the UTF8 BOM to the file?
tell application "System Events" to set csvFileExists to (exists file csvPosixPath)
if not csvFileExists then
set fileRef to (open for access csvPosixPath with write permission)
write space to fileRef as «class utf8»
close access csvPosixPath
end if
set csvData to linefeed & csvData
write csvData to csvPosixPath as «class ut16» starting at eof
set fileRef to (open for access csvPosixPath with write permission) -- Creates the file too if it doesn't already exist
try -- Everything that happens while the access is open should be in a 'try' statement to ensure that the access isn't left open after an error.
	write «data rdatEFBBBF» to fileRef -- UTF-8 Byte-Order Mark (three bytes).
end try
close access fileRef
The purpose of a UTF-8 BOM at the beginning of the file is to tell the application reading it explicitly that the rest of the data is UTF-8 Unicode text. There’s no point in having a UTF-8 BOM and then writing everything else as UTF-16.
Actually, in this case there is (until you correct me ;)). With a UTF8 csv, Excel recognizes the delimited fields but doesn’t display certain characters correctly. With a UT16 csv, the characters display correctly, but the delimited fields aren’t recognized and all of the data ends up in column A. However, by prepending a UTF8 BOM to the csv file and then writing the rest in UT16, both the characters and the delimited fields are recognized.
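For anyone skimming later, the combination that ended up working here boils down to something like this (a sketch pieced together from the posts above):
-- Create the file and give it a UTF-8 BOM only while it's still empty...
set fileRef to (open for access csvPosixPath with write permission)
try
	if (get eof fileRef) is 0 then write «data rdatEFBBBF» to fileRef -- UTF-8 BOM (three bytes)
end try
close access fileRef
-- ...then append each record as UTF-16.
write linefeed & csvData to csvPosixPath as «class ut16» starting at eof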
I can’t argue with success (although I hope you try some bigger files), but the need to do such an illogical thing is yet another reason to consider using tabbed text rather than csv. The idea honestly makes me shudder…