Convert XLS files into CSV preserving UTF-8 encoding

That is all it takes, and 16€ :smiley:

Now, if the op inserts MACCENTRALEUROPE for MACROMAN or CP1250 if that doesn’t work, his problems should be solved.

I come to think of it, that if he meets a comma, and the values are comma separated, then this may lead to problems, , may he is better off using “;” for “,” when generating the csv.

He can choose his field separator in awk anyways. with FS=“;”.

Sorry guys, maybe I didn’t make clear enough what I need to do.

I often get Excel files that contain the same text in different languages, so I need to merge them to create an import file that needs to be CSV UTF-8.

So the workflow would be:
1 “ Select files to merge
2 - If they are xls|xlsx files convert them to UTF-8 csv
3 - Type output filename and select output folder
4 - Merge keeping the header row only of the first file

Point 4 can be achieved using the following script; points 1, 2 and 3 are still a question mark eheh :slight_smile:

set theFiles to choose file with multiple selections allowed
set fileName to text returned of (display dialog "Type file name:" default answer ".csv")
set saveFile to POSIX path of (((path to desktop) as text) & fileName)

do shell script "cp " & quoted form of POSIX path of item 1 of theFiles & space & quoted form of saveFile
if (count of theFiles) = 1 then return
repeat with fileToAdd in theFiles
	set x to quoted form of POSIX path of fileToAdd
	do shell script "cat " & x & " | awk 'BEGIN {RS=\"\\r\\n\"; getline}{print $0}' >> " & quoted form of saveFile
end repeat

(1) I clearly wrote the cost in my precedent message.

(2) I made my tests running the system in French so the csv files were using the semi-colon as delimiter :slight_smile:

I also tried to replace the semi-colons by TAB characters and save the edited file as xyz.txt.

Excel import it wrongly too.
It seams that Microsoft didn’t fully understand what the switch to Unicode imply.

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 15:54:20

What are you wanting to do with the merged CSV file ?

If you want to open it with Excel, you will always get odd results.

Exporting the csv as xls with Numbers do the wanted trick (at least with the encoding used by Excel in French).
I’m not sure with the one used for Czech.

If you really need to open the final file in Excel, I will try to get a link allowing you to download the trial version of iWork '09 (Apple removed it from its servers).

[b]You may get the trial version thru :

https://discussions.apple.com/message/19118374

the wanted link is in Peggy’s message dated 31 juil. 2012 18:33[/b]

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 16:01:34

Hello!

Try accumulating all your output in one string pendolo, that you write out when your reading is done, and see if that helps!

Enjoy your day, I’m out! :slight_smile:

it needs to be imported into a content management system, it doesn’t need to open in Excel. Only some some source files are received in xls so they need to be converted into csv.

No need to open the final file in Excel

Opening the file in Numbers, which I have, and exporting works, but I want to automate this with a script, because there are from 10 to 30 files to process every time.

:slight_smile:

I’m using this file as a test, can you try it with this?

http://www.sendspace.com/file/hq55j9

Hello!

After some testing I found this to work!

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			if cellVal is not "" then
				if text 1 of cellVal as text is not "," then
					set rowStr to cellVal
				else
					set rowStr to "0" & cellVal
				end if
			else
				set rowStr to ""
			end if
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				if cellVal is not "" then
					if text 1 of cellVal is not "," then
						set rowStr to rowStr & "," & cellVal
					else
						set rowStr to rowStr & "," & "0" & cellVal
					end if
				else
					set rowStr to rowStr & ","
				end if
				
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell

Oh yes! It does work! :smiley:

However it still breaks the column at the comma… :slight_smile:

Hello.

As I suggested in a post above, you should really use semi colon, or colon as a field separator when you are using a “,” as the decimal separator.

I have changed the script to use semi-colon!

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			if cellVal is not "" then
				if text 1 of cellVal as text is not ";" then
					set rowStr to cellVal
				else
					set rowStr to "0" & cellVal
				end if
			else
				set rowStr to ""
			end if
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				if cellVal is not "" then
					if text 1 of cellVal is not ";" then
						set rowStr to rowStr & ";" & cellVal
					else
						set rowStr to rowStr & ";" & "0" & cellVal
					end if
				else
					set rowStr to rowStr & ";"
				end if
				
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell

Hello

Here is an edited version taking care of the local delimiter.


character 2 of (0.5 as text)
if result is "," then
	set delim to ";"
else
	set delim to ","
end if

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			if cellVal is not "" then
				if text 1 of cellVal as text is not delim then
					set rowStr to cellVal
				else
					set rowStr to "0" & cellVal
				end if
			else
				set rowStr to ""
			end if
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				if cellVal is not "" then
					if text 1 of cellVal is not delim then
						set rowStr to rowStr & delim & cellVal
					else
						set rowStr to rowStr & delim & "0" & cellVal
					end if
				else
					set rowStr to rowStr & delim
				end if
				
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell


Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 16:58:47

Hmm, tried McUsr’s latest version and the columns are merged into one :smiley:

Tried Yvan Koenig’s version and same column break after comma happens :smiley:

Does it happen to you too using this file?

http://www.sendspace.com/file/hq55j9

Hello!

This ought to work.

I’ll try it on the file.

tell application "Microsoft Excel"
	activate
	
	set outFile to (path of active workbook)
	set fileName to (name of active workbook)
	set outFile to (outFile & ":" & fileName & ".csv")
	try
		set openFile to open for access file outFile with write permission
		set eof openFile to 0
		
		set lastCol to count of columns of used range of active sheet
		set lastRow to count of rows of used range of active sheet
		
		repeat with r from 1 to lastRow
			set cellVal to (value of cell r of column 1 of active sheet) as text
			
			set rowStr to cellVal
			repeat with c from 2 to lastCol
				set cellVal to (value of cell r of column c of active sheet) as text
				set rowStr to rowStr & ";" & cellVal
			end repeat
			write rowStr & return to openFile as «class utf8»
		end repeat
		close access openFile
	on error e number n
		
		-- Chris Stone
		set {cr, sep} to {return, "------------------------------------------"}
		set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error 
		Number: " & n & cr & sep
		tell application "SystemUIServer"
			activate
			display dialog errmsg
		end tell
		set fail to true
	end try
end tell

It seems to work. But now you have to set FS=“;” when you use awk.

difference between excel 2008 and 2011 is opening:
2008 it opens as macroman
2011 import the csv into cell A1 and select the correct encoding, delimiter and column types so you’re sure all data is in your sheet.

When it shows wrong in excel it is probably the correct encoding, because utf-8 CSV files should be displaying special characters wrong. Set ‘opening character encoding’ of text edit’s preferences to utf-8 and open the CSV file there, if the characters are displayed correctly, the encoding is correct. Because you said that my script with CRLR (return followed with a linefeed) is the correct line terminator I assumed that the encoding you got is Windows. Not certain if it’s UTF-8 or not already but here a script where you can change the first two variable values to the correct encodings.

set fromEncoding to "CP1252"
set toEncoding to "UTF-8"

set iconvCommand to " | iconv -f " & fromEncoding & " -t " & toEncoding & space
set theFiles to choose file with multiple selections allowed
set fileName to text returned of (display dialog "Type file name:" default answer ".csv")
set saveFile to POSIX path of (((path to desktop) as text) & fileName)

do shell script "cat " & quoted form of POSIX path of item 1 of theFiles & iconvCommand & "> " & quoted form of saveFile
if (count of theFiles) = 1 then return
repeat with fileToAdd in theFiles
	set x to quoted form of POSIX path of fileToAdd
	do shell script "cat " & x & " | awk 'BEGIN {RS=\"\\r\\n\"; getline}{print $0}' " & iconvCommand & ">> " & quoted form of saveFile
end repeat

Now we assume that the encoding of the selected files is CP1252 (Windows latin 1) and that the output file, the merged CSV file, is UTF-8 encoded.

We’re working with CSV files and automate these actions for CMS systems of world leading WebShops and we know the drawbacks (read my post in previous posts). How about fields that needs text indicators (quoting), does the CMS support that as well? You don’t want that the lines shifts and that the wrong copy or price will be related to a product.

Hmm tried on the file and everything is merged into a column, and only breaks for the comma :smiley:

Does it happen to you?

No you won’t. print$0 prints the whole line without considering the FS at all

You may see what I get here thru :

https://www.box.com/s/qvve9u8zj0nhf9nfbeug

I’m puzzled by McUsr tests for the first value in a row.
I assume that he is trying to take care of a thousands separator but I’m not sure.

Yvan KOENIG (VALLAURIS, France) samedi 8 septembre 2012 17:14:50

DJ Bazzie Wazzie, I get

error "
iconv: (stdin):1:21: cannot convert" number 1

The fromEncoding should match the encoding of the file you selected. Then you won’t get an error, now it was on the first line character 21 that couldn’t be converter. probably a ‘é’ character or something.