Change file encoding of TextEdit by Applescript

Endianness, for UTF-16, is the order in which the bytes of a multi-byte number are stored. We humans combine digits too: after the number 9 we write the next number as 10, most significant digit first. That notation is big endian. For certain CPUs it can be faster to store numbers the other way around, so the number ten would be written as 01. In bytes, the number 261 is stored as 0x0105 in big endian notation and as 0x0501 in little endian. The best way to remember the difference is that big endian is similar to human notation.
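To make the byte order concrete, here is a small shell sketch (not from the original post) that splits a 16-bit number into its two bytes and prints them in both orders. The helper name `bytes16` is made up for this illustration.

```shell
#!/bin/sh
# Print the two bytes of a 16-bit number as hex, in the requested order.
# bytes16 is a hypothetical helper name, not an existing command.
bytes16() {
	n=$1
	hi=$(( (n >> 8) & 255 ))   # most significant byte
	lo=$(( n & 255 ))          # least significant byte
	case $2 in
		be) printf '0x%02x%02x\n' "$hi" "$lo" ;;   # big endian: high byte first
		le) printf '0x%02x%02x\n' "$lo" "$hi" ;;   # little endian: low byte first
	esac
}

bytes16 261 be   # prints 0x0105
bytes16 261 le   # prints 0x0501
```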

Most of the time, when the endianness is wrong you will see all sorts of East Asian (CJK) symbols when opening the file.

For example, Unicode character 100 (the letter d) is stored as 0x0064 or 0x6400. With the wrong endianness, character 100 is interpreted as character 25600 (U+6400), which is 搀.
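The arithmetic behind that example can be sketched in shell. Reading UTF-16 with the wrong endianness effectively swaps the two bytes of every 16-bit code unit; `swap16` is a hypothetical helper name used only here.

```shell
#!/bin/sh
# Swap the two bytes of a 16-bit value, which is what happens to each
# UTF-16 code unit when it is read with the wrong endianness.
# swap16 is a hypothetical helper name.
swap16() {
	n=$1
	echo $(( ((n & 255) << 8) | ((n >> 8) & 255) ))
}

swap16 100                          # 0x0064 read as 0x6400: prints 25600
printf 'U+%04X\n' "$(swap16 100)"   # prints U+6400
```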

DJ Bazzie Wazzie,

Would it be possible for you to write a script that converts a .txt file from UTF-8 into UTF-16 (big and/or little endian)?

I would just try to import a txt file with a UTF-16, big or little endian, and see if it works in FileMaker…

:slight_smile:

I am surprised that bare utf-16 is little endian. :slight_smile: And I’d love to read your paper, if you care to share.

@ToBeJazz:

Try DJ Bazzie Wazzie’s shell commands in post 7, then import the result into FileMaker manually and see if it gives the results you need. If you get Japanese, Chinese or Korean characters as a result, then you have the wrong endianness! :slight_smile:
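Before importing, it can help to check which byte order mark (if any) a file actually starts with. The following is a minimal sketch, not something posted in the thread; `bom_of` is a hypothetical helper name, and it only looks at the first three bytes.

```shell
#!/bin/sh
# Report which byte order mark, if any, a file starts with.
# bom_of is a hypothetical helper name.
bom_of() {
	# Hex of the first three bytes, with od's spacing stripped.
	first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
	case $first in
		efbbbf*) echo "UTF-8 BOM" ;;
		feff*)   echo "UTF-16 big endian BOM" ;;
		fffe*)   echo "UTF-16 little endian BOM" ;;
		*)       echo "no BOM" ;;
	esac
}

# Demo on a temporary file: FE FF followed by "d" as UTF-16BE.
tmp=$(mktemp)
printf '\376\377\000d' > "$tmp"
bom_of "$tmp"
rm -f "$tmp"
```

If the BOM says big endian but the importing application shows CJK characters anyway, the text after the BOM was probably written in the other byte order.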

set theFile to POSIX path of (choose file)
set encoding to choose from list {"UTF-16 Big Endian", "UTF-16 Big Endian + Bom", "UTF-16 Little Endian", "UTF-16 Little Endian + Bom"}

if encoding is false then
	return -- nothing selected or Cancel pressed
else
	set encoding to encoding as string
end if

set newFileName to (do shell script "str=" & quoted form of theFile & ";echo ${str%.*}") & "_iconv.txt"

if encoding contains "Big Endian" then
	set enc to "UTF-16BE"
	set cmd to ""
	if encoding contains "+ Bom" then set cmd to "xxd -p -r <<< xfeff "
	do shell script cmd & " > " & quoted form of newFileName
else
	set enc to "UTF-16LE"
	set cmd to ""
	if encoding contains "+ Bom" then set cmd to "xxd -p -r <<< xfffe"
	do shell script cmd & "> " & quoted form of newFileName
end if

do shell script "iconv -f UTF-8 -t " & enc & space & quoted form of theFile & " >> " & quoted form of newFileName

Hey, that’s a great script DJ Bazzie Wazzie - thanks a lot!

When I choose “UTF-16 Big Endian + Bom” it does just what I’m looking for.
Let’s see if I can make a script that has no encoding options and no choose file option as well - I want a specific script without any options to be run from FileMaker so that I can import a text file that was originally in UTF-8.

You’re welcome!

Great! And sad at the same time. It’s the only encoding that iconv and the Cocoa text system can’t write to a file.

I’ve changed my first post (cleaned up the mess I’d made) and that example code should help you. I also noticed that echo -e -n doesn’t work quite as well as in the terminal. I haven’t figured out what exactly goes wrong there, but xxd does its job very well, in Terminal as well as in do shell script.

Sorry I can’t use your short example script, don’t know how to change it…
I did begin to shorten your longer script though:

set theFile to POSIX path of (choose file)
set encoding to "UTF-16 Big Endian + Bom" as string
set newFileName to (do shell script "str=" & quoted form of theFile & ";echo ${str%.*}") & "_iconv.txt"

set enc to "UTF-16BE"
set cmd to "xxd -p -r <<< xfeff "
do shell script cmd & " > " & quoted form of newFileName

do shell script "iconv -f UTF-8 -t " & enc & space & quoted form of theFile & " >> " & quoted form of newFileName

I don’t really know what you mean by “iconv and cocoa text system can’t write to a file.” I have no problem using the above script; it does what it is supposed to do.

Sorry for the confusion here… The script is working, but I’m giving iconv a head start because it can’t write the BOM on its own. With the Cocoa text system, including TextEdit, it’s impossible to save the file properly.

You mean something like this?

set theFile to POSIX path of (choose file)
set newFileName to (do shell script "str=" & quoted form of theFile & ";echo ${str%.*}") & "_iconv.txt"

do shell script "xxd -p -r <<< xfeff > " & quoted form of newFileName
do shell script "iconv -f UTF-8 -t UTF-16BE " & space & quoted form of theFile & " >> " & quoted form of newFileName

Yes, that’s short and nice I think.
Next for me is to get rid of the choose file thing and point directly to a file, but at least that I should be able to do myself:)

IMHO The people behind FileMaker should receive a copy of this thread.

And the fine print regarding do shell script seems to be lacking too. It is obviously interpreting stuff, and it would have been nice if they had specified exactly how input and output from the do shell script command are treated/translated, because there isn’t much we can do about it. I mean, stty settings don’t work when you don’t have a terminal…

I read this topic looking for a solution, but it doesn’t work for me: if the source file is UTF-8 without a BOM, the converted file comes out with errors.
In my case I need to import a .csv file into Excel, but some characters are imported as ISO-8859-1, so the solution is to convert the file to UTF-16LE with a BOM.

I tried adding a BOM to the UTF-8 file first and then converting it to UTF-16 with a BOM, and it works, but that takes two steps and two converted files, and I don’t enjoy it.
Then I found a solution that works for me, so I’d like to share my experience:

In Terminal I found a similar command called “uconv”, but it’s not directly available in the do shell script shell (command not found error), so I have to give its full path:

on run {input, parameters}
	
	set theFile to POSIX path of input --source file
	set endFileName to (do shell script "str=" & quoted form of theFile & ";echo ${str%.*}") & "_b.csv" --temp file
	
	do shell script "/opt/local/bin/uconv -s -f UTF-8 -t UTF-16LE --add-signature < " & quoted form of theFile & " > " & quoted form of endFileName --uconv silent from utf-8 to utf-16 little endian with bom from source file to temp file
	do shell script "mv " & quoted form of endFileName & space & quoted form of theFile --replace source file by temp file

	return input
end run

This code takes the file from the input, converts it from UTF-8 to UTF-16 little endian with a BOM (--add-signature does that) and replaces the source file with the new one.
Use man uconv to read about it; uconv --list lists the available encodings.

Hi BullyBu. Welcome to MacScripter and thanks for posting your own solution to this topic.

There’s no “/opt” folder on my machine. Were it and its contents installed on yours by some third-party software?

I didn’t find uconv on my machine.
The job can be done with iconv.

set theFile to POSIX path of (choose file)

set newFileName to (do shell script "str=" & quoted form of theFile & ";echo ${str%.*}") & "_iconv.txt"

set enc to "UTF-16BE"
set cmd to "xxd -p -r <<< xfeff "
do shell script cmd & " > " & quoted form of newFileName # write the BOM : FE FF in the new file

do shell script "iconv -f UTF-8 -t " & enc & space & quoted form of theFile & " >> " & quoted form of newFileName # write the UTF16-BE encoded text after the BOM

I tried to play with ASObjC but I’m puzzled.

In Xcode Help I read:
NSUTF16BigEndianStringEncoding
NSUTF16StringEncoding encoding with explicit endianness specified.
My understanding was that with this encoding I would get a file with the big-endian BOM at the beginning.
Alas, I was wrong.

The code used (most of which was borrowed from Shane STANLEY) is:

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

on modifyPath:thePath adding:addString
	set pathString to current application's NSString's stringWithString:thePath
	set theExtension to pathString's pathExtension()
	set thePathNoExt to pathString's stringByDeletingPathExtension()
	set newPath to (thePathNoExt's stringByAppendingString:addString)
	if theExtension's |length|() > 0 then
		set newPath to newPath's stringByAppendingPathExtension:theExtension
	end if
	return newPath as string
end modifyPath:adding:

on decodeFile:thePath
	set theString to current application's NSString's stringWithContentsOfFile:thePath encoding:(current application's NSISOLatin1StringEncoding) |error|:(missing value)
	set newPath to my modifyPath:thePath adding:"-new"
	set theResult to theString's writeToFile:newPath atomically:true encoding:(current application's NSUTF16BigEndianStringEncoding) |error|:(missing value)
	return theResult as boolean
end decodeFile:

set theSource to (choose file)
my decodeFile:(POSIX path of theSource)

Is there something wrong in it, or am I misunderstanding what applying NSUTF16BigEndianStringEncoding is supposed to do?

Yvan KOENIG running Sierra 10.12.3 in French (VALLAURIS, France) mardi 14 mars 2017 15:56:27

Hi Yvan.

I think “with explicit endianness specified” is just an explanation that the enum NSUTF16BigEndianStringEncoding is used to specify explicitly that the text is to be saved with UTF-16 big-endian encoding, not with the endianness native to the machine. It’s an explicit instruction to writeToFile rather than an instruction to include an explicit BOM in the file. Maybe Shane will confirm this when he gets up.

According to the Xcode documentation, this enum was only introduced with macOS 10.12, but it works on my 10.11 system.
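For what it’s worth, iconv shows the same distinction. This is only an analogy from the command line, not a statement about how Foundation works internally: asking iconv for an explicit byte order produces bare code units with no BOM in front.

```shell
#!/bin/sh
# Encode the single letter "A" with an explicit byte order and dump the bytes.
# With UTF-16BE / UTF-16LE named explicitly, iconv writes no BOM.
be=$(printf 'A' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')
le=$(printf 'A' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

echo "UTF-16BE: $be"   # 0041, no feff in front
echo "UTF-16LE: $le"   # 4100, no fffe in front
```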

The iconv command line utility works with stdin and stdout, meaning you can pipe directly from one encoding to another without creating additional temporary files.

Great to see other solutions even with third party command line utils :cool: I’m still just curious what went wrong with the code above in post #12.
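The stdin/stdout point can be sketched as a small filter that emits the BOM and the converted text in one stream, here for UTF-16LE as needed for the Excel case. The function name `to_utf16le_bom` is hypothetical, and the sketch assumes the input really is valid UTF-8.

```shell
#!/bin/sh
# Convert UTF-8 on stdin to UTF-16LE with a BOM on stdout.
# to_utf16le_bom is a hypothetical helper name.
to_utf16le_bom() {
	printf '\377\376'            # BOM for little endian: FF FE
	iconv -f UTF-8 -t UTF-16LE   # the converted text follows the BOM
}

# Demo: "hé" plus a newline (é given as its UTF-8 bytes C3 A9).
printf 'h\303\251\n' | to_utf16le_bom | od -An -tx1
```

With real files this would be used as `to_utf16le_bom < input.csv > output.csv` (both file names are placeholders). The output still has to go to a different file than the input, because the shell truncates the output file before iconv starts reading.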

Thanks Nigel.

I tried to use an awful scheme to insert the BOM.

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

on modifyPath:thePath adding:addString
	set pathString to current application's NSString's stringWithString:thePath
	set theExtension to pathString's pathExtension()
	set thePathNoExt to pathString's stringByDeletingPathExtension()
	set newPath to (thePathNoExt's stringByAppendingString:addString)
	if theExtension's |length|() > 0 then
		set newPath to newPath's stringByAppendingPathExtension:theExtension
	end if
	return newPath as string
end modifyPath:adding:

on decodeFile:thePath
	set theString to (current application's NSString's stringWithString:" ")
	set moreString to current application's NSString's stringWithContentsOfFile:thePath encoding:(current application's NSISOLatin1StringEncoding) |error|:(missing value)
	set theString to theString's stringByAppendingString:moreString
	set newPath to my modifyPath:thePath adding:"-new"
	set theResult to theString's writeToFile:newPath atomically:true encoding:(current application's NSUTF16BigEndianStringEncoding) |error|:(missing value)
	return {newPath, theResult as boolean}
	
end decodeFile:

set theSource to (choose file)
set {newPath, bof} to my decodeFile:(POSIX path of theSource)
set newPath to newPath as «class furl»
set openFile to open for access newPath with write permission
write «data rdatFEFF» to openFile starting at 0
close access openFile

TextWrangler and BBEdit open the resulting file flawlessly but alas, TextEdit crashes.
If I open it with TextWrangler and then save it under another name, the newly saved file opens flawlessly in TextEdit.
Puzzling, isn’t it?

Yvan KOENIG running Sierra 10.12.3 in French (VALLAURIS, France) mardi 14 mars 2017 17:51:41

That’s right.

Actually, I just noticed this in Wikipedia, FWIW:

The documentation is wrong (it says the same thing about NSASCIIStringEncoding :mad:). I believe it was introduced in 10.4.

Mmm… :confused: How about this?

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

on modifyPath:thePath adding:addString
	set pathString to current application's NSString's stringWithString:thePath
	set theExtension to pathString's pathExtension()
	set thePathNoExt to pathString's stringByDeletingPathExtension()
	set newPath to (thePathNoExt's stringByAppendingString:addString)'s stringByAppendingPathExtension:theExtension
	return newPath as string
end modifyPath:adding:

on decodeFile:thePath
	-- Get the BOM value as a two-character string. (The single character id (254 * 256 + 255) gets lost in the conversion to NSString.)
	set theUTF16BEBOM to current application's NSString's stringWithString:(string id {254, 255})
	-- Convert it to two bytes of data.
	set theData to (theUTF16BEBOM's dataUsingEncoding:(current application's NSISOLatin1StringEncoding))'s mutableCopy()
	-- Read the contents of the ISO Latin 1 text file.
	set theString to current application's NSString's stringWithContentsOfFile:thePath encoding:(current application's NSISOLatin1StringEncoding) |error|:(missing value)
	-- Convert that to data too, but encoded as UTF-16 big-endian, and append it to the BOM data.
	tell theData to appendData:(theString's dataUsingEncoding:(current application's NSUTF16BigEndianStringEncoding))
	-- Write the lot to a new file.
	set newPath to my modifyPath:thePath adding:"-new"
	set theResult to theData's writeToFile:newPath atomically:true
	
	return {newPath, theResult as boolean}
end decodeFile:

set theSource to (choose file)
set {newPath, bof} to my decodeFile:(POSIX path of theSource)

Nice :slight_smile:

An alternative that might work would be to create the BOM as a zero width no-break space. Unfortunately this only works in 10.11 and above:

set theString to current application's NSString's stringWithString:"\\N{ZERO WIDTH NO-BREAK SPACE}"
set theString to theString's stringByApplyingTransform:(current application's NSStringTransformToUnicodeName) |reverse|:true

You could then append the contents of the file to that string, and save. I don’t have a suitable sample to test it.

Thank you. :slight_smile: And for your previous reply.

That works for me if I convert both to data before appending them, as in my version above. But if I append them as NSStrings and save the result, the resulting file crashes anything that tries to open or read it. (Well, TextEdit and a ‘read (choose file) as Unicode text’ script, anyway.) Sounds similar to what Yvan was getting with his version. I’ll have another look in the morning (GMT).