The “textutil” utility is known to be good at converting text files from one encoding to another. The general syntax for such a conversion is as follows:
do shell script "/usr/bin/textutil -convert txt -encoding UTF-16 -inputencoding windows-1256 " & quoted form of POSIX path of (choose file)
The man page of this utility says that if you do not specify the “-inputencoding” option, the command determines the encoding of the source text file by itself, from its BOM:
do shell script "/usr/bin/textutil -convert txt -encoding UTF-16 " & quoted form of POSIX path of (choose file)
NSString objects have fastestEncoding and smallestEncoding properties. There is also the method stringWithContentsOfFile:usedEncoding:error:, which is probably what textutil uses (no guarantees, though).
by default, a file’s encoding will be determined from its BOM
which is … interesting. It simply means (I think) that if there’s a BOM in the file, it is used to determine the encoding (for which no one would need a CLI tool like that). The interesting question remains: What happens if there is no BOM?
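To see for yourself whether a given file carries a BOM at all, something like the following might help. It is only a rough sketch that compares the first bytes of the file against the well-known Unicode BOM signatures; the function name bom_encoding is made up for this post, and it says nothing about what textutil does internally.

```shell
# bom_encoding FILE: print the encoding implied by a leading BOM, or "none".
# Hypothetical helper for illustration; it only looks at the first four bytes.
bom_encoding() {
    # Turn the first four bytes into one lowercase hex string, e.g. "efbbbf41".
    head_hex=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$head_hex" in
        efbbbf*)  echo "UTF-8" ;;
        fffe0000) echo "UTF-32LE" ;;  # checked before UTF-16LE, which shares the prefix
        0000feff) echo "UTF-32BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        feff*)    echo "UTF-16BE" ;;
        *)        echo "none" ;;
    esac
}
```

Call it as bom_encoding /path/to/file.txt. One caveat: the bytes FF FE 00 00 are ambiguous (UTF-32LE, or UTF-16LE text that starts with a NUL character), and the sketch simply assumes UTF-32LE in that case. If it prints "none", the BOM rule from the man page cannot be what decides the encoding for that file.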
I just saved a short text file from TextEdit, once as UTF-8 and once as UTF-16. Neither variant contained a BOM, and textutil -info didn’t say anything about the encoding. I don’t know whether -info reports the encoding when there is a BOM, though.
I also tried with an HTML file that has a meta charset=utf8 element, and textutil -info didn’t say anything about the encoding there either. Yet it apparently does detect the encoding (or just assumes it?), since the conversion to Latin-1 works as expected, including a new charset attribute.
Why am I interested in this? Because it seems to me that textutil uses some kind of “cunning” algorithm to determine the file encoding, similar to the Notepad application on Windows systems.
I would like to know whether this utility always converts a file successfully when the source encoding is not specified. Has anyone come across a case where textutil failed to transcode a file correctly?
If textutil, like Notepad, always recognizes the encoding correctly, then it would be ideal to be able to obtain the encoding it recognized.
I suspect -initWithURL:options:documentAttributes:error: from NSAttributedString is more likely, since that would also let it output other formats. Back in the days when the code for TextEdit was published, it used something like that too.