Textutil can determine .txt encoding automatically?

Hello,

The “textutil” utility is good at converting text files from one encoding to another. The general syntax for converting with an explicit input encoding is as follows:
 

do shell script "/usr/bin/textutil -convert txt -encoding UTF-16 -inputencoding windows-1256 " & quoted form of POSIX path of (choose file)

 
The man page of this utility says that if you do not specify the “-inputencoding” option, the command will determine the encoding of the source text file itself, from its BOM:
 

do shell script "/usr/bin/textutil -convert txt -encoding UTF-16 " & quoted form of POSIX path of (choose file)

 

And then there’s iconv, the classic Unix tool.
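For comparison, a minimal iconv sketch of the same conversion the textutil command above performs (the file names are just placeholder examples, and the sample file here is plain ASCII, which is valid Windows-1256):

```shell
# Create a small sample file; ASCII bytes are valid Windows-1256.
printf 'hello\n' > sample.txt

# Convert it from Windows-1256 to UTF-16 (iconv writes a BOM for "UTF-16").
iconv -f WINDOWS-1256 -t UTF-16 sample.txt > sample-utf16.txt

# "iconv -l" lists every encoding name the installed iconv understands.
```

Note that iconv, unlike textutil, has no automatic detection at all: you must always name the source encoding with -f.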

Thank you,

I know that utility too, but I am interested in automatic (programmatic) detection of the original text encoding.

  1. Does textutil determine the source encoding RELIABLY?

  2. Is it possible to get the result of that detection from textutil in a script, i.e. the detected SOURCE ENCODING itself?

There are fastestEncoding and smallestEncoding properties on NSString objects. And there is the method
stringWithContentsOfFile:usedEncoding:error:
which is probably what textutil uses (no guarantees, though).

Perhaps this helps: Reading Strings From and Writing Strings To Files and URLs

And the man page of textutil has this to say:

by default, a file’s encoding will be determined from its BOM

which is … interesting. It simply means (I think) that if there’s a BOM in the file, it is used to determine the encoding (for which no one would need a CLI tool like that). The interesting question remains: What happens if there is no BOM?
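A BOM is nothing more than a fixed byte prefix (FF FE for UTF-16LE, FE FF for UTF-16BE, EF BB BF for UTF-8), so any tool that reads the first bytes can use it. Since textutil is macOS-only, here is a sketch of the same principle using the generic file(1) tool instead (not textutil's actual algorithm, just an illustration of what a BOM gives a detector to work with):

```shell
# UTF-16LE "hi\n" with a leading FF FE BOM, written via octal escapes.
printf '\377\376h\000i\000\n\000' > bom.txt
file --brief --mime-encoding bom.txt    # the BOM identifies it as utf-16le

# The same text without a BOM: pure ASCII bytes, nothing to distinguish
# it from any ASCII-compatible encoding.
printf 'hi\n' > nobom.txt
file --brief --mime-encoding nobom.txt  # only byte statistics to go on
```

Without a BOM, any detector is down to heuristics over the byte distribution, which is exactly the interesting case.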

I just saved a short text file from TextEdit, once as UTF-8 and once as UTF-16. Neither variant contained a BOM, and textutil -info didn’t say anything about the encoding. I don’t know whether -info reports the encoding when a BOM is present, though.

I also tried with an HTML file that has a meta charset=utf8 element, and textutil -info didn’t say anything about the encoding there either. Yet it apparently does detect it (or just assumes it?), since the conversion to Latin-1 works as expected, including a new charset attribute.


Why am I interested in this? Because it seems to me that textutil uses some kind of “cunning” algorithm to determine the file encoding, similar to Notepad on Windows systems.

I would be interested to know whether this utility always transcodes a file successfully when the source encoding is left unspecified. Has anyone come across a case where textutil failed to transcode a file correctly?

If textutil, like Notepad, always recognizes the encoding correctly, then it would be ideal to be able to obtain that recognized encoding.

I suspect -initWithURL:options:documentAttributes:error: from NSAttributedString is more likely; that would also let it output other formats. Back when the code for TextEdit was published, it used something like that too.


I wrote an automatic text-encoding-detection AppleScript 5 years ago.
It works for Japanese text files.

http://piyocast.com/as/archives/1014


With respect to command line tools, the text encoding of a file can be obtained like so:

file --brief --mime-encoding -- <path>
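This makes the pipeline the thread is asking for possible, at least as a substitute for textutil's internal detection: capture file's guess in a variable and hand it straight to iconv. A minimal sketch (file names are placeholders, and file's heuristic is not guaranteed to match whatever textutil does):

```shell
# A UTF-8 sample file: "café" with an accented e (C3 A9 in octal escapes).
printf 'caf\303\251\n' > in.txt

# Detect the encoding, then convert using the detected name.
enc=$(file --brief --mime-encoding in.txt)   # e.g. "utf-8"
iconv -f "$enc" -t UTF-16 in.txt > out.txt
```

The MIME encoding names file prints (utf-8, iso-8859-1, utf-16le, …) are accepted by iconv, so the two tools chain cleanly.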

Or simply use the file command with a capital “I”:

file -I <path>
/Users/apple/Desktop/myFile.rtf: text/rtf; charset=us-ascii