text encoding problem?

regulus6633 · May 25, 2008, 10:30am

I have some text that gets read into AppleScript and it is causing me problems. I believe it’s in a different text encoding, but I’m not sure. Here’s what happens…

If you run this script

set a to "you"
set b to text items of a

result → {“y”, “o”, “u”}

That result is expected. But this particular text I’m talking about looks like the word “you” on the screen but when I run that applescript on it I get this…
result → {“y”, “”, “o”, “”, “u”}

There’s an extra character between each letter. It must be like a null character. What’s the fastest way I can convert this to something applescript can handle? I have tons of text to handle and not all of the text is like this, so it would be a real drag on my program if I have to check every piece of text when probably only about 5% will be this kind.

I can do it the hard way and use text item delimiters to get rid of the characters, but I don’t want to have to check every piece of text. It will really slow me down.

StefanK · May 25, 2008, 10:41am

Hi,

the text with the “extra bytes” is probably Unicode text (UTF-16), which uses at least two bytes per character,
while your script expects MacRoman or Windows-Latin (one byte per character).
There is no general solution, because I don’t know where the text comes from,
and it depends on the system version. Unlike all pre-Leopard versions, AppleScript in Leopard is completely Unicode-based.

If you’re using the read command in your script, try

read file "Path:to:file.txt" as Unicode text
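
To illustrate the two-bytes-per-character point, here is a small Python sketch (not part of the original reply; the little-endian byte order is an assumption chosen only for the example). It shows how plain ASCII text stored as UTF-16 carries a zero byte next to every letter, which is what appears as the empty-looking items when the bytes are treated as one-byte characters.

```python
# Encode the word from the question as little-endian UTF-16 (no BOM).
# The byte order is an assumption for illustration only.
raw = "you".encode("utf-16-le")

# Each ASCII letter becomes two bytes: the letter and a zero byte.
print(list(raw))  # [121, 0, 111, 0, 117, 0]

# Misreading those bytes as a one-byte encoding keeps the zero bytes
# as invisible NUL characters between the letters.
misread = raw.decode("latin-1")
print(list(misread))  # ['y', '\x00', 'o', '\x00', 'u', '\x00']

# Decoding with the matching encoding gives the clean text back.
decoded = raw.decode("utf-16-le")
print(decoded)  # you
```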

regulus6633 · May 25, 2008, 10:55am

Thanks for the input. If you say there’s no set way to convert it then I believe you, because I’ve seen you answer other text encoding issues before. That’s too bad. I had tried reading as Unicode text and it didn’t work. I guess my only recourse is to use text item delimiters on each piece of text. Man, that really hurts. For a problem that only affects a few items, it’s going to drag everything down.

Here’s my solution. If you see any improvements let me know. I was able to copy the “” character so I can use it to remove it with the following script. Note: the first “” in my script is actually the copied character, not a blank as usual. And also note that I couldn’t put the actual character in this script because it caused problems when I submitted this post… I had to remove it!

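on convertCharEncoding(theText)
	-- Script object properties give fast access to large values
	script g
		property b : missing value
		property ti : missing value
	end script
	set g's b to theText
	-- This "" stands for the copied problem character (see the note above), not an empty string
	set text item delimiters to ""
	-- Split the text at every occurrence of that character
	set g's ti to text items of g's b
	-- This "" really is empty: rejoin the pieces with nothing between them
	set text item delimiters to ""
	return g's ti as text
end convertCharEncoding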

StefanK · May 25, 2008, 11:47am

I didn’t say there is no way to convert it.
The shell provides wonderful methods to convert text,
but first you need to know what kind of text encoding your text has.

regulus6633 · May 25, 2008, 11:50am

There’s no code that I can see in the text document to tell me. Is there some way I can find out? And do you think a shell command would be faster than my text item delimiters?

StefanK · May 25, 2008, 11:58am

If it is a text file, type file -b in Terminal (with a space at the end) and drag the file into the Terminal window.
Press Return. Maybe the encoding will be displayed.

I don’t know if a shell command would be faster than your code, but the text could be converted while being read.

regulus6633 · May 25, 2008, 12:51pm

It’s not a text file, it’s an applescript. I’m using osadecompile to read the code as text. I released an application today at MacUpdate and other places using this technology. You can see it here if you’re interested…
http://www.hamsoftengineering.com/products/scriptlight/scriptlight.html

Anyway, while writing my app I found a couple applescripts within the scripts Apple supplies us in /Library/Scripts have this problem. One I found is located at /Library/Scripts/Address Book Scripts/Import Addresses.scpt. I checked a few computers and not all of them have this problem with this script… which is strange but a few did. osadecompile returns the text to me in either case but the bad scripts affect me when I use the text. For now I’m using the method I described to work with the problem but a better method would be great.

So I tried your file -b command on a script and it shows up as data. Any other ideas now that you know more about what I’m doing?

StefanK · May 25, 2008, 1:30pm

I decompiled your mentioned Address Book script into a text file and opened it with a Hex Editor.
Very strange, there are two mixed encodings: the block comments are UTF-16, the actual code is in an 8-bit encoding.

I guess your way of filtering out the null bytes is probably the best.
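
As a rough illustration of why filtering works on such mixed text, here is a minimal Python sketch. It is not taken from the thread: the sample comment and code line are made up, and the big-endian byte order is an assumption. It is only safe when the 16-bit parts contain nothing but ASCII characters, because any other UTF-16 character would be damaged by dropping bytes.

```python
def strip_nulls(data: bytes) -> bytes:
    """Remove every NUL byte from data.

    Only safe when the UTF-16 parts are plain ASCII; other characters
    would lose or keep bytes in ways that corrupt them.
    """
    return data.replace(b"\x00", b"")


# Hypothetical sample: a UTF-16 block comment followed by 8-bit code.
comment = "(* Import Addresses *)".encode("utf-16-be")
code = b'\ntell application "Address Book" to activate\n'
mixed = comment + code

cleaned = strip_nulls(mixed)
print(cleaned.decode("ascii"))
```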

regulus6633 · May 25, 2008, 1:48pm

Thanks for looking at that.

That’s funny about only the comments being UTF-16. Probably somebody pasted the comments in from a different source other than Script Editor. That problem gave me headaches. I was messing with my code when I stumbled on it and so at first I thought I had caused the problem. It took a long time before I realized something else was going on. And I’m glad you were able to see it. At least I know I’m not crazy. On a fast computer that code I’m using is hardly noticeable, so I’m not in too bad shape with it.

Thanks again.

Craig_Williams · May 30, 2008, 6:52pm

Try this on some problem scripts and see if it works.
I had success on the Address Book script.

set the_path to POSIX path of (path to desktop as Unicode text) & "test.scpt"

set y to do shell script "osadecompile " & quoted form of the_path & " | /usr/bin/ruby -e 'puts STDIN.read.gsub(/\\000/, \"\")'"

Cheers,

Craig

Adam_Bell · May 30, 2008, 7:04pm

As things now stand, we are forced to deal with UTF-8 (with and without BOM), UTF-16 (again with and without BOM), and ASCII (7-bit and 8-bit). The first person who writes a “detector/converter” will do us all a service.
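
A very rough idea of what such a “detector/converter” could look like, sketched in Python. Everything here is an assumption for illustration: the function names `guess_encoding` and `to_text` are invented, the one-third threshold for zero bytes is arbitrary, and MacRoman is simply assumed as the 8-bit fallback. It checks for a BOM first, then guesses. It does not handle text that mixes encodings, like the script discussed earlier in this thread.

```python
import codecs

# Byte-order marks to look for, with the codec that consumes each one.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Guess a codec name for data (heuristic, can be wrong)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM: many zero bytes suggest UTF-16 text that is mostly ASCII.
    if data and data.count(b"\x00") >= len(data) // 3:
        even_zeros = data[0::2].count(b"\x00")
        odd_zeros = data[1::2].count(b"\x00")
        # Zero bytes at even offsets mean the high byte comes first.
        return "utf-16-be" if even_zeros >= odd_zeros else "utf-16-le"
    try:
        data.decode("utf-8")
        return "utf-8"  # also covers 7-bit ASCII
    except UnicodeDecodeError:
        return "mac_roman"  # assumed 8-bit fallback


def to_text(data: bytes) -> str:
    """Decode data using the guessed encoding."""
    return data.decode(guess_encoding(data))


print(to_text("you".encode("utf-16-le")))
print(to_text(codecs.BOM_UTF8 + b"you"))
```

A guess like this can only ever be a heuristic, since bytes without a BOM do not say which encoding produced them.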

regulus6633 · May 30, 2008, 8:36pm

Hi Craig, thanks. I tried it and it works. So then I decided to try a speed test to see if it was faster than my method. I did a “lotsa” test of 200 iterations on each method. Here are the results:

200 iterations using that address book script as the sample script
Using osadecompile and ruby: 68.702 secs
Using osadecompile and my subroutine: 72.656 secs

So about a 4 second difference on 200 scripts. Normally I would use your method considering the results, but I’m not good with ruby so I’m afraid I couldn’t troubleshoot and adjust it if I had problems. Therefore since the tests were relatively close, and considering I understand my code, I think I’ll stick to my method for now. If there was a bigger difference in the time then I might change.

But I can see you’re learning ruby like you said! Did you run into any text encoding issues on your stuff?

==Update==
I just ran the osadecompile process 200 times on that script by itself without any conversion and the time was: 67.948 secs
So obviously osadecompile contributes most of the time. Also we can conclude that the ruby call contributes almost nothing to the time. My code adds a small hit to each script. But in this case it’s the combined time that matters to me so I’ll accept the small time penalty for using my code because I can understand and work with it.
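
For reference, the per-script overhead implied by the three timings above can be worked out with a few lines of Python. The numbers are the ones reported in this post; the variable and function names are made up for the example.

```python
iterations = 200

# Total times in seconds, as reported above.
baseline = 67.948      # osadecompile alone
with_ruby = 68.702     # osadecompile piped through ruby
with_handler = 72.656  # osadecompile plus the AppleScript handler


def overhead_ms(total: float) -> float:
    """Extra time per script, in milliseconds, beyond osadecompile itself."""
    return (total - baseline) / iterations * 1000


ruby_ms = overhead_ms(with_ruby)
handler_ms = overhead_ms(with_handler)
print(f"ruby: {ruby_ms:.2f} ms per script")
print(f"handler: {handler_ms:.2f} ms per script")

# Share of the handler run that is spent in osadecompile itself.
share = baseline / with_handler * 100
print(f"osadecompile share: {share:.1f}%")
```

That comes to roughly 4 ms per script for the ruby filter and roughly 24 ms per script for the handler, with osadecompile itself taking about 93% of the total time.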