I have some text that gets read into applescript and it is causing me problems. I believe it’s in a different text encoding but I’m not sure. Here’s what happens…
If you run this script
set a to "you"
set b to text items of a
result → {“y”, “o”, “u”}
That result is expected. But this particular text I’m talking about looks like the word “you” on the screen but when I run that applescript on it I get this…
result → {“y”, “”, “o”, “”, “u”}
There’s an extra character between each letter. It must be like a null character. What’s the fastest way I can convert this to something applescript can handle? I have tons of text to handle and not all of the text is like this, so it would be a real drag on my program if I have to check every piece of text when probably only about 5% will be this kind.
I can do it the hard way and use text item delimiters to get rid of the characters, but I don’t want to have to check every piece of text. It will really slow me down.
Hi,
the text with the “extra bytes” is probably Unicode text (UTF-16), which uses two bytes per character,
while your script expects MacRoman or Windows-Latin (one byte per character).
There is no general solution, because I don’t know where the text comes from,
and it depends on the system version. Unlike all pre-Leopard versions, AppleScript in Leopard is completely Unicode based.
If you’re using the read command in your script, try
read file "Path:to:file.txt" as Unicode text
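To make the "two bytes per character" point concrete, here is a minimal sketch in Python (not AppleScript, and not part of the original script). It assumes the text is big-endian UTF-16 without a BOM, which is only a guess about the problem text:

```python
# Illustration only: how UTF-16 text looks when it is read one byte at a time.
text = "you"

# Assumption: big-endian UTF-16 with no byte order mark.
utf16_bytes = text.encode("utf-16-be")
print(list(utf16_bytes))  # each letter is preceded by a zero byte

# Misreading those bytes with a one-byte encoding puts a NUL before each letter,
# which is what shows up as the "extra character" in the list of text items.
misread = utf16_bytes.decode("latin-1")
print([ch for ch in misread])

# Decoding with the matching encoding recovers the original text.
print(utf16_bytes.decode("utf-16-be"))
```

If the bytes were little-endian instead, the zero byte would follow each letter rather than precede it.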
Thanks for the input. If you say there’s no set way to convert it then I believe you, because I’ve seen you answer other text encoding issues before. That’s too bad. I had tried reading as Unicode text and it didn’t work. I guess my only recourse is to use text item delimiters on each piece of text. Man, that really hurts. For a problem that only affects a few items, it’s going to drag everything down.
Here’s my solution. If you see any improvements let me know. I was able to copy the “” character so I can use it to remove it with the following script. Note: the first “” in my script is actually the copied character, not a blank as usual. And also note that I couldn’t put the actual character in this script because it caused problems when I submitted this post… I had to remove it!
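on convertCharEncoding(theText)
    -- Script object whose properties hold the text and its items
    script g
        property b : missing value
        property ti : missing value
    end script
    set g's b to theText
    -- Split on the copied invisible character (shown here as "" because the forum stripped it)
    set text item delimiters to ""
    set g's ti to text items of g's b
    -- Rejoin the pieces with a truly empty delimiter
    set text item delimiters to ""
    return g's ti as text
end convertCharEncoding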
I didn’t say there is no way to convert it.
The shell provides wonderful methods to convert text,
but first you should know what kind of text encoding your text has.
There’s no code that I can see in the text document to tell me. Is there some way I can find out? And do you think a shell command would be faster than my text item delimiters?
If it is a text file, type file -b in Terminal (with a space at the end) and drag the file into the Terminal window.
Press return. Maybe the encoding will be displayed.
I don’t know if a shell command would be faster than your code, but the text could be converted while being read.
It’s not a text file, it’s an applescript. I’m using osadecompile to read the code as text. I released an application today at MacUpdate and other places using this technology. You can see it here if you’re interested…
http://www.hamsoftengineering.com/products/scriptlight/scriptlight.html
Anyway, while writing my app I found that a couple of the AppleScripts Apple supplies in /Library/Scripts have this problem. One I found is located at /Library/Scripts/Address Book Scripts/Import Addresses.scpt. I checked a few computers and not all of them have this problem with this script, which is strange, but a few did. osadecompile returns the text to me in either case, but the bad scripts affect me when I use the text. For now I’m using the method I described to work around the problem, but a better method would be great.
So I tried your file -b command on a script and it shows up as data. Any other ideas now that you know more about what I’m doing?
I decompiled your mentioned Address Book script into a text file and opened it with a Hex Editor.
Very strange: there are two mixed encodings. The block comments are UTF-16; the actual code is in an 8-bit encoding.
I guess your way of filtering the null bytes is probably the best.
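As a rough illustration of why filtering the null bytes works on a mixed file, here is a small Python sketch. The sample comment and code line are made up for the demonstration, and the helper name strip_nuls is hypothetical. It also assumes the UTF-16 portion holds only ASCII-range characters, since any character outside that range would be damaged by this approach:

```python
def strip_nuls(data: bytes) -> bytes:
    """Drop every zero byte, leaving the remaining bytes untouched."""
    return data.replace(b"\x00", b"")

# Made-up sample: a block comment stored as UTF-16 followed by 8-bit code.
comment = "(* note *)".encode("utf-16-be")
code = b'\nset a to "you"\n'
mixed = comment + code

cleaned = strip_nuls(mixed)
print(cleaned.decode("latin-1"))
```

Because the two encodings are interleaved in one stream, decoding the whole thing as either UTF-16 or an 8-bit encoding would garble one of the parts, which is why a byte-level filter is the simpler option here.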
Thanks for looking at that.
That’s funny about only the comments being UTF-16. Probably somebody pasted the comments in from a different source other than Script Editor. That problem gave me headaches. I was messing with my code when I stumbled on it and so at first I thought I had caused the problem. It took a long time before I realized something else was going on. And I’m glad you were able to see it. At least I know I’m not crazy. On a fast computer that code I’m using is hardly noticeable, so I’m not in too bad shape with it.
Thanks again.
Try this on some problem scripts and see if it works.
I had success on the Address Book script.
set the_path to POSIX path of (path to desktop as Unicode text) & "test.scpt"
set y to do shell script "osadecompile " & quoted form of the_path & " | /usr/bin/ruby -e 'puts STDIN.read.gsub(/\\000/, \"\")'"
Cheers,
Craig
As things now stand, we are forced to deal with UTF-8 (with and without BOM), UTF-16 (again with and without BOM), and ASCII (7-bit and 8-bit). The first person who writes a “detector/converter” will do us all a service.
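A very rough sketch of what such a detector/converter might look like, written in Python. Everything in it is an assumption for illustration: the function names guess_encoding and to_text are hypothetical, the BOM-less UTF-16 check is a heuristic that only works for mostly-ASCII text, and falling back to MacRoman for unrecognized 8-bit data is a guess rather than a rule:

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Return a Python codec name that is a plausible match for data."""
    # A byte order mark settles the question when present.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # this codec reads the BOM to pick the byte order

    # Heuristic for BOM-less UTF-16: mostly-ASCII text has a zero byte in
    # every other position. Even offsets suggest big-endian, odd little-endian.
    half = len(data) // 2
    if half:
        even_nuls = data[0::2].count(0)
        odd_nuls = data[1::2].count(0)
        if even_nuls > half // 2 and even_nuls > odd_nuls:
            return "utf-16-be"
        if odd_nuls > half // 2 and odd_nuls > even_nuls:
            return "utf-16-le"

    # Try the strictest encodings first.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            pass

    # Assumption: remaining 8-bit text is MacRoman.
    return "mac_roman"

def to_text(data: bytes) -> str:
    """Decode data using the guessed encoding."""
    return data.decode(guess_encoding(data))

print(guess_encoding("you".encode("utf-16-le")))
print(to_text("you".encode("utf-16-be")))
```

Note that a detector like this treats the whole input as one encoding, so it would not help with a stream that mixes UTF-16 and 8-bit sections.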
Hi Craig, thanks. I tried it and it works. So then I decided to run a speed test to see if it was faster than my method. I did a “lotsa” test of 200 iterations on each method. Here are the results:
200 iterations using that address book script as the sample script
Using osadecompile and ruby: 68.702 secs
Using osadecompile and my subroutine: 72.656 secs
So about a 4 second difference on 200 scripts. Normally I would use your method considering the results, but I’m not good with ruby so I’m afraid I couldn’t troubleshoot and adjust it if I had problems. Therefore since the tests were relatively close, and considering I understand my code, I think I’ll stick to my method for now. If there was a bigger difference in the time then I might change.
But I can see you’re learning ruby like you said! Did you run into any text encoding issues on your stuff?
==Update==
I just ran the osadecompile process 200 times on that script by itself without any conversion and the time was: 67.948 secs
So obviously osadecompile contributes most of the time. Also we can conclude that the ruby call contributes almost nothing to the time. My code adds a small hit to each script. But in this case it’s the combined time that matters to me so I’ll accept the small time penalty for using my code because I can understand and work with it.
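For anyone who wants the per-script numbers, the three timings above can be broken down with a few lines of Python. This is only arithmetic on the figures already posted, and it assumes the three runs are directly comparable:

```python
ITERATIONS = 200

# Total times in seconds, taken from the posts above.
baseline = 67.948         # osadecompile alone
with_ruby = 68.702        # osadecompile piped through ruby
with_subroutine = 72.656  # osadecompile plus the AppleScript subroutine

def overhead_ms(total: float) -> float:
    """Extra time per script, in milliseconds, beyond osadecompile itself."""
    return (total - baseline) / ITERATIONS * 1000

print(f"osadecompile alone: {baseline / ITERATIONS * 1000:.2f} ms per script")
print(f"ruby filter adds:   {overhead_ms(with_ruby):.2f} ms per script")
print(f"subroutine adds:    {overhead_ms(with_subroutine):.2f} ms per script")
```

Run-to-run variation was not measured, so small differences like the ruby overhead may be within the noise.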