UTF-8 vs. Other encoding problem

I wrote a program that automates the process of getting lyrics for the selected iTunes song.

The problem is, the final lyrics look like this:

Which I recognize is an encoding problem. In order to get the lyrics off of the website, I download the entire webpage source, extract everything between a startTag and endTag and then strip all the HTML using this code:


set plaintextLyrics to my (removeMarkup from strippedLyrics)

to removeMarkup from someText -- strip HTML using textutil
	set someText to quoted form of ("<!DOCTYPE HTML PUBLIC>" & someText) -- fake a HTML document header
	return (do shell script "echo " & someText & " | /usr/bin/textutil -stdin -convert txt -stdout") -- strip HTML
end removeMarkup

I have seen discussions of how to change encodings for files that are written to disk, but nothing on what to do with a string within AppleScript.

Is there a way to find/replace or otherwise fix the offending apostrophe and any other characters that are messed up?

Hi.

Using textutil’s -inputencoding option to tell it the input’s UTF-8 seems to help:


set plaintextLyrics to my (removeMarkup from "What's going on in that beautiful mind<p>
I'm on your magical mystery ride<p>
“Posh quotes”<p>
La Niña<p>
Łódź")

to removeMarkup from someText -- strip HTML using textutil
	set someText to quoted form of ("<!DOCTYPE HTML PUBLIC>" & someText) -- fake a HTML document header
	return (do shell script "echo " & someText & " | /usr/bin/textutil -stdin -inputencoding UTF-8 -convert txt -stdout") -- strip HTML
end removeMarkup
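As far as I know, do shell script always hands the command text to the shell as UTF-8, so the -inputencoding UTF-8 above should stay as it is whatever encoding the website uses. If a site serves its pages in something else, such as ISO-8859-1, the place to deal with that is probably the download step. Here's a rough sketch, assuming the page is fetched with curl (I don't know how your script actually gets it). The handler name fetchPageSource and its parameters are made up for illustration:

```applescript
-- Hypothetical helper: download a page and convert it to UTF-8 before AppleScript sees it.
-- pageEncoding is the encoding the site is assumed to use, e.g. "ISO-8859-1".
to fetchPageSource(pageURL, pageEncoding)
	-- curl writes the raw bytes; iconv re-encodes them as UTF-8,
	-- which is what do shell script expects when it reads the result.
	return (do shell script "/usr/bin/curl -s " & quoted form of pageURL & " | /usr/bin/iconv -f " & quoted form of pageEncoding & " -t UTF-8")
end fetchPageSource
```

The result can then go through the startTag/endTag extraction and removeMarkup exactly as before.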

Thank you so much Nigel!

I had come across textutil’s option for specifying UTF-8 input, but apparently I had mangled the syntax. This version works perfectly (until I find some lyrics site that requires ISO-8859-1, where forcing UTF-8 messes things up and the option has to be removed… why don’t songs just come with lyrics?).

Thank you once again! I hope this thread helps someone in the future.