XML handling question

I’m writing an application that creates an XML document which I can either save to disk or send to a webservice. The document needs ISO-8859-1 as its encoding, and with XMLLib as the XML parser I’ve gotten the save-to-disk part working perfectly. The problem comes when I send it to the webservice: I send the XML data as a string, and the XMLLib command for returning an XML object as a string (XMLDisplayXML) ignores the specified encoding and always uses UTF-8. I tried using do shell script with cat to produce the string instead, but it cannot handle certain characters and prints them as question marks. (These are the same characters that force me to use ISO-8859-1 to begin with.) I also tried XML Tools, which seems like it should work, but no matter what I do it won’t generate the string I want.

set temp_path to ((path to startup disk) as text) & "temp.xml"
set theXML to parse XML alias temp_path encoding "ISO-8859-1"
(*temp.xml already has the correct encoding but for some reason XML Tools doesn't recognize it. The "encoding" parameter is supposed to set the encoding to whatever I specify regardless of this, however.*)
set xmldata to generate XML theXML with including XML declaration

It’s the xmldata string that I send to the webservice. For some reason the encoding part doesn’t get included. I tried a more complicated approach:

set temp_path to ((path to startup disk) as text) & "temp.xml"
set theXML to parse XML alias temp_path
set tempxmldata to generate XML theXML without including XML declaration
set theXML to {XML encoding:"ISO-8859-1", XML tag:tempxmldata}
set xmldata to generate XML theXML with including XML declaration

Which does everything correctly apart from adding </> to the outside of the string. That is, it generates:

"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>
<

datagoeshere
moredatagoeshere

   etc. etc.
[b]/>[/b]"

It should generate the same thing without the characters I marked in bold.

Any help? The only thing I need to do is to generate XML data as a string while having the encoding of my choice (ISO-8859-1). It really shouldn’t be this troublesome.

I think you can use “XMLSetEncoding” with XMLLib, then “XMLSave” to disk, then send to the webservice the result of:

read alias "path:to:file.xml"

(instead of using XMLDisplayXML)

Thanks for the reply. That seemed like the perfect solution, but for some reason it translates the ® (registered trademark) characters in my XML data to Æ (combined A and E).

I did get the code to work by using XML Tools to first generate the data without the XML declaration and then adding it to the string manually:

set xmldata to generate XML theXML without including XML declaration and pretty printing
set xmldata to "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" & return & xmldata

But I would prefer to do it without XML Tools. (It seems unnecessary to distribute an extra scripting addition with the script for just this one thing.) Is there any way to make the read command accept ® characters as well? Alternatively, I could take the string returned by XMLDisplayXML (which does display the trademark characters until you send it to the webservice) and manually change the first line from “<?xml version=\"1.0\" encoding=\"UTF-8\"?>” to “<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>”. I tried “set first line of…” but for some reason it considers each character in the document to be on a separate line. I tried changing the text item delimiters to return and changing the first text item, but then it considers the entire document to be the first text item. (It doesn’t even seem like you can change specific text items of something anyway.) Does anyone know of an efficient way of using string handling to change the XML declaration line from encoding UTF-8 to encoding ISO-8859-1?
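One way the declaration swap asked about here could be sketched is to skip line handling altogether: find where the declaration ends with offset and glue a new declaration onto the rest of the text. The handler name replaceDeclaration and the sample string are made up for illustration; this is a sketch, not something XMLLib provides. Note that it only changes the label in the declaration. It does not re-encode any bytes in the string.

```applescript
-- Hypothetical handler: replace the XML declaration at the start of xmlText
-- with one that names newEncoding. The text after the declaration is kept as-is.
on replaceDeclaration(xmlText, newEncoding)
	-- offset returns the 1-based position of the first "?>", or 0 if absent
	set declEnd to offset of "?>" in xmlText
	if declEnd is 0 or xmlText does not start with "<?xml" then
		-- no declaration at the start: leave the text untouched
		return xmlText
	end if
	set newDecl to "<?xml version=\"1.0\" encoding=\"" & newEncoding & "\"?>"
	-- "?>" occupies positions declEnd and declEnd + 1
	if declEnd + 2 > (length of xmlText) then return newDecl
	return newDecl & (text (declEnd + 2) thru -1 of xmlText)
end replaceDeclaration

-- Sample input standing in for the string returned by XMLDisplayXML
set sampleXML to "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" & return & "<root/>"
set xmldata to replaceDeclaration(sampleXML, "ISO-8859-1")
```

Because the handler never splits the document into lines or text items, it does not depend on which line-ending character the XML uses.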

Yes. “Æ” is how Mac Roman (“x-mac-roman”) displays the byte that ISO-8859-1 uses for “®”. So if you pass an ISO-8859-1 encoded document to any platform, that byte will always be interpreted as “®”. On the other hand, if you declare your XML as ISO-8859-1 but pass the Mac Roman “®” instead of “Æ” (that is, a Mac Roman string where an ISO-8859-1 string is expected), it will be interpreted as “¨”.

Try this. Open this in Firefox, Dreamweaver or similar (something that reads XML files properly, not as plain ASCII):

You will see that the thing between “data” tags is displayed as “®”.

I created the document you suggested with the following code (I tried saving it directly with the Text Editor but it wouldn’t work):

set the_xml to XMLOpen "<root>
   <element>
       <data>Æ</data>
   </element>
</root>"
XMLSetEncoding the_xml to "ISO-8859-1"
set temp_path to ((path to startup disk) as text) & "test.xml"
XMLSave the_xml in file temp_path
XMLClose the_xml

Opening the document with Text Editor I get

Opening with Firefox I get

And reading it with read alias I get:

Using the read alias code in my original application I get the ® character as Æ in the document passed through the webservice. What am I doing wrong? Using the ¨ character in place of Æ causes read alias to generate a ®, but I don’t know how to apply this in my own application since I can’t really edit the xml document to replace the ® with ¨ before using read alias.

Well, I don’t know what kind of data your target webservice expects. But if you declare ISO-8859-1 as the encoding in your <?xml declaration, I understand you must pass your text encoded as ISO-8859-1 (which means that, viewed as Mac Roman, “®” will appear as “Æ”, “á” as “·”, and “ç” as “Á”).

Theoretically, the webservice will read these bytes and decode the text from ISO-8859-1 to its own encoding (e.g. mac-roman, utf-8, ISO-8859-1 itself…).

Just forget the webservice for a minute. For example, if you type “®” in a plain text editor on the Mac and pass the file to a Windows machine, then open it in Notepad, you will see “¨” instead of “®”.

If you need it to show as “®” on the Windows machine, though, you must write “Æ” on your Mac. If you do so, the Windows machine will read it as “®”.

Your text file doesn’t carry a real “®” character, only the equivalent of “ASCII number X”. The Mac will read ASCII 168 as “®”, while the Windows machine will read it as “¨”.

So, in an XML document, you tell the interpreter which encoding to read your text as. If you say it’s ISO-8859-1, it will read “®” when you pass what the Mac shows as “Æ” (byte 174). If you say it’s mac-roman, it will read that same byte as “Æ”.
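The byte values behind this explanation can be checked from a script. The sketch below is only an illustration: decodeByte is a made-up helper, and it assumes a Mac OS X version where do shell script returns its output as UTF-8 text and where the system iconv knows the encoding names ISO-8859-1 and MACINTOSH (Mac Roman).

```applescript
-- Hypothetical helper: emit a single byte (two hex digits) in the shell
-- and let iconv decode it from encodingName into UTF-8.
on decodeByte(hexDigits, encodingName)
	set cmd to "printf '\\x" & hexDigits & "' | iconv -f " & encodingName & " -t UTF-8"
	return do shell script cmd
end decodeByte

set asLatin1 to decodeByte("ae", "ISO-8859-1") -- byte 174 is ® in ISO-8859-1
set asMacRoman to decodeByte("ae", "MACINTOSH") -- byte 174 is Æ in Mac Roman
set macRegistered to decodeByte("a8", "MACINTOSH") -- byte 168 is ® in Mac Roman
set latin1Diaeresis to decodeByte("a8", "ISO-8859-1") -- byte 168 is ¨ in ISO-8859-1
{asLatin1, asMacRoman, macRegistered, latin1Diaeresis}
```

The same byte comes back as a different character depending only on the encoding name handed to iconv, which is the effect the encoding attribute of the XML declaration has on the receiving side.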

I would test the various combinations with the webservice, always passing the result of (read alias “path:to:file.xml”) with the various encodings (especially ISO-8859-1, if you say this is what the webservice is waiting for).

Well, the webservice normally expects UTF-8 encoded documents. It’s just that it has only received them from PCs up to now, and when I send them from my newly created Mac client I have to change the encoding or the ® characters in them cause a problem for the receiving application.

I’d try some different combinations, but the XML document is created at runtime (it contains a list of all installed software on the computer, and some of the names contain ® characters). If I had to manually replace the ® characters with Æ or ¨ it would be more work than necessary since, as I mentioned, I managed to solve the problem with XML Tools. (Unless of course there are some efficient routines in AppleScript for replacing all instances of a specific character in a text with another character, but I don’t know of any personally. I’d rather not loop through every character in the document, as it would take unnecessarily long.)

Thanks for your help in either case, I’m learning some things about encodings here.

You should look at this thread: Find & Replace
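For reference, a common AppleScript idiom for replacing every occurrence of one string with another uses text item delimiters, so no character-by-character loop is needed. This is a minimal sketch and not necessarily the code from the linked thread; the handler name replaceText is made up here.

```applescript
-- Hypothetical handler: replace every occurrence of searchText
-- in sourceText with replacementText.
on replaceText(sourceText, searchText, replacementText)
	-- remember the current delimiters so they can be restored afterwards
	set savedDelims to AppleScript's text item delimiters
	-- split the text wherever searchText occurs
	set AppleScript's text item delimiters to searchText
	set pieces to text items of sourceText
	-- join the pieces back together with replacementText in between
	set AppleScript's text item delimiters to replacementText
	set resultText to pieces as text
	set AppleScript's text item delimiters to savedDelims
	return resultText
end replaceText

replaceText("Acrobat® Reader®", "®", "(R)")
-- result: "Acrobat(R) Reader(R)"
```

The split and the join each happen in a single operation, so the whole document is handled in one pass.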

Thanks, that’s perfect!