NOTE: In the discussion section only (not the handler code), an ASCII 0 null character has been inserted between “&” and the remaining characters of HTML entities to prevent the HTML entities from being decoded into their original Unicode characters by the web browser. This won’t affect the “Open this Scriptlet in your editor” sections, whose AppleScript code may be copied and used as is.
The current handler is an HTML decoder written in AppleScript. It converts HTML entities in an HTML document to their underlying Unicode characters. HTML entities may take the form of character reference entities such as &copy; , decimal numeric character entities such as &#169; , and hexadecimal numeric character entities such as &#xA9; , all of which represent the copyright character © (Unicode code point 169). (Note that the current handler is not a URL decoder, with which HTML decoders are sometimes confused. URL decoders handle percent-encoded strings like http://ascii.cl?parameter=%22Click+on+%27URL+Decode%27%21%22 .)
The handler can decode the following:
- Any valid decimal or hexadecimal numeric character entity
- All 252 character reference entities in the HTML 4 entity set
- The five predefined XML entities &quot;, &amp;, &apos;, &lt;, and &gt;, representing the double quotation mark, ampersand, apostrophe, less-than sign, and greater-than sign (Unicode code points 34, 38, 39, 60, and 62, respectively), which apart from &apos; are a subset of the HTML 4 entity set
The handler can process input text strings of at least 1,000,000 characters containing at least 8,000 HTML entities (the limits of testing thus far). It takes advantage of several features to maximize execution speed: AppleScript’s text item delimiters for searching and replacing, the reconstruction of characters from their Unicode code points via the character id … specifier, and hashed access to a master property list of character reference entities.
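The first two of these features can be illustrated with a minimal sketch. The replaceText handler below is a hypothetical helper written for this illustration only; it is not part of decodeHtml, which applies the same techniques inline.

```applescript
-- Hypothetical helper (not part of decodeHtml): search and replace via text item delimiters
on replaceText(sourceText, searchText, replacementText)
	set tid to AppleScript's text item delimiters
	-- Split the source text at every occurrence of the search text...
	set AppleScript's text item delimiters to searchText
	set theItems to sourceText's text items
	-- ...then rejoin the pieces with the replacement text
	set AppleScript's text item delimiters to replacementText
	set resultText to theItems as text
	-- Always restore the original delimiters
	set AppleScript's text item delimiters to tid
	return resultText
end replaceText

replaceText("Fish & Chips & Peas", "&", "and") --> "Fish and Chips and Peas"

-- Reconstructing a character from its Unicode code point
character id 169 --> "©"
```

Both operations run inside AppleScript itself, with no shell or application calls, which is presumably why they are fast for small inputs.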
The handler takes an AppleScript record as its input argument, with one required property and two optional boolean properties:
Required property:
    htmlString - the text string to be decoded
Optional boolean properties:
    condenseWhitespaces
        true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
        false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
    decodeAngleBracketsPerHtml5
        true (the default value if the property is omitted) - decodes the character reference entities &lang; and &rang; as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
        false - decodes &lang; and &rang; as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)
The handler’s return value is the decoded text string.
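The default values for the optional properties come from record concatenation: when two records are joined with &, a property present in the left-hand record takes precedence over a property with the same label in the right-hand record. A minimal sketch of that idea follows; the withDefaults handler name is hypothetical and used only for illustration.

```applescript
-- Hypothetical demo handler: fill in omitted optional properties with default values
on withDefaults(userRecord)
	-- Properties already present in userRecord win over the defaults on the right
	return userRecord & {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true}
end withDefaults

withDefaults({htmlString:"abc", condenseWhitespaces:true})
--> {htmlString:"abc", condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}
```

Because the merged record should contain exactly three properties, a length check on it also catches misspelled or unexpected property labels.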
Several HTML decoding solutions are available on the Mac OS X platform, including Cocoa’s NSAttributedString class used in association with the NSHTMLTextDocumentType attribute, the decode_entities function of Perl’s HTML::Entities module, the unescape function of Python’s HTMLParser module, and PHP’s html_entity_decode function. The current handler offers the following potential advantages:
(1) Offers more granular control over whitespace handling than the alternative solutions (to my knowledge, the Cocoa solution always implements HTML 5 behavior and condenses consecutive whitespace characters)
(2) Does not strip HTML tags, in contrast with the Cocoa solution which strips HTML tags
(3) For input text strings with fewer than about 250 HTML entities to be decoded:
- Executes faster than the Perl, Python, and PHP solutions for input text strings smaller than about 100,000 characters in length
- Executes at speeds similar to the Perl, Python, and PHP solutions for input text strings greater than about 100,000 characters in length
(4) For input text strings with fewer than about 25 HTML entities to be decoded:
- Executes faster than the Cocoa solution for input text strings smaller than about 10,000 characters in length
- Executes at speeds similar to the Cocoa solution for input text strings greater than about 10,000 characters in length
The following are known limitations of the current handler:
(1) Does not decode incompletely formed HTML entities, specifically those lacking the trailing semicolon character (this mimics the behavior of the PHP decoder)
(2) Does not recognize the more than 2,000 new named character references in the HTML 5 standard, which form a superset of the HTML 4 entity set and which so far seem to be little used on web pages (the hashed property list required to handle this many items would exceed AppleScript’s list size limit). NOTE: the current handler does decode the newer entities, and in fact all HTML entities, if they are written in decimal or hexadecimal numeric form
(3) Executes more slowly than the alternative solutions for input text strings with more than about 250 HTML entities to be decoded (for example, execution times to decode a Wikipedia article ~900,000 characters in length and containing ~4550 HTML entities: Perl ~0.5 sec, Cocoa ~1.5 sec, current AppleScript handler ~2.5 sec)
(4) Slows down significantly if the condenseWhitespaces input argument is set to true and the input text string contains a substantial number of long consecutive sequences of whitespace characters
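Limitations (1) and (2) can be seen in a short sketch. It assumes that the decodeHtml handler from this post is present in the same script, and the sample strings are made up for illustration. The result shown is what the handler's logic should produce.

```applescript
-- Assumes the decodeHtml handler from this post is present in the same script
set noSemicolon to "&" & "copy 2015" -- limitation (1): the entity lacks its trailing semicolon
set html5Named to "&" & "check;" -- limitation (2): a named reference that exists only in HTML 5
set html5Numeric to "&" & "#10003;" -- the same check mark character (code point 10003) in decimal numeric form

decodeHtml({htmlString:noSemicolon & " | " & html5Named & " | " & html5Numeric})
--> "&copy 2015 | &check; | ✓"
```

As in the other examples, "&" is kept separate from the rest of each entity only to prevent decoding by the web browser.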
HANDLER EXAMPLES HIGHLIGHTING ALTERNATIVE HTML ENTITY FORMS AND INPUT ARGUMENT SETTINGS:
set leftAngleBracket to "&" & "lang;"
set rightAngleBracket to "&" & "rang;"
set rightArrow to "&" & "#x2794;"
set ellipsis1 to "&" & "hellip;"
set ellipsis2 to "&" & "#8230;"
set ellipsis3 to "&" & "#x2026;"
set spades to "&" & "#9824;"
set hearts to "&" & "hearts;"
set clubs to "&" & "clubs;"
set diamonds to "&" & "#x2666;"
set str to leftAngleBracket & "Suits of a deck of cards" & rightAngleBracket & " " & rightArrow & tab & tab & tab & space & space & space & ellipsis1 & " " & spades & " " & ellipsis2 & " " & hearts & " " & ellipsis3 & " " & clubs & " " & ellipsis2 & " " & diamonds & " " & ellipsis1
decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
-- or equivalently and taking advantage of default values for optional properties --
decodeHtml({htmlString:str}) -->
"⟨Suits of a deck of cards⟩ ➔ … ♠ … ♥ … ♣ … ♦ …"
decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:false}) -->
"〈Suits of a deck of cards〉 ➔ … ♠ … ♥ … ♣ … ♦ …"
decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}) -->
"⟨Suits of a deck of cards⟩ ➔ … ♠ … ♥ … ♣ … ♦ …"
decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:false}) -->
"〈Suits of a deck of cards〉 ➔ … ♠ … ♥ … ♣ … ♦ …"
(* Notes:
- "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
- ellipsis1, ellipsis2, and ellipsis3 are the character, decimal numeric, and hexadecimal numeric forms of the same ellipsis character (Unicode code point 8230)
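- Angle brackets are Unicode code points 10216 and 10217 in the first and third output strings, and Unicode code points 9001 and 9002 in the second and fourth output strings
- With condenseWhitespaces set to false (first and second output strings), the three tabs and three spaces between ➔ and the first … are preserved, although they may display as a single space on this page; with condenseWhitespaces set to true (third and fourth output strings), they are condensed into a single space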
*)
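Because the two kinds of angle brackets look nearly identical on screen, one way to check which was produced is to ask for the decoded character's code point with the id property. This is a minimal sketch that assumes the decodeHtml handler from this post is present in the same script.

```applescript
-- Assumes the decodeHtml handler from this post is present in the same script
set leftAngleBracket to "&" & "lang;"

set html5Bracket to decodeHtml({htmlString:leftAngleBracket})
id of html5Bracket --> 10216 (HTML 5 behavior, the default)

set html4Bracket to decodeHtml({htmlString:leftAngleBracket, decodeAngleBracketsPerHtml5:false})
id of html4Bracket --> 9001 (HTML 4 behavior)
```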
HANDLER EXAMPLE HIGHLIGHTING THE DETECTION OF INVALID HTML ENTITIES:
set validCharEntity to "&" & "copy;"
set invalidCharEntity to "&" & "copy123;"
set validDecEntity to "&" & "#169;"
set invalidDecEntity to "&" & "#A69;"
set validHexEntity to "&" & "#xA9;"
set invalidHexEntity to "&" & "#xZ9;"
set str to "" & ¬
"Valid character entity: " & validCharEntity & return & ¬
"Nonexistent character entity and thus not decoded: " & invalidCharEntity & return & ¬
"Valid decimal numeric entity: " & validDecEntity & return & ¬
"Invalid decimal numeric entity and thus not decoded: " & invalidDecEntity & return & ¬
"Valid hexadecimal numeric entity: " & validHexEntity & return & ¬
"Invalid hexadecimal numeric entity and thus not decoded: " & invalidHexEntity
decodeHtml({htmlString:str, condenseWhitespaces:false}) -->
"Valid character entity: ©
Nonexistent character entity and thus not decoded: &copy123;
Valid decimal numeric entity: ©
Invalid decimal numeric entity and thus not decoded: &#A69;
Valid hexadecimal numeric entity: ©
Invalid hexadecimal numeric entity and thus not decoded: &#xZ9;"
(* Note:
- "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
*)
HANDLER:
on decodeHtml(handlerArgument)
(*
- This handler decodes character, decimal numeric, and hexadecimal numeric html entities in an html string
- It takes an AppleScript record as its input argument with one required and two optional boolean properties:
Required property:
htmlString - the text string to be decoded
Optional boolean properties:
condenseWhitespaces
true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
decodeAngleBracketsPerHtml5
true (the default value if the property is omitted) - decodes the character reference entities "lang" and "rang" as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
false - decodes "lang" and "rang" as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)
- It returns the decoded html string
*)
-- Utility script
script util
-- Comprehensive character-entity-based hashed list of HTML-4/XHTML entities, each sublist containing a given html entity's character and decimal numeric forms in positions 1 and 2, respectively
-- An entity's index in this list is generated by the hashFunction handler with the entity's character form as the handler argument
-- The decimal numeric forms for "lang" and "rang" are initialized to the null value and will be set during runtime to either their HTML 4 ("9001", "9002") or HTML 5 ("10216", "10217") values depending on the value of the input argument decodeAngleBracketsPerHtml5
property hashedHtmlEntities : {"", "", "", "", "", "", "", "", {"and", "8743"}, "", "", "", "", {"int", "8747"}, "", "", "", "", {"Rho", "929"}, "", "", "", "", "", {"iota", "953"}, "", "", "", {"psi", "968"}, {"prod", "8719"}, "", "", "", {"not", "172"}, {"prop", "8733"}, "", "", "", {"phi", "966"}, {"sdot", "8901"}, {"theta", "952"}, {"Scaron", "352"}, "", {"amp", "38"}, {"ensp", "8194"}, {"Theta", "920"}, {"there4", "8756"}, "", "", {"isin", "8712"}, "", {"thinsp", "8201"}, "", "", "", {"omega", "969"}, {"scaron", "353"}, "", "", "", {"trade", "8482"}, "", "", {"Chi", "935"}, "", {"thorn", "254"}, "", "", {"sup", "8835"}, {"emsp", "8195"}, {"prime", "8242"}, "", "", "", {"sup1", "185"}, {"image", "8465"}, "", "", "", {"supe", "8839"}, {"pound", "163"}, "", "", {"chi", "967"}, {"sup3", "179"}, {"notin", "8713"}, "", "", "", "", {"kappa", "954"}, "", "", {"eta", "951"}, "", {"Kappa", "922"}, {"otilde", "245"}, "", {"cup", "8746"}, {"sup2", "178"}, "", {"atilde", "227"}, {"Mu", "924"}, {"rho", "961"}, {"nbsp", "160"}, {"acute", "180"}, "", "", "", "", "", {"Ntilde", "209"}, {"or", "8744"}, {"loz", "9674"}, "", {"ocirc", "244"}, {"otimes", "8855"}, {"Nu", "925"}, "", "", {"acirc", "226"}, {"ntilde", "241"}, "", {"cap", "8745"}, "", {"icirc", "238"}, "", {"nu", "957"}, "", "", {"ecirc", "234"}, {"oacute", "243"}, "", {"Psi", "936"}, {"sube", "8838"}, {"asymp", "8776"}, {"aacute", "225"}, "", "", "", "", {"iacute", "237"}, "", {"Phi", "934"}, {"euro", "8364"}, {"exist", "8707"}, {"eacute", "233"}, "", "", {"ordm", "186"}, {"alpha", "945"}, {"Yacute", "221"}, "", {"Zeta", "918"}, {"nsub", "8836"}, "", {"Ccedil", "199"}, {"omicron", "959"}, {"zeta", "950"}, {"part", "8706"}, {"nabla", "8711"}, "", {"lt", "60"}, {"thetasym", "977"}, {"para", "182"}, {"Omega", "937"}, {"rsaquo", "8250"}, "", {"ordf", "170"}, "", {"oline", "8254"}, {"lsaquo", "8249"}, "", {"Eta", "919"}, "", {"Prime", "8243"}, {"ccedil", "231"}, "", {"sub", "8834"}, {"copy", "169"}, {"ucirc", "251"}, 
{"lowast", "8727"}, {"gt", "62"}, "", "", {"frac14", "188"}, {"ne", "8800"}, "", "", "", "", {"iquest", "191"}, "", {"tau", "964"}, {"Iota", "921"}, {"frac34", "190"}, {"uacute", "250"}, "", {"Tau", "932"}, {"cong", "8773"}, {"Gamma", "915"}, {"Lambda", "923"}, "", "", "", "", {"Otilde", "213"}, "", {"ETH", "208"}, {"infin", "8734"}, {"Ecirc", "202"}, "", "", {"beta", "946"}, "", {"Ucirc", "219"}, {"brvbar", "166"}, "", {"sect", "167"}, "", {"frac12", "189"}, {"curren", "164"}, "", {"cent", "162"}, "", {"Ocirc", "212"}, {"Eacute", "201"}, {"mu", "956"}, "", "", "", {"Uacute", "218"}, "", "", "", "", "", {"Xi", "926"}, {"ang", "8736"}, "", "", {"Oacute", "211"}, {"pi", "960"}, "", {"darr", "8595"}, {"equiv", "8801"}, {"yacute", "253"}, {"apos", "39"}, {"perp", "8869"}, "", "", "", "", "", {"delta", "948"}, {"radic", "8730"}, {"le", "8804"}, {"quot", "34"}, "", {"ouml", "246"}, {"crarr", "8629"}, "", {"ni", "8715"}, {"shy", "173"}, {"auml", "228"}, "", "", {"Omicron", "927"}, "", {"iuml", "239"}, {"aring", "229"}, {"Atilde", "195"}, "", "", {"euml", "235"}, {"diams", "9830"}, {"ge", "8805"}, "", "", {"Yuml", "376"}, {"empty", "8709"}, {"divide", "247"}, {"xi", "958"}, {"uml", "168"}, {"spades", "9824"}, {"clubs", "9827"}, {"dagger", "8224"}, "", {"Beta", "914"}, {"bull", "8226"}, {"Acirc", "194"}, {"lambda", "955"}, {"fnof", "402"}, {"sbquo", "8218"}, {"rang", null, "9002", "10217"}, {"Icirc", "206"}, "", {"alefsym", "8501"}, {"bdquo", "8222"}, {"lang", null, "9001", "10216"}, {"rceil", "8969"}, "", "", {"piv", "982"}, {"zwnj", "8204"}, {"lceil", "8968"}, {"Aacute", "193"}, "", {"sum", "8721"}, {"uarr", "8593"}, {"weierp", "8472"}, {"Iacute", "205"}, {"yen", "165"}, {"rsquo", "8217"}, {"Delta", "916"}, {"gamma", "947"}, "", "", {"lsquo", "8216"}, {"dArr", "8659"}, {"Alpha", "913"}, "", "", "", {"uuml", "252"}, "", "", "", {"rdquo", "8221"}, {"macr", "175"}, {"THORN", "222"}, "", "", {"ldquo", "8220"}, {"rarr", "8594"}, "", {"oslash", "248"}, "", {"real", "8476"}, 
{"larr", "8592"}, "", "", "", "", "", "", {"Dagger", "8225"}, {"Pi", "928"}, "", "", {"permil", "8240"}, {"plusmn", "177"}, "", "", {"Euml", "203"}, {"tilde", "732"}, {"middot", "183"}, "", {"oplus", "8853"}, {"Uuml", "220"}, {"Sigma", "931"}, {"ograve", "242"}, "", "", {"frasl", "8260"}, {"szlig", "223"}, {"agrave", "224"}, "", {"lrm", "8206"}, {"Ouml", "214"}, "", {"igrave", "236"}, "", {"raquo", "187"}, {"yuml", "255"}, {"sigma", "963"}, {"egrave", "232"}, {"deg", "176"}, {"laquo", "171"}, "", {"epsilon", "949"}, "", "", "", {"uArr", "8657"}, "", "", "", "", {"cedil", "184"}, {"hearts", "9829"}, "", "", "", {"iexcl", "161"}, {"times", "215"}, {"rfloor", "8971"}, "", "", "", "", {"lfloor", "8970"}, "", "", "", {"micro", "181"}, "", "", "", {"rArr", "8658"}, "", "", "", "", {"lArr", "8656"}, "", "", "", "", {"circ", "710"}, {"minus", "8722"}, "", "", "", "", "", {"ugrave", "249"}, "", "", "", {"upsilon", "965"}, "", "", "", {"Auml", "196"}, {"forall", "8704"}, "", "", "", {"Iuml", "207"}, {"Aring", "197"}, "", "", "", "", {"OElig", "338"}, {"Oslash", "216"}, "", "", "", "", "", "", "", "", "", {"Egrave", "200"}, "", "", {"harr", "8596"}, {"Epsilon", "917"}, {"Ugrave", "217"}, "", "", "", {"Upsilon", "933"}, "", {"reg", "174"}, {"rlm", "8207"}, "", "", {"Ograve", "210"}, "", "", {"oelig", "339"}, {"hellip", "8230"}, "", "", "", {"aelig", "230"}, {"ndash", "8211"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"sim", "8764"}, "", "", "", "", {"zwj", "8205"}, "", "", "", "", "", "", {"AElig", "198"}, "", "", {"eth", "240"}, "", {"sigmaf", "962"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"hArr", "8660"}, "", {"Agrave", "192"}, "", "", "", "", {"Igrave", "204"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"mdash", "8212"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"upsih", "978"}}
-- Terms used by the hashFunction handler in generating a character entity's index in the hashedHtmlEntities list
property hashFunctionTerms : {739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 10, 35, 20, 0, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 175, 135, 30, 60, 95, 5, 0, 5, 180, 739, 15, 5, 0, 15, 110, 110, 739, 5, 5, 5, 100, 739, 739, 0, 20, 0, 739, 739, 739, 739, 739, 739, 5, 60, 50, 0, 15, 144, 115, 215, 10, 225, 10, 95, 125, 25, 0, 5, 218, 90, 20, 0, 65, 35, 55, 45, 115, 5, 15, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739}
-- Utility handlers
on hashFunction(theKey)
-- Hash function that takes as input a character reference entity and returns the hashedHtmlEntities list index and value for that key (or null values if the input key does not match a list item)
-- The input key is the alphanumeric content of a character reference entity, e.g., "amp" for the ampersand character "&"
-- The hash function was obtained by running the GNU gperf perfect hash function generator on the complete list of HTML-4/XHTML character reference entities and transforming the C code into equivalent AppleScript code
-- The hash function uses the hashFunctionTerms numeric list in the construction of the list index from the key
try
tell (get theKey's id)
set keyLength to length
set hashIndex to (my hashFunctionTerms's item ((item keyLength) + 1)) + keyLength + 1
if keyLength > 0 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 1) + 1))
if keyLength > 1 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 2) + 2))
if keyLength > 2 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 3) + 1))
if keyLength > 4 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 5) + 1))
end if
end if
end if
end if
end tell
set htmlEntity to my hashedHtmlEntities's item hashIndex
on error
set {hashIndex, htmlEntity} to {null, null}
end try
return {hashIndex:hashIndex, htmlEntity:htmlEntity}
end hashFunction
on hexToDec(hexString)
-- Hexadecimal-to-decimal string converter
-- Adapted from Nigel Garvey's hex-to-dec converter (http://macscripter.net/viewtopic.php?pid=71052)
set refString to "0123456789abcdef"
set tid to AppleScript's text item delimiters
try
tell hexString
if its class is not equal to text then error
if it = "" then return ""
end tell
set decVal to 0
ignoring case
repeat with i in (get hexString's characters)
set AppleScript's text item delimiters to i's contents
tell refString's text item 1's length
if it = 16 then error
set decVal to decVal * 16 + it
end tell
end repeat
end ignoring
set decimalString to decVal as text
on error
set AppleScript's text item delimiters to tid
error
end try
set AppleScript's text item delimiters to tid
return decimalString
end hexToDec
on reduceConsecutiveWhitespaces(theText)
set tid to AppleScript's text item delimiters
try
set AppleScript's text item delimiters to {tab, return, linefeed}
tell theText's text items
set AppleScript's text item delimiters to space
set theText to it as text
end tell
repeat
set AppleScript's text item delimiters to space & space
tell theText's text items
if length = 1 then exit repeat
set AppleScript's text item delimiters to space
set theText to it as text
end tell
end repeat
end try
set AppleScript's text item delimiters to tid
return theText
end reduceConsecutiveWhitespaces
end script
-- Wrap the code in a try block to capture any errors
set tid to AppleScript's text item delimiters
try
-- Process the handler argument
try
tell (handlerArgument & {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
if length is not equal to 3 then error
set {htmlString, condenseWhitespaces, decodeAngleBracketsPerHtml5} to {its htmlString, its condenseWhitespaces, its decodeAngleBracketsPerHtml5}
end tell
on error
error "The handler argument must be a record with the following required property label:" & return & return & tab & "htmlString" & return & return & "and the following optional property labels:" & return & return & tab & "condenseWhitespaces" & return & tab & "decodeAngleBracketsPerHtml5"
end try
if htmlString's class is not equal to text then error "The input argument htmlString is not a text string."
if condenseWhitespaces's class is not equal to boolean then error "The input argument condenseWhitespaces must be either true or false."
if decodeAngleBracketsPerHtml5's class is not equal to boolean then error "The input argument decodeAngleBracketsPerHtml5 must be either true or false."
-- Tailor the hashed html entities list to conform to the decodeAngleBracketsPerHtml5 setting
set {i1, i2} to {util's hashFunction("lang")'s hashIndex, util's hashFunction("rang")'s hashIndex}
tell (3 + (decodeAngleBracketsPerHtml5 as integer)) to set {util's hashedHtmlEntities's item i1's item 2, util's hashedHtmlEntities's item i2's item 2} to {util's hashedHtmlEntities's item i1's item it, util's hashedHtmlEntities's item i2's item it}
-- Split the string at each "&" character
set AppleScript's text item delimiters to "&"
tell (get htmlString's text items)
-- Handle the case of an html string without "&" characters
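if length < 2 then
-- Restore the text item delimiters and honor condenseWhitespaces before returning early
set AppleScript's text item delimiters to tid
if condenseWhitespaces then return util's reduceConsecutiveWhitespaces(htmlString)
return htmlString
end if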
-- Begin the decoded string with the text preceding the first "&" character
set {decodedSubstrings, ampersandPrefixedSubstrings} to {{item 1}, rest}
end tell
-- Process the &-prefixed substrings
repeat with currAmpersandPrefixedSubstring in ampersandPrefixedSubstrings
try
-- Split the current &-prefixed substring at each ";" character
set AppleScript's text item delimiters to ";"
tell (get currAmpersandPrefixedSubstring's text items)
-- Handle the case of an &-prefixed substring without ";" characters
if length < 2 then error
-- Get the text preceding the first ";" character, which is the html entity candidate
set {htmlEntityCandidate, semicolonPrefixedSubstrings} to {item 1, rest}
end tell
-- Test if the current html entity candidate is valid
if htmlEntityCandidate starts with "#x" then
-- Try processing the current html entity candidate as a hexadecimal numeric entity, and get its equivalent decimal value
set decimalString to util's hexToDec(htmlEntityCandidate's text 3 thru -1)
else if htmlEntityCandidate starts with "#" then
-- Else try processing the current html entity candidate as a decimal numeric entity, and get its decimal value
tell htmlEntityCandidate's text 2 thru -1
ignoring white space
-- Flag as invalid a decimal numeric entity that starts or ends with a whitespace character
if (text 1 = "") or (text -1 = "") then error
end ignoring
set decimalString to it
end tell
else
-- Else try processing the current html entity candidate as a character entity, and get its corresponding decimal value from the hashed html entities list
tell htmlEntityCandidate's length to if (it < 2) or (it > 8) then error -- flags as invalid a character entity if its alphanumeric component's string length is invalid
set hashedHtmlEntity to util's hashFunction(htmlEntityCandidate)'s htmlEntity
considering case
-- Confirm that the hash function returned the proper html entity
if htmlEntityCandidate is not equal to hashedHtmlEntity's item 1 then error
end considering
set decimalString to hashedHtmlEntity's item 2
end if
-- Replace the html entity at the start of the current &-prefixed substring with its underlying Unicode character, and append the modified substring to the decoded string; if the replacement fails, the error will leave the current substring in its original form
set end of decodedSubstrings to (character id decimalString) & (semicolonPrefixedSubstrings as text)
on error
-- If any error is encountered, thus signifying that the current &-prefixed substring does not start with a valid html entity, leave the current substring in its original form
set end of decodedSubstrings to "&" & currAmpersandPrefixedSubstring
end try
end repeat
-- Transform the list of substrings into a single decoded text string
set AppleScript's text item delimiters to ""
set decodedString to decodedSubstrings as text
-- Condense consecutive whitespace characters if specified in the input argument condenseWhitespaces
if condenseWhitespaces then set decodedString to util's reduceConsecutiveWhitespaces(decodedString)
on error m number n
set AppleScript's text item delimiters to tid
if n = -128 then error number -128
if n ≠ -2700 then set m to "(" & n & ") " & m
error ("Problem with handler decodeHtml:" & return & return & m) number n
end try
set AppleScript's text item delimiters to tid
-- Return the results
return decodedString
end decodeHtml
[Edit note: The entire post was just resubmitted without changes to take advantage of the fact that the MacScripter website now renders Unicode characters properly. Previously garbled areas of text should now display properly.]