NOTE: In the discussion section only (not the handler code), an ASCII 0 null character has been inserted between "&" and the remaining characters of HTML entities to prevent the HTML entities from being decoded into their original Unicode characters by the web browser. This won't affect the "Open this Scriptlet in your editor" sections, whose Applescript code may be copied and used as is.
The current handler is an HTML decoder written in AppleScript. It converts HTML entities in an HTML document to their underlying Unicode characters. HTML entities may be in the form of character reference entities such as &copy;, decimal numeric character entities such as &#169;, and hexadecimal numeric character entities such as &#xA9;, all representing the copyright character © (Unicode code point 169) in the current example. (Note that the current handler is not a URL decoder, with which HTML decoders are sometimes confused. URL decoders handle percent-encoded strings like http://ascii.cl?parameter=%22Click+on+%27URL+Decode%27%21%22.)
The handler can decode the following:
- Any valid decimal or hexadecimal numeric character entity
- All 252 character reference entities in the HTML 4 entity set
- The five predefined XML entities &quot;, &amp;, &apos;, &lt;, and &gt;, representing the double quotation mark, ampersand, apostrophe, less-than sign, and greater-than sign (Unicode code points 34, 38, 39, 60, and 62, respectively), which apart from &apos; are a subset of the HTML 4 entity set
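As a rough illustration of how the two numeric forms resolve to the same character, here is a minimal sketch that is separate from the handler itself. The helper names hexDigitValue and hexToInteger are made up for this illustration; the only built-in feature it relies on is the character id specifier.

```applescript
-- Illustrative only: hexDigitValue and hexToInteger are hypothetical helper names, not part of the handler
on hexDigitValue(ch)
	set n to id of ch
	if n ≥ 48 and n ≤ 57 then return n - 48 -- "0" to "9"
	if n ≥ 65 and n ≤ 70 then return n - 55 -- "A" to "F"
	if n ≥ 97 and n ≤ 102 then return n - 87 -- "a" to "f"
	error "Not a hexadecimal digit: " & ch
end hexDigitValue

on hexToInteger(hexText)
	if hexText is "" then error "Empty hexadecimal string"
	set total to 0
	repeat with ch in (get characters of hexText)
		set total to total * 16 + hexDigitValue(contents of ch)
	end repeat
	return total
end hexToInteger

set fromDecimal to character id 169 -- payload of the decimal form, 169
set fromHex to character id (hexToInteger("A9")) -- payload of the hexadecimal form, A9
fromDecimal is fromHex --> true (both are the copyright character)
```

A character reference entity needs one extra step, a lookup from its name to the code point, before character id can be applied.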
The handler can process input text strings up to at least 1,000,000 characters in length and containing up to at least 8000 HTML entities (the limits of testing thus far). It takes advantage of several features to maximize execution speed: Applescript's text item delimiters for searching and replacing, the reconstruction of characters from their Unicode code points via the character id ... specifier, and hashed access to a master property list of character reference entities.
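The idea behind the hashed lookup can be sketched with a toy table. This is not the gperf-generated function used by the handler: the hash below (sum of the key's code points modulo 13) and the names toyTable and toyLookup are assumptions, chosen only so that five sample entities land in distinct slots.

```applescript
-- Toy table: a slot holds either "" or {entity name, code point}
-- Slot index = (sum of the name's code points) mod 13 + 1 (an assumed hash, for illustration only)
property toyTable : {"", {"copy", 169}, {"quot", 34}, {"lt", 60}, "", "", {"amp", 38}, "", "", "", "", {"gt", 62}, ""}

on toyLookup(theKey)
	set idSum to 0
	-- "id of" returns an integer for one character and a list for several, hence the coercion
	repeat with anId in ((id of theKey) as list)
		set idSum to idSum + (contents of anId)
	end repeat
	set slot to item ((idSum mod 13) + 1) of my toyTable
	-- An empty slot, or a slot holding a different name, means the key is not a known entity
	if slot is "" then return missing value
	considering case
		if (item 1 of slot) is not theKey then return missing value
	end considering
	return character id (item 2 of slot)
end toyLookup

toyLookup("copy") --> "©"
toyLookup("COPY") --> missing value (hashes to the slot for "lt", which fails the name check)
```

The point of the design is that a lookup costs one index computation and one comparison, however many entities the table holds. The case-sensitive name check is what rejects keys that merely happen to hash to an occupied slot.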
The handler takes an Applescript record as its input argument with one required and two optional boolean properties:
Required property:
htmlString - the text string to be decoded
Optional boolean properties:
condenseWhitespaces
true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
decodeAngleBracketsPerHtml5
true (the default value if the property is omitted) - decodes the character reference entities &lang; and &rang; as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
false - decodes &lang; and &rang; as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)
The handler's return value is the decoded text string.
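The optional properties rely on AppleScript's record concatenation, in which a property in the left-hand record takes precedence over a same-named property in the right-hand record. A minimal sketch of that defaulting step (the handler name showSettings is hypothetical and exists only for this illustration):

```applescript
on showSettings(handlerArgument)
	-- Properties supplied by the caller win; the right-hand record only fills in what is missing
	set merged to handlerArgument & {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true}
	return {condenseWhitespaces of merged, decodeAngleBracketsPerHtml5 of merged}
end showSettings

showSettings({htmlString:"x"}) --> {false, true}
showSettings({htmlString:"x", condenseWhitespaces:true}) --> {true, true}
```

The handler additionally checks that the merged record has exactly three properties, which is how it rejects a missing htmlString or an unrecognized property label.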
Several HTML decoding solutions are available on the Mac OS X platform, including Cocoa's NSAttributedString class used in association with the NSHTMLTextDocumentType attribute, the decode_entities function of Perl's HTML::Entities module, the unescape function of Python's HTMLParser module, and PHP's html_entity_decode function. The current handler offers the following potential advantages:
(1) Offers more granular control over whitespace handling than the alternative solutions (to my knowledge, the Cocoa solution always implements HTML 5 behavior and condenses consecutive whitespace characters)
(2) Does not strip HTML tags, in contrast with the Cocoa solution which strips HTML tags
(3) For input text strings with fewer than about 250 HTML entities to be decoded:
- Executes faster than the Perl, Python, and PHP solutions for input text strings smaller than about 100,000 characters in length
- Executes at speeds similar to the Perl, Python, and PHP solutions for input text strings greater than about 100,000 characters in length
(4) For input text strings with fewer than about 25 HTML entities to be decoded:
- Executes faster than the Cocoa solution for input text strings smaller than about 10,000 characters in length
- Executes at speeds similar to the Cocoa solution for input text strings greater than about 10,000 characters in length
The following are known limitations of the current handler:
(1) Does not decode incompletely formed HTML entities, specifically those lacking the trailing semicolon character (this mimics the behavior of the PHP decoder)
(2) Does not recognize the over 2000 new named character references in the HTML 5 standard, which represent a superset of the HTML 4 entity set and which seem thus far at least to be largely unimplemented on web pages (the hashed property list required to handle this many items would exceed Applescript's list size limit; NOTE: the current handler does recognize the newer entities - in fact, all HTML entities - if they are coded in decimal or hexadecimal numeric form)
(3) Executes more slowly than the alternative solutions for input text strings with more than about 250 HTML entities to be decoded (for example, execution times to decode a Wikipedia article ~900,000 characters in length and containing ~4550 HTML entities: Perl ~0.5 sec, Cocoa ~1.5 sec, current Applescript handler ~2.5 sec)
(4) Slows down significantly if the condenseWhitespaces input argument is set to true and the input text string contains a substantial number of long consecutive sequences of whitespace characters
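Limitation (4) follows from the way consecutive spaces are condensed: each pass replaces every pair of spaces with a single space, so a run of n spaces needs roughly log2(n) passes, and each pass splits and rejoins the whole string. A stripped-down sketch of that loop, handling spaces only (the handler name condenseSpaces and the pass counter are additions for this illustration):

```applescript
on condenseSpaces(theText)
	set tid to AppleScript's text item delimiters
	set passCount to 0
	repeat
		-- Split at every pair of spaces
		set AppleScript's text item delimiters to space & space
		set pieces to text items of theText
		if (count pieces) < 2 then exit repeat -- no pair of spaces remains
		-- Rejoin with a single space, which halves each run of spaces
		set AppleScript's text item delimiters to space
		set theText to pieces as text
		set passCount to passCount + 1
	end repeat
	set AppleScript's text item delimiters to tid
	return {theText, passCount}
end condenseSpaces

condenseSpaces("a        b") --> {"a b", 3} (a run of 8 spaces takes 3 passes)
```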
HANDLER EXAMPLES HIGHLIGHTING ALTERNATIVE HTML ENTITY FORMS AND INPUT ARGUMENT SETTINGS:
Applescript:
set leftAngleBracket to "&" & "lang;"
set rightAngleBracket to "&" & "rang;"
set rightArrow to "&" & "#x2794;"
set ellipsis1 to "&" & "hellip;"
set ellipsis2 to "&" & "#8230;"
set ellipsis3 to "&" & "#x2026;"
set spades to "&" & "#9824;"
set hearts to "&" & "hearts;"
set clubs to "&" & "clubs;"
set diamonds to "&" & "#x2666;"
set str to leftAngleBracket & "Suits of a deck of cards" & rightAngleBracket & " " & rightArrow & tab & tab & tab & space & space & space & ellipsis1 & " " & spades & " " & ellipsis2 & " " & hearts & " " & ellipsis3 & " " & clubs & " " & ellipsis2 & " " & diamonds & " " & ellipsis1
decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
-- or equivalently and taking advantage of default values for optional properties --
decodeHtml({htmlString:str}) -->
"⟨Suits of a deck of cards⟩ ➔			   … ♠ … ♥ … ♣ … ♦ …"
decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:false}) -->
"〈Suits of a deck of cards〉 ➔			   … ♠ … ♥ … ♣ … ♦ …"
decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}) -->
"⟨Suits of a deck of cards⟩ ➔ … ♠ … ♥ … ♣ … ♦ …"
decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:false}) -->
"〈Suits of a deck of cards〉 ➔ … ♠ … ♥ … ♣ … ♦ …"
(* Notes:
- "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
- ellipsis1, ellipsis2, and ellipsis3 are the character, decimal numeric, and hexadecimal numeric forms of the same ellipsis character (Unicode code point 8230)
- Angle brackets are Unicode code points 10216 and 10217 in the first and third output strings, and Unicode code points 9001 and 9002 in the second and fourth output strings
*)
HANDLER EXAMPLE HIGHLIGHTING THE DETECTION OF INVALID HTML ENTITIES:
Applescript:
set validCharEntity to "&" & "copy;"
set invalidCharEntity to "&" & "copy123;"
set validDecEntity to "&" & "#169;"
set invalidDecEntity to "&" & "#A69;"
set validHexEntity to "&" & "#xA9;"
set invalidHexEntity to "&" & "#xZ9;"
set str to "" & ¬
"Valid character entity: " & validCharEntity & return & ¬
"Nonexistent character entity and thus not decoded: " & invalidCharEntity & return & ¬
"Valid decimal numeric entity: " & validDecEntity & return & ¬
"Invalid decimal numeric entity and thus not decoded: " & invalidDecEntity & return & ¬
"Valid hexadecimal numeric entity: " & validHexEntity & return & ¬
"Invalid hexadecimal numeric entity and thus not decoded: " & invalidHexEntity
decodeHtml({htmlString:str, condenseWhitespaces:false}) -->
"Valid character entity: ©
Nonexistent character entity and thus not decoded: &copy123;
Valid decimal numeric entity: ©
Invalid decimal numeric entity and thus not decoded: &#A69;
Valid hexadecimal numeric entity: ©
Invalid hexadecimal numeric entity and thus not decoded: &#xZ9;"
(* Note:
- "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
*)
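The "leave it as is when invalid" behavior shown above can be sketched in miniature. The following is a simplified stand-in that handles decimal numeric entities only, not the handler's actual code; the name decodeDecimalEntities is hypothetical. It splits the text at each "&", treats the text up to the first ";" as the entity candidate, and restores the original fragment whenever any step raises an error.

```applescript
on decodeDecimalEntities(theText)
	if theText is "" then return ""
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "&"
	set pieces to text items of theText
	set decoded to {item 1 of pieces} -- the text before the first "&" needs no decoding
	repeat with pieceRef in (rest of pieces)
		set fragment to contents of pieceRef
		try
			set AppleScript's text item delimiters to ";"
			set parts to text items of fragment
			if (count parts) < 2 then error -- no terminating semicolon
			set candidate to item 1 of parts
			if candidate does not start with "#" then error -- not a numeric entity
			set codePoint to (text 2 thru -1 of candidate) as integer -- fails for "A69"
			-- Joining the remaining parts while the delimiter is ";" restores any later semicolons
			set end of decoded to (character id codePoint) & ((rest of parts) as text)
		on error
			-- Any failure means the fragment did not start with a valid entity
			set end of decoded to "&" & fragment
		end try
	end repeat
	set AppleScript's text item delimiters to ""
	set decodedText to decoded as text
	set AppleScript's text item delimiters to tid
	return decodedText
end decodeDecimalEntities

decodeDecimalEntities("A: &#169; B: &#A69; C: &copy") --> "A: © B: &#A69; C: &copy"
```

Routing every failure through one on error branch keeps the validity checks short: a missing semicolon, a non-numeric payload, and an out-of-range code point all fall through to the same fallback.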
HANDLER:
Applescript:
on decodeHtml(handlerArgument)
(*
- This handler decodes character, decimal numeric, and hexadecimal numeric html entities in an html string
- It takes an Applescript record as its input argument with one required and two optional boolean properties:
Required property:
htmlString - the text string to be decoded
Optional boolean properties:
condenseWhitespaces
true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
decodeAngleBracketsPerHtml5
true (the default value if the property is omitted) - decodes the character reference entities "lang" and "rang" as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
false - decodes "lang" and "rang" as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)
- It returns the decoded html string
*)
-- Utility script
script util
-- Comprehensive character-entity-based hashed list of HTML-4/XHTML entities, each sublist containing a given html entity's character and decimal numeric forms in positions 1 and 2, respectively
-- An entity's index in this list is generated by the hashFunction handler with the entity's character form as the handler argument
-- The decimal numeric forms for "lang" and "rang" are initialized to the null value and will be set during runtime to either their HTML 4 ("9001", "9002") or HTML 5 ("10216", "10217") values depending on the value of the input argument decodeAngleBracketsPerHtml5
property hashedHtmlEntities : {"", "", "", "", "", "", "", "", {"and", "8743"}, "", "", "", "", {"int", "8747"}, "", "", "", "", {"Rho", "929"}, "", "", "", "", "", {"iota", "953"}, "", "", "", {"psi", "968"}, {"prod", "8719"}, "", "", "", {"not", "172"}, {"prop", "8733"}, "", "", "", {"phi", "966"}, {"sdot", "8901"}, {"theta", "952"}, {"Scaron", "352"}, "", {"amp", "38"}, {"ensp", "8194"}, {"Theta", "920"}, {"there4", "8756"}, "", "", {"isin", "8712"}, "", {"thinsp", "8201"}, "", "", "", {"omega", "969"}, {"scaron", "353"}, "", "", "", {"trade", "8482"}, "", "", {"Chi", "935"}, "", {"thorn", "254"}, "", "", {"sup", "8835"}, {"emsp", "8195"}, {"prime", "8242"}, "", "", "", {"sup1", "185"}, {"image", "8465"}, "", "", "", {"supe", "8839"}, {"pound", "163"}, "", "", {"chi", "967"}, {"sup3", "179"}, {"notin", "8713"}, "", "", "", "", {"kappa", "954"}, "", "", {"eta", "951"}, "", {"Kappa", "922"}, {"otilde", "245"}, "", {"cup", "8746"}, {"sup2", "178"}, "", {"atilde", "227"}, {"Mu", "924"}, {"rho", "961"}, {"nbsp", "160"}, {"acute", "180"}, "", "", "", "", "", {"Ntilde", "209"}, {"or", "8744"}, {"loz", "9674"}, "", {"ocirc", "244"}, {"otimes", "8855"}, {"Nu", "925"}, "", "", {"acirc", "226"}, {"ntilde", "241"}, "", {"cap", "8745"}, "", {"icirc", "238"}, "", {"nu", "957"}, "", "", {"ecirc", "234"}, {"oacute", "243"}, "", {"Psi", "936"}, {"sube", "8838"}, {"asymp", "8776"}, {"aacute", "225"}, "", "", "", "", {"iacute", "237"}, "", {"Phi", "934"}, {"euro", "8364"}, {"exist", "8707"}, {"eacute", "233"}, "", "", {"ordm", "186"}, {"alpha", "945"}, {"Yacute", "221"}, "", {"Zeta", "918"}, {"nsub", "8836"}, "", {"Ccedil", "199"}, {"omicron", "959"}, {"zeta", "950"}, {"part", "8706"}, {"nabla", "8711"}, "", {"lt", "60"}, {"thetasym", "977"}, {"para", "182"}, {"Omega", "937"}, {"rsaquo", "8250"}, "", {"ordf", "170"}, "", {"oline", "8254"}, {"lsaquo", "8249"}, "", {"Eta", "919"}, "", {"Prime", "8243"}, {"ccedil", "231"}, "", {"sub", "8834"}, {"copy", "169"}, {"ucirc", "251"}, 
{"lowast", "8727"}, {"gt", "62"}, "", "", {"frac14", "188"}, {"ne", "8800"}, "", "", "", "", {"iquest", "191"}, "", {"tau", "964"}, {"Iota", "921"}, {"frac34", "190"}, {"uacute", "250"}, "", {"Tau", "932"}, {"cong", "8773"}, {"Gamma", "915"}, {"Lambda", "923"}, "", "", "", "", {"Otilde", "213"}, "", {"ETH", "208"}, {"infin", "8734"}, {"Ecirc", "202"}, "", "", {"beta", "946"}, "", {"Ucirc", "219"}, {"brvbar", "166"}, "", {"sect", "167"}, "", {"frac12", "189"}, {"curren", "164"}, "", {"cent", "162"}, "", {"Ocirc", "212"}, {"Eacute", "201"}, {"mu", "956"}, "", "", "", {"Uacute", "218"}, "", "", "", "", "", {"Xi", "926"}, {"ang", "8736"}, "", "", {"Oacute", "211"}, {"pi", "960"}, "", {"darr", "8595"}, {"equiv", "8801"}, {"yacute", "253"}, {"apos", "39"}, {"perp", "8869"}, "", "", "", "", "", {"delta", "948"}, {"radic", "8730"}, {"le", "8804"}, {"quot", "34"}, "", {"ouml", "246"}, {"crarr", "8629"}, "", {"ni", "8715"}, {"shy", "173"}, {"auml", "228"}, "", "", {"Omicron", "927"}, "", {"iuml", "239"}, {"aring", "229"}, {"Atilde", "195"}, "", "", {"euml", "235"}, {"diams", "9830"}, {"ge", "8805"}, "", "", {"Yuml", "376"}, {"empty", "8709"}, {"divide", "247"}, {"xi", "958"}, {"uml", "168"}, {"spades", "9824"}, {"clubs", "9827"}, {"dagger", "8224"}, "", {"Beta", "914"}, {"bull", "8226"}, {"Acirc", "194"}, {"lambda", "955"}, {"fnof", "402"}, {"sbquo", "8218"}, {"rang", null, "9002", "10217"}, {"Icirc", "206"}, "", {"alefsym", "8501"}, {"bdquo", "8222"}, {"lang", null, "9001", "10216"}, {"rceil", "8969"}, "", "", {"piv", "982"}, {"zwnj", "8204"}, {"lceil", "8968"}, {"Aacute", "193"}, "", {"sum", "8721"}, {"uarr", "8593"}, {"weierp", "8472"}, {"Iacute", "205"}, {"yen", "165"}, {"rsquo", "8217"}, {"Delta", "916"}, {"gamma", "947"}, "", "", {"lsquo", "8216"}, {"dArr", "8659"}, {"Alpha", "913"}, "", "", "", {"uuml", "252"}, "", "", "", {"rdquo", "8221"}, {"macr", "175"}, {"THORN", "222"}, "", "", {"ldquo", "8220"}, {"rarr", "8594"}, "", {"oslash", "248"}, "", {"real", "8476"}, 
{"larr", "8592"}, "", "", "", "", "", "", {"Dagger", "8225"}, {"Pi", "928"}, "", "", {"permil", "8240"}, {"plusmn", "177"}, "", "", {"Euml", "203"}, {"tilde", "732"}, {"middot", "183"}, "", {"oplus", "8853"}, {"Uuml", "220"}, {"Sigma", "931"}, {"ograve", "242"}, "", "", {"frasl", "8260"}, {"szlig", "223"}, {"agrave", "224"}, "", {"lrm", "8206"}, {"Ouml", "214"}, "", {"igrave", "236"}, "", {"raquo", "187"}, {"yuml", "255"}, {"sigma", "963"}, {"egrave", "232"}, {"deg", "176"}, {"laquo", "171"}, "", {"epsilon", "949"}, "", "", "", {"uArr", "8657"}, "", "", "", "", {"cedil", "184"}, {"hearts", "9829"}, "", "", "", {"iexcl", "161"}, {"times", "215"}, {"rfloor", "8971"}, "", "", "", "", {"lfloor", "8970"}, "", "", "", {"micro", "181"}, "", "", "", {"rArr", "8658"}, "", "", "", "", {"lArr", "8656"}, "", "", "", "", {"circ", "710"}, {"minus", "8722"}, "", "", "", "", "", {"ugrave", "249"}, "", "", "", {"upsilon", "965"}, "", "", "", {"Auml", "196"}, {"forall", "8704"}, "", "", "", {"Iuml", "207"}, {"Aring", "197"}, "", "", "", "", {"OElig", "338"}, {"Oslash", "216"}, "", "", "", "", "", "", "", "", "", {"Egrave", "200"}, "", "", {"harr", "8596"}, {"Epsilon", "917"}, {"Ugrave", "217"}, "", "", "", {"Upsilon", "933"}, "", {"reg", "174"}, {"rlm", "8207"}, "", "", {"Ograve", "210"}, "", "", {"oelig", "339"}, {"hellip", "8230"}, "", "", "", {"aelig", "230"}, {"ndash", "8211"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"sim", "8764"}, "", "", "", "", {"zwj", "8205"}, "", "", "", "", "", "", {"AElig", "198"}, "", "", {"eth", "240"}, "", {"sigmaf", "962"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"hArr", "8660"}, "", {"Agrave", "192"}, "", "", "", "", {"Igrave", "204"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"mdash", "8212"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"upsih", "978"}}
-- Terms used by the hashFunction handler in generating a character entity's index in the hashedHtmlEntities list
property hashFunctionTerms : {739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 10, 35, 20, 0, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 175, 135, 30, 60, 95, 5, 0, 5, 180, 739, 15, 5, 0, 15, 110, 110, 739, 5, 5, 5, 100, 739, 739, 0, 20, 0, 739, 739, 739, 739, 739, 739, 5, 60, 50, 0, 15, 144, 115, 215, 10, 225, 10, 95, 125, 25, 0, 5, 218, 90, 20, 0, 65, 35, 55, 45, 115, 5, 15, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739}
-- Utility handlers
on hashFunction(theKey)
-- Hash function that takes as input a character reference entity and returns the hashedHtmlEntities list index and value for that key (or null values if the input key does not match a list item)
-- The input key is the alphanumeric content of a character reference entity, e.g., "amp" for the ampersand character "&"
-- The hash function was obtained by running the GNU gperf perfect hash function generator on the complete list of HTML-4/XHTML character reference entities and transforming the C code into equivalent Applescript code
-- The hash function uses the hashFunctionTerms numeric list in the construction of the list index from the key
try
tell (get theKey's id)
set keyLength to length
set hashIndex to (my hashFunctionTerms's item ((item keyLength) + 1)) + keyLength + 1
if keyLength > 0 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 1) + 1))
if keyLength > 1 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 2) + 2))
if keyLength > 2 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 3) + 1))
if keyLength > 4 then
set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 5) + 1))
end if
end if
end if
end if
end tell
set htmlEntity to my hashedHtmlEntities's item hashIndex
on error
set {hashIndex, htmlEntity} to {null, null}
end try
return {hashIndex:hashIndex, htmlEntity:htmlEntity}
end hashFunction
on hexToDec(hexString)
-- Hexadecimal-to-decimal string converter
-- Adapted from Nigel Garvey's hex-to-dec converter (http://macscripter.net/viewtopic.php?pid=71052)
set refString to "0123456789abcdef"
set tid to AppleScript's text item delimiters
try
tell hexString
if its class is not equal to text then error
if it = "" then return ""
end tell
set decVal to 0
ignoring case
repeat with i in (get hexString's characters)
set AppleScript's text item delimiters to i's contents
tell refString's text item 1's length
if it = 16 then error
set decVal to decVal * 16 + it
end tell
end repeat
end ignoring
set decimalString to decVal as text
on error
set AppleScript's text item delimiters to tid
error
end try
set AppleScript's text item delimiters to tid
return decimalString
end hexToDec
on reduceConsecutiveWhitespaces(theText)
set tid to AppleScript's text item delimiters
try
set AppleScript's text item delimiters to {tab, return, linefeed}
tell theText's text items
set AppleScript's text item delimiters to space
set theText to it as text
end tell
repeat
set AppleScript's text item delimiters to space & space
tell theText's text items
if length = 1 then exit repeat
set AppleScript's text item delimiters to space
set theText to it as text
end tell
end repeat
end try
set AppleScript's text item delimiters to tid
return theText
end reduceConsecutiveWhitespaces
end script
-- Wrap the code in a try block to capture any errors
set tid to AppleScript's text item delimiters
try
-- Process the handler argument
try
tell (handlerArgument & {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
if length is not equal to 3 then error
set {htmlString, condenseWhitespaces, decodeAngleBracketsPerHtml5} to {its htmlString, its condenseWhitespaces, its decodeAngleBracketsPerHtml5}
end tell
on error
error "The handler argument must be a record with the following required property label:" & return & return & tab & "htmlString" & return & return & "and the following optional property labels:" & return & return & tab & "condenseWhitespaces" & return & tab & "decodeAngleBracketsPerHtml5"
end try
if htmlString's class is not equal to text then error "The input argument htmlString is not a text string."
if condenseWhitespaces's class is not equal to boolean then error "The input argument condenseWhitespaces must be either true or false."
if decodeAngleBracketsPerHtml5's class is not equal to boolean then error "The input argument decodeAngleBracketsPerHtml5 must be either true or false."
-- Tailor the hashed html entities list to conform to the decodeAngleBracketsPerHtml5 setting
set {i1, i2} to {util's hashFunction("lang")'s hashIndex, util's hashFunction("rang")'s hashIndex}
tell (3 + (decodeAngleBracketsPerHtml5 as integer)) to set {util's hashedHtmlEntities's item i1's item 2, util's hashedHtmlEntities's item i2's item 2} to {util's hashedHtmlEntities's item i1's item it, util's hashedHtmlEntities's item i2's item it}
-- Split the string at each "&" character
set AppleScript's text item delimiters to "&"
tell (get htmlString's text items)
-- Handle the case of an html string without "&" characters
if length < 2 then return htmlString
-- Begin the decoded string with the text preceding the first "&" character
set {decodedSubstrings, ampersandPrefixedSubstrings} to {{item 1}, rest}
end tell
-- Process the &-prefixed substrings
repeat with currAmpersandPrefixedSubstring in ampersandPrefixedSubstrings
try
-- Split the current &-prefixed substring at each ";" character
set AppleScript's text item delimiters to ";"
tell (get currAmpersandPrefixedSubstring's text items)
-- Handle the case of an &-prefixed substring without ";" characters
if length < 2 then error
-- Get the text preceding the first ";" character, which is the html entity candidate
set {htmlEntityCandidate, semicolonPrefixedSubstrings} to {item 1, rest}
end tell
-- Test if the current html entity candidate is valid
if htmlEntityCandidate starts with "#x" then
-- Try processing the current html entity candidate as a hexadecimal numeric entity, and get its equivalent decimal value
set decimalString to util's hexToDec(htmlEntityCandidate's text 3 thru -1)
else if htmlEntityCandidate starts with "#" then
-- Else try processing the current html entity candidate as a decimal numeric entity, and get its decimal value
tell htmlEntityCandidate's text 2 thru -1
ignoring white space
-- Flag as invalid a decimal numeric entity that starts or ends with a whitespace character
if (text 1 = "") or (text -1 = "") then error
end ignoring
set decimalString to it
end tell
else
-- Else try processing the current html entity candidate as a character entity, and get its corresponding decimal value from the hashed html entities list
tell htmlEntityCandidate's length to if (it < 2) or (it > 8) then error -- flags as invalid a character entity if its alphanumeric component's string length is invalid
set hashedHtmlEntity to util's hashFunction(htmlEntityCandidate)'s htmlEntity
considering case
-- Confirm that the hash function returned the proper html entity
if htmlEntityCandidate is not equal to hashedHtmlEntity's item 1 then error
end considering
set decimalString to hashedHtmlEntity's item 2
end if
-- Replace the html entity at the start of the current &-prefixed substring with its underlying Unicode character, and append the modified substring to the decoded string; if the replacement fails, the error will leave the current substring in its original form
set end of decodedSubstrings to (character id decimalString) & (semicolonPrefixedSubstrings as text)
on error
-- If any error is encountered, thus signifying that the current &-prefixed substring does not start with a valid html entity, leave the current substring in its original form
set end of decodedSubstrings to "&" & currAmpersandPrefixedSubstring
end try
end repeat
-- Transform the list of substrings into a single decoded text string
set AppleScript's text item delimiters to ""
set decodedString to decodedSubstrings as text
-- Condense consecutive whitespace characters if specified in the input argument condenseWhitespaces
if condenseWhitespaces then set decodedString to util's reduceConsecutiveWhitespaces(decodedString)
on error m number n
set AppleScript's text item delimiters to tid
if n = -128 then error number -128
if n ≠ -2700 then set m to "(" & n & ") " & m
error ("Problem with handler decodeHtml:" & return & return & m) number n
end try
set AppleScript's text item delimiters to tid
-- Return the results
return decodedString
end decodeHtml
[Edit note: The entire post was just resubmitted without changes to take advantage of the fact that the MacScripter website now renders Unicode characters properly. Previously garbled areas of text should now display properly.]
Last edited by bmose (2018-03-22 06:59:55 am)
Offline
Wow. That looks like a pretty comprehensive vanilla solution!
It's somewhat easier in ASObjC, which leaves all the decoding effort to methods built into Mac OS's Foundation framework:
Applescript:
use AppleScript version "2.4" -- Mac OS 10.10 (Yosemite) or later
use framework "Foundation"
on decodeHtml(handlerArgument)
try
-- Set variables to the arguments, using default values for any omitted. (Defaults for space condensation and angle bracket as per HTML5.)
set {htmlString:str, condenseWhitespaces:condenseWhitespaces, decodeAngleBracketsPerHtml5:decodeAngleBracketsPerHtml5} to handlerArgument & {htmlString:missing value, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}
if (str is missing value) then error
on error
error "The handler argument must be a record with the following required property label:" & return & return & tab & "htmlString" & return & return & "and the following optional property labels:" & return & return & tab & "condenseWhitespaces" & return & tab & "decodeAngleBracketsPerHtml5"
end try
set |⌘| to current application
-- Get an NSMutableString version of the input string.
set str to |⌘|'s class "NSMutableString"'s stringWithString:(str)
-- If not condensing white spaces, replace them with HTML equivalents.
if (not condenseWhitespaces) then
-- (The concatenations shown here are only needed when displaying the script code on a Web site. The entities themselves can be used otherwise.)
tell str to replaceOccurrencesOfString:(space) withString:("&" & "nbsp;") options:(0) range:({0, its |length|()})
tell str to replaceOccurrencesOfString:(tab) withString:("&" & "#9;") options:(0) range:({0, its |length|()})
tell str to replaceOccurrencesOfString:("\\R") withString:("<br />") options:(|⌘|'s NSRegularExpressionSearch) range:({0, its |length|()})
end if
-- Derive an NSData object from the HTML string and an NSAttributedString from that.
set HTMLData to str's dataUsingEncoding:(|⌘|'s NSUTF8StringEncoding)
set attributedStr to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(HTMLData) documentAttributes:(missing value)
-- Read off the decoded string from the NSAttributedString.
set decodedString to attributedStr's |string|()
-- Any angle brackets in the result are HTML5 interpretations. Replace them with the other type if required.
if (not decodeAngleBracketsPerHtml5) then
set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10216) withString:(character id 9001)
set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10217) withString:(character id 9002)
end if
-- Return the final result as AppleScript text.
return decodedString as text
end decodeHtml
NG
Offline
Thanks, Nigel, and wow, what a creative way to preserve whitespaces with the ASObjC decoder! That was one of the problems that prompted me to write an AppleScript solution. The other side-effect of the ASObjC decoder is that it strips away HTML tags. I often use those tags as handles for regular expression searches of downloaded web pages. The only Cocoa solution I could find involves Core Foundation's CFXMLCreateStringByUnescapingEntities function. If that could be bridged to AppleScript, that might be yet another good solution.
Offline
The only Cocoa solution I could find involves Core Foundation's CFXMLCreateStringByUnescapingEntities function. If that could be bridged to Applescript, that might be yet another good solution.
When using xml entities you should use the HTML entities DTD list which can be found here: https://www.w3.org/TR/xhtml-modularizat … r_entities
Offline
Thanks for the link, DJ. Is there any way of executing the CFXMLCreateStringByUnescapingEntities function from an Applescript script directly or indirectly that does not involve a full-fledged Cocoa application?
Offline
The other side-effect of the ASObjC decoder is that it strips away HTML tags. I often use those tags as handles for regular expression searches of downloaded web pages.
I suppose it depends on what your ultimate aim is. In a script I use to get the content of MacScripter thread pages, formatted in a certain way as plain text, I use the tags to identify the sections I want to edit, do the edits, then delete all irrelevant tags and run whatever's left through an NSAttributedString. If your aim's just to convert HTML entities but leave the tags in place, you might get away with entitising (ouch!) the tag brackets too in the script above:
Applescript:
if (keepingTags) then
tell str to replaceOccurrencesOfString:("<") withString:("&" & "lt;") options:(0) range:({0, its |length|()})
tell str to replaceOccurrencesOfString:(">") withString:("&" & "gt;") options:(0) range:({0, its |length|()})
end if
NG
Offline
Thanks for the link, DJ. Is there any way of executing the CFXMLCreateStringByUnescapingEntities function from an Applescript script directly or indirectly that does not involve a full-fledged Cocoa application?
You can wrap it in a framework. There are several open source third-party frameworks with categories on NSString to do what you want -- you could build your own framework from one of those.
Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com
Offline
If your aim's just to convert HTML entities but leave the tags in place, you might get away with entitising (ouch!) the tag brackets too
Thanks for another creative entitization (:lol:) suggestion. I tend to parse as you do: use tags as search handles -> extract the desired text -> decode HTML entities. My focus on preserving HTML tags is more for robustness so that that option is available for some future need.
There are several open source third-party frameworks with categories on NSString
That sounds like a great solution for the current task and also a powerful tool in general. I don't have experience creating frameworks. Could you possibly suggest a specific link that might be particularly helpful for someone like myself climbing up the learning curve?
Last edited by bmose (2017-06-08 11:09:20 pm)
Offline
I don't have experience creating frameworks. Could you possibly suggest a specific link that might be particularly helpful for someone like myself climbing up the learning curve?
If you've used Xcode at all, it's pretty simple. Assuming you have tracked down suitable Objective-C files, you create a new project in Xcode, choose macOS and Cocoa Framework as the template, then add your Objective-C .h and .m files to the project. The only settings you may need to change are for deployment target, and what headers are exposed (Build Phases -> Headers, and make public what you want exposed).
So in theory you can go to something like this <https://github.com/mwaterfall/MWFeedParser>, download it, copy NSString+HTML.h, NSString+HTML.m, GTMNSString+HTML.h and GTMNSString+HTML.m to your project (plus the required copyright attributions), Build (For Profiling), put it in ~/Library/Frameworks and use it like this:
Applescript:
use framework "Foundation"
use framework "NameOfFramework"
use scripting additions
on decodeHtml(handlerArgument)
set str to htmlString of handlerArgument
set str to current application's NSString's stringWithString:str
return str's stringByDecodingHTMLEntities() as string
end decodeHtml
Because the code is a category that extends NSString rather than adding a new class, that's it.
However... that's not a totally good idea. The problem is that when you use that in scripts run from app menus, you're loading the framework into the host app. And it's bad form to add categories like that -- it's probably safe, but there's a small element of risk. I wouldn't distribute it that way. You're better off changing the categories to new classes with your own prefix. It's more work to call them that way, but it's safer. (This is what I do with SMSForder in BridgePlus).
Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com
Offline
Thanks for the link and the very helpful instructions. It opens up so many possibilities. I appreciate your safety advice about categories and will try to get into the habit of making my own classes from the start.
Offline