Friday, July 21, 2017

#1 2017-06-08 05:37:02 am

bmose
Member
From:: Massachusetts
Registered: 2006-01-03
Posts: 195

Native Applescript HTML entity decoder

NOTE: In the following discussion, an ASCII 0 null character has been inserted between "&" and the remaining characters of HTML entities to prevent the HTML entities from being decoded into their original Unicode characters by the web browser. This won't affect the "Open this Scriptlet in your editor" sections, whose Applescript code may be copied and used as is.

The current handler is an HTML decoder written in Applescript. It converts HTML entities in an HTML document to their underlying Unicode characters. HTML entities may be in the form of character reference entities such as &​copy; , decimal numeric character entities such as &​#169; , and hexadecimal numeric character entities such as &​#xA9; , all representing the copyright character © (Unicode code point 169) in the current example. (Note that the current handler is not a URL decoder, with which HTML decoders are sometimes confused. URL decoders handle percent-encoded strings like http://ascii.cl?parameter=%22Click+on+%27URL+Decode%27%21%22.)

The handler can decode the following:
    - Any valid decimal or hexadecimal numeric character entity
    - All 252 character reference entities in the HTML 4 entity set
    - The five predefined XML entities &​quot; , &​amp; , &​apos; , &​lt; , and &​gt; , representing the double quotation mark, ampersand, apostrophe, less-than sign, and greater-than sign (Unicode code points 34, 38, 39, 60, and 62, respectively), which apart from &​apos; are a subset of the HTML 4 entity set

The handler can process input text strings up to at least 1,000,000 characters in length and containing up to at least 8000 HTML entities (the limits of testing thus far). It takes advantage of several features to maximize execution speed: Applescript's text item delimiters for searching and replacing, the reconstruction of characters from their Unicode code points via the character id ... specifier, and hashed access to a master property list of character reference entities.
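
As a rough illustration of two of the techniques named above (text item delimiters for searching and replacing, and the character id specifier), here is a minimal sketch. It is not the handler's own code: replaceText is a hypothetical helper name used only for this example, and the entity is again split at "&" to keep the web browser from decoding it.

```applescript
-- Minimal sketch, not the handler's actual code.
-- replaceText is a hypothetical helper introduced for this illustration.
on replaceText(theText, searchString, replacementString)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to searchString
	set theItems to theText's text items -- split at each occurrence of searchString
	set AppleScript's text item delimiters to replacementString
	set theText to theItems as text -- rejoin the pieces with the replacement
	set AppleScript's text item delimiters to tid -- restore the caller's delimiters
	return theText
end replaceText

set copyrightEntity to "&" & "#169;" -- concatenated only to avoid browser decoding
set copyrightCharacter to character id 169 -- rebuild the character from its code point
replaceText("Copyright " & copyrightEntity & " 2017", copyrightEntity, copyrightCharacter)
--> "Copyright © 2017"
```

The sketch replaces a single known entity; the handler generalizes the same split-and-rejoin idea by splitting at every "&" and then at ";".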

The handler takes an Applescript record as its input argument with one required and two optional boolean properties:
    Required property:
        htmlString - the text string to be decoded
    Optional boolean properties:
        condenseWhitespaces
            true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
            false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
        decodeAngleBracketsPerHtml5
            true (the default value if the property is omitted) - decodes the character reference entities &​lang; and &​rang; as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
            false - decodes &​lang; and &​rang; as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)

The handler's return value is the decoded text string.
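
The optional properties can take default values because concatenating two Applescript records keeps the left-hand record's value for any label the two records share. A minimal sketch of that idiom follows; the variable names callerArgument, defaults, and mergedArgument are made up for the example, and the handler applies the same idea in a single tell statement.

```applescript
-- Sketch of the default-value idiom; variable names are illustrative only.
set callerArgument to {htmlString:"x", condenseWhitespaces:true}
set defaults to {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true}
-- Properties supplied by the caller win; omitted ones fall back to the defaults.
set mergedArgument to callerArgument & defaults
{mergedArgument's condenseWhitespaces, mergedArgument's decodeAngleBracketsPerHtml5, mergedArgument's length}
--> {true, true, 3}
```

A merged record with a length other than 3 means the caller omitted htmlString or passed an unrecognized label, which is the condition the handler checks before reading the three properties.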

Several HTML decoding solutions are available on the Mac OS X platform, including Cocoa's NSAttributedString class used in association with the NSHTMLTextDocumentType attribute, the decode_entities function of Perl's HTML::Entities module, the unescape function of Python's HTMLParser module, and PHP's html_entity_decode function. The current handler offers the following potential advantages:
    (1) Offers more granular control over whitespace handling than the alternative solutions (to my knowledge, the Cocoa solution always implements HTML 5 behavior and condenses consecutive whitespace characters)
    (2) Does not strip HTML tags, in contrast with the Cocoa solution which strips HTML tags
    (3) For input text strings with fewer than about 250 HTML entities to be decoded:
            - Executes faster than the Perl, Python, and PHP solutions for input text strings smaller than about 100,000 characters in length
            - Executes at speeds similar to the Perl, Python, and PHP solutions for input text strings greater than about 100,000 characters in length
    (4) For input text strings with fewer than about 25 HTML entities to be decoded:
            - Executes faster than the Cocoa solution for input text strings smaller than about 10,000 characters in length
            - Executes at speeds similar to the Cocoa solution for input text strings greater than about 10,000 characters in length

The following are known limitations of the current handler:
    (1) Does not decode incompletely formed HTML entities, specifically those lacking the trailing semicolon character (this mimics the behavior of the PHP decoder)
    (2) Does not recognize the over 2000 new named character references in the HTML 5 standard, which represent a superset of the HTML 4 entity set and which seem thus far at least to be largely unimplemented on web pages (the hashed property list required to handle this many items would exceed Applescript's list size limit; NOTE: the current handler does recognize the newer entities - in fact, all HTML entities - if they are coded in decimal or hexadecimal numeric form)
    (3) Executes more slowly than the alternative solutions for input text strings with more than about 250 HTML entities to be decoded (for example, execution times to decode a Wikipedia article ~900,000 characters in length and containing ~4550 HTML entities: Perl ~0.5 sec, Cocoa ~1.5 sec, current Applescript handler ~2.5 sec)
    (4) Slows down significantly if the condenseWhitespaces input argument is set to true and the input text string contains a substantial number of long consecutive sequences of whitespace characters
   
HANDLER EXAMPLES HIGHLIGHTING ALTERNATIVE HTML ENTITY FORMS AND INPUT ARGUMENT SETTINGS:

Applescript:


set leftAngleBracket to "&" & "lang;"
set rightAngleBracket to "&" & "rang;"
set rightArrow to "&" & "#x2794;"
set ellipsis1 to "&" & "hellip;"
set ellipsis2 to "&" & "#8230;"
set ellipsis3 to "&" & "#x2026;"
set spades to "&" & "#9824;"
set hearts to "&" & "hearts;"
set clubs to "&" & "clubs;"
set diamonds to "&" & "#x2666;"

set str to leftAngleBracket & "Suits of a deck of cards" & rightAngleBracket & " " & rightArrow & tab & tab & tab & space & space & space & ellipsis1 & " " & spades & " " & ellipsis2 & " " & hearts & " " & ellipsis3 & " " & clubs & " " & ellipsis2 & " " & diamonds & " " & ellipsis1

decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
-- or equivalently and taking advantage of default values for optional properties --
decodeHtml({htmlString:str}) -->
   "⟨Suits of a deck of cards⟩ ➔             … ♠ … ♥ … ♣ … ♦ …"

decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:false}) -->
   "〈Suits of a deck of cards〉 ➔             … ♠ … ♥ … ♣ … ♦ …"

decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}) -->
   "⟨Suits of a deck of cards⟩ ➔ … ♠ … ♥ … ♣ … ♦ …"

decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:false}) -->
   "〈Suits of a deck of cards〉 ➔ … ♠ … ♥ … ♣ … ♦ …"

(*    Notes:
       - "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
       - ellipsis1, ellipsis2, and ellipsis3 are the character, decimal numeric, and hexadecimal numeric forms of the same ellipsis character (Unicode code point 8230)
       - Angle brackets are Unicode code points 10216 and 10217 in the first and third output strings, and Unicode code points 9001 and 9002 in the second and fourth output strings
*)

HANDLER EXAMPLE HIGHLIGHTING THE DETECTION OF INVALID HTML ENTITIES:

Applescript:


set validCharEntity to "&" & "copy;"
set invalidCharEntity to "&" & "copy123;"
set validDecEntity to "&" & "#169;"
set invalidDecEntity to "&" & "#A69;"
set validHexEntity to "&" & "#xA9;"
set invalidHexEntity to "&" & "#xZ9;"

set str to "" & ¬
   "Valid character entity: " & validCharEntity & return & ¬
   "Nonexistent character entity and thus not decoded: " & invalidCharEntity & return & ¬
   "Valid decimal numeric entity: " & validDecEntity & return & ¬
   "Invalid decimal numeric entity and thus not decoded: " & invalidDecEntity & return & ¬
   "Valid hexadecimal numeric entity: " & validHexEntity & return & ¬
   "Invalid hexadecimal numeric entity and thus not decoded: " & invalidHexEntity

decodeHtml({htmlString:str, condenseWhitespaces:false}) -->
   "Valid character entity: ©
   Nonexistent character entity and thus not decoded: &​copy123;
   Valid decimal numeric entity: ©
   Invalid decimal numeric entity and thus not decoded: &​#A69;
   Valid hexadecimal numeric entity: ©
   Invalid hexadecimal numeric entity and thus not decoded: &​#xZ9;"


(*    Note:
       - "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
*)

HANDLER:

Applescript:


on decodeHtml(handlerArgument)
   (*
   - This handler decodes character, decimal numeric, and hexadecimal numeric html entities in an html string
   - It takes an Applescript record as its input argument with one required and two optional boolean properties:
       Required property:
           htmlString - the text string to be decoded
       Optional boolean properties:
           condenseWhitespaces
               true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
               false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
           decodeAngleBracketsPerHtml5
               true (the default value if the property is omitted) - decodes the character reference entities "lang" and "rang" as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
               false - decodes "lang" and "rang" as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)
   - It returns the decoded html string
   *)

   -- Utility script
   script util
       -- Comprehensive character-entity-based hashed list of HTML-4/XHTML entities, each sublist containing a given html entity's character and decimal numeric forms in positions 1 and 2, respectively
       -- An entity's index in this list is generated by the hashFunction handler with the entity's character form as the handler argument
       -- The decimal numeric forms for "lang" and "rang" are initialized to the null value and will be set during runtime to either their HTML 4 ("9001", "9002") or HTML 5 ("10216", "10217") values depending on the value of the input argument decodeAngleBracketsPerHtml5
       property hashedHtmlEntities : {"", "", "", "", "", "", "", "", {"and", "8743"}, "", "", "", "", {"int", "8747"}, "", "", "", "", {"Rho", "929"}, "", "", "", "", "", {"iota", "953"}, "", "", "", {"psi", "968"}, {"prod", "8719"}, "", "", "", {"not", "172"}, {"prop", "8733"}, "", "", "", {"phi", "966"}, {"sdot", "8901"}, {"theta", "952"}, {"Scaron", "352"}, "", {"amp", "38"}, {"ensp", "8194"}, {"Theta", "920"}, {"there4", "8756"}, "", "", {"isin", "8712"}, "", {"thinsp", "8201"}, "", "", "", {"omega", "969"}, {"scaron", "353"}, "", "", "", {"trade", "8482"}, "", "", {"Chi", "935"}, "", {"thorn", "254"}, "", "", {"sup", "8835"}, {"emsp", "8195"}, {"prime", "8242"}, "", "", "", {"sup1", "185"}, {"image", "8465"}, "", "", "", {"supe", "8839"}, {"pound", "163"}, "", "", {"chi", "967"}, {"sup3", "179"}, {"notin", "8713"}, "", "", "", "", {"kappa", "954"}, "", "", {"eta", "951"}, "", {"Kappa", "922"}, {"otilde", "245"}, "", {"cup", "8746"}, {"sup2", "178"}, "", {"atilde", "227"}, {"Mu", "924"}, {"rho", "961"}, {"nbsp", "160"}, {"acute", "180"}, "", "", "", "", "", {"Ntilde", "209"}, {"or", "8744"}, {"loz", "9674"}, "", {"ocirc", "244"}, {"otimes", "8855"}, {"Nu", "925"}, "", "", {"acirc", "226"}, {"ntilde", "241"}, "", {"cap", "8745"}, "", {"icirc", "238"}, "", {"nu", "957"}, "", "", {"ecirc", "234"}, {"oacute", "243"}, "", {"Psi", "936"}, {"sube", "8838"}, {"asymp", "8776"}, {"aacute", "225"}, "", "", "", "", {"iacute", "237"}, "", {"Phi", "934"}, {"euro", "8364"}, {"exist", "8707"}, {"eacute", "233"}, "", "", {"ordm", "186"}, {"alpha", "945"}, {"Yacute", "221"}, "", {"Zeta", "918"}, {"nsub", "8836"}, "", {"Ccedil", "199"}, {"omicron", "959"}, {"zeta", "950"}, {"part", "8706"}, {"nabla", "8711"}, "", {"lt", "60"}, {"thetasym", "977"}, {"para", "182"}, {"Omega", "937"}, {"rsaquo", "8250"}, "", {"ordf", "170"}, "", {"oline", "8254"}, {"lsaquo", "8249"}, "", {"Eta", "919"}, "", {"Prime", "8243"}, {"ccedil", "231"}, "", {"sub", "8834"}, {"copy", "169"}, {"ucirc", 
"251"}, {"lowast", "8727"}, {"gt", "62"}, "", "", {"frac14", "188"}, {"ne", "8800"}, "", "", "", "", {"iquest", "191"}, "", {"tau", "964"}, {"Iota", "921"}, {"frac34", "190"}, {"uacute", "250"}, "", {"Tau", "932"}, {"cong", "8773"}, {"Gamma", "915"}, {"Lambda", "923"}, "", "", "", "", {"Otilde", "213"}, "", {"ETH", "208"}, {"infin", "8734"}, {"Ecirc", "202"}, "", "", {"beta", "946"}, "", {"Ucirc", "219"}, {"brvbar", "166"}, "", {"sect", "167"}, "", {"frac12", "189"}, {"curren", "164"}, "", {"cent", "162"}, "", {"Ocirc", "212"}, {"Eacute", "201"}, {"mu", "956"}, "", "", "", {"Uacute", "218"}, "", "", "", "", "", {"Xi", "926"}, {"ang", "8736"}, "", "", {"Oacute", "211"}, {"pi", "960"}, "", {"darr", "8595"}, {"equiv", "8801"}, {"yacute", "253"}, {"apos", "39"}, {"perp", "8869"}, "", "", "", "", "", {"delta", "948"}, {"radic", "8730"}, {"le", "8804"}, {"quot", "34"}, "", {"ouml", "246"}, {"crarr", "8629"}, "", {"ni", "8715"}, {"shy", "173"}, {"auml", "228"}, "", "", {"Omicron", "927"}, "", {"iuml", "239"}, {"aring", "229"}, {"Atilde", "195"}, "", "", {"euml", "235"}, {"diams", "9830"}, {"ge", "8805"}, "", "", {"Yuml", "376"}, {"empty", "8709"}, {"divide", "247"}, {"xi", "958"}, {"uml", "168"}, {"spades", "9824"}, {"clubs", "9827"}, {"dagger", "8224"}, "", {"Beta", "914"}, {"bull", "8226"}, {"Acirc", "194"}, {"lambda", "955"}, {"fnof", "402"}, {"sbquo", "8218"}, {"rang", null, "9002", "10217"}, {"Icirc", "206"}, "", {"alefsym", "8501"}, {"bdquo", "8222"}, {"lang", null, "9001", "10216"}, {"rceil", "8969"}, "", "", {"piv", "982"}, {"zwnj", "8204"}, {"lceil", "8968"}, {"Aacute", "193"}, "", {"sum", "8721"}, {"uarr", "8593"}, {"weierp", "8472"}, {"Iacute", "205"}, {"yen", "165"}, {"rsquo", "8217"}, {"Delta", "916"}, {"gamma", "947"}, "", "", {"lsquo", "8216"}, {"dArr", "8659"}, {"Alpha", "913"}, "", "", "", {"uuml", "252"}, "", "", "", {"rdquo", "8221"}, {"macr", "175"}, {"THORN", "222"}, "", "", {"ldquo", "8220"}, {"rarr", "8594"}, "", {"oslash", "248"}, "", {"real", 
"8476"}, {"larr", "8592"}, "", "", "", "", "", "", {"Dagger", "8225"}, {"Pi", "928"}, "", "", {"permil", "8240"}, {"plusmn", "177"}, "", "", {"Euml", "203"}, {"tilde", "732"}, {"middot", "183"}, "", {"oplus", "8853"}, {"Uuml", "220"}, {"Sigma", "931"}, {"ograve", "242"}, "", "", {"frasl", "8260"}, {"szlig", "223"}, {"agrave", "224"}, "", {"lrm", "8206"}, {"Ouml", "214"}, "", {"igrave", "236"}, "", {"raquo", "187"}, {"yuml", "255"}, {"sigma", "963"}, {"egrave", "232"}, {"deg", "176"}, {"laquo", "171"}, "", {"epsilon", "949"}, "", "", "", {"uArr", "8657"}, "", "", "", "", {"cedil", "184"}, {"hearts", "9829"}, "", "", "", {"iexcl", "161"}, {"times", "215"}, {"rfloor", "8971"}, "", "", "", "", {"lfloor", "8970"}, "", "", "", {"micro", "181"}, "", "", "", {"rArr", "8658"}, "", "", "", "", {"lArr", "8656"}, "", "", "", "", {"circ", "710"}, {"minus", "8722"}, "", "", "", "", "", {"ugrave", "249"}, "", "", "", {"upsilon", "965"}, "", "", "", {"Auml", "196"}, {"forall", "8704"}, "", "", "", {"Iuml", "207"}, {"Aring", "197"}, "", "", "", "", {"OElig", "338"}, {"Oslash", "216"}, "", "", "", "", "", "", "", "", "", {"Egrave", "200"}, "", "", {"harr", "8596"}, {"Epsilon", "917"}, {"Ugrave", "217"}, "", "", "", {"Upsilon", "933"}, "", {"reg", "174"}, {"rlm", "8207"}, "", "", {"Ograve", "210"}, "", "", {"oelig", "339"}, {"hellip", "8230"}, "", "", "", {"aelig", "230"}, {"ndash", "8211"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"sim", "8764"}, "", "", "", "", {"zwj", "8205"}, "", "", "", "", "", "", {"AElig", "198"}, "", "", {"eth", "240"}, "", {"sigmaf", "962"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"hArr", "8660"}, "", {"Agrave", "192"}, "", "", "", "", {"Igrave", "204"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"mdash", "8212"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"upsih", "978"}}
       -- Terms used by the hashFunction handler in generating a character entity's index in the hashedHtmlEntities list
       property hashFunctionTerms : {739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 10, 35, 20, 0, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 175, 135, 30, 60, 95, 5, 0, 5, 180, 739, 15, 5, 0, 15, 110, 110, 739, 5, 5, 5, 100, 739, 739, 0, 20, 0, 739, 739, 739, 739, 739, 739, 5, 60, 50, 0, 15, 144, 115, 215, 10, 225, 10, 95, 125, 25, 0, 5, 218, 90, 20, 0, 65, 35, 55, 45, 115, 5, 15, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739}
       -- Utility handlers
       on hashFunction(theKey)
           -- Hash function that takes as input a character reference entity and returns the hashedHtmlEntities list index and value for that key (or null values if the input key does not match a list item)
           -- The input key is the alphanumeric content of a character reference entity, e.g., "amp" for the ampersand character "&"
           -- The hash function was obtained by running the GNU gperf perfect hash function generator on the complete list of HTML-4/XHTML character reference entities and transforming the C code into equivalent Applescript code
           -- The hash function uses the hashFunctionTerms numeric list in the construction of the list index from the key
           try
               tell (get theKey's id)
                   set keyLength to length
                   set hashIndex to (my hashFunctionTerms's item ((item keyLength) + 1)) + keyLength + 1
                   if keyLength > 0 then
                       set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 1) + 1))
                       if keyLength > 1 then
                           set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 2) + 2))
                           if keyLength > 2 then
                               set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 3) + 1))
                               if keyLength > 4 then
                                   set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 5) + 1))
                               end if
                           end if
                       end if
                   end if
               end tell
               set htmlEntity to my hashedHtmlEntities's item hashIndex
           on error
               set {hashIndex, htmlEntity} to {null, null}
           end try
           return {hashIndex:hashIndex, htmlEntity:htmlEntity}
       end hashFunction
       on hexToDec(hexString)
           -- Hexadecimal-to-decimal string converter
           -- Adapted from Nigel Garvey's hex-to-dec converter (http://macscripter.net/viewtopic.php?pid=71052)
           set refString to "0123456789abcdef"
           set tid to AppleScript's text item delimiters
           try
               tell hexString
                   if its class ≠ text then error
                   if it = "" then return ""
               end tell
               set decVal to 0
               ignoring case
                   repeat with i in (get hexString's characters)
                       set AppleScript's text item delimiters to i's contents
                       tell refString's text item 1's length
                           if it = 16 then error
                           set decVal to decVal * 16 + it
                       end tell
                   end repeat
               end ignoring
               set decimalString to decVal as text
           on error
               set AppleScript's text item delimiters to tid
               error
           end try
           set AppleScript's text item delimiters to tid
           return decimalString
       end hexToDec
       on reduceConsecutiveWhitespaces(theText)
           set tid to AppleScript's text item delimiters
           try
               set AppleScript's text item delimiters to {tab, return, linefeed}
               tell theText's text items
                   set AppleScript's text item delimiters to space
                   set theText to it as text
               end tell
               repeat
                   set AppleScript's text item delimiters to space & space
                   tell theText's text items
                       if length = 1 then exit repeat
                       set AppleScript's text item delimiters to space
                       set theText to it as text
                   end tell
               end repeat
           end try
           set AppleScript's text item delimiters to tid
           return theText
       end reduceConsecutiveWhitespaces
   end script
   -- Wrap the code in a try block to capture any errors
   set tid to AppleScript's text item delimiters
   try
       -- Process the handler argument
       try
           tell (handlerArgument & {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
               if length ≠ 3 then error
               set {htmlString, condenseWhitespaces, decodeAngleBracketsPerHtml5} to {its htmlString, its condenseWhitespaces, its decodeAngleBracketsPerHtml5}
           end tell
       on error
           error "The handler argument must be a record with the following required property label:" & return & return & tab & "htmlString" & return & return & "and the following optional property labels:" & return & return & tab & "condenseWhitespaces" & return & tab & "decodeAngleBracketsPerHtml5"
       end try
       if htmlString's class ≠ text then error "The input argument htmlString is not a text string."
       if condenseWhitespaces's class ≠ boolean then error "The input argument condenseWhitespaces must be either true or false."
       if decodeAngleBracketsPerHtml5's class ≠ boolean then error "The input argument decodeAngleBracketsPerHtml5 must be either true or false."
       -- Tailor the hashed html entities list to conform to the decodeAngleBracketsPerHtml5 setting
       set {i1, i2} to {util's hashFunction("lang")'s hashIndex, util's hashFunction("rang")'s hashIndex}
       tell (3 + (decodeAngleBracketsPerHtml5 as integer)) to set {util's hashedHtmlEntities's item i1's item 2, util's hashedHtmlEntities's item i2's item 2} to {util's hashedHtmlEntities's item i1's item it, util's hashedHtmlEntities's item i2's item it}
       -- Split the string at each "&" character
       set AppleScript's text item delimiters to "&"
       tell (get htmlString's text items)
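           -- Handle the case of an html string without "&" characters: restore the text item delimiters before the early return, and still honor condenseWhitespaces
           if length < 2 then
               set AppleScript's text item delimiters to tid
               if condenseWhitespaces then return util's reduceConsecutiveWhitespaces(htmlString)
               return htmlString
           end if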
           -- Begin the decoded string with the text preceding the first "&" character
           set {decodedSubstrings, ampersandPrefixedSubstrings} to {{item 1}, rest}
       end tell
       -- Process the &-prefixed substrings
       repeat with currAmpersandPrefixedSubstring in ampersandPrefixedSubstrings
           try
               -- Split the current &-prefixed substring at each ";" character
               set AppleScript's text item delimiters to ";"
               tell (get currAmpersandPrefixedSubstring's text items)
                   -- Handle the case of an &-prefixed substring without ";" characters
                   if length < 2 then error
                   -- Get the text preceding the first ";" character, which is the html entity candidate
                   set {htmlEntityCandidate, semicolonPrefixedSubstrings} to {item 1, rest}
               end tell
               -- Test if the current html entity candidate is valid
               if htmlEntityCandidate starts with "#x" then
                   -- Try processing the current html entity candidate as a hexadecimal numeric entity, and get its equivalent decimal value
                   set decimalString to util's hexToDec(htmlEntityCandidate's text 3 thru -1)
               else if htmlEntityCandidate starts with "#" then
                   -- Else try processing the current html entity candidate as a decimal numeric entity, and get its decimal value
                   tell htmlEntityCandidate's text 2 thru -1
                       ignoring white space
                           -- Flag as invalid a decimal numeric entity that starts or ends with a whitespace character
                           if (text 1 = "") or (text -1 = "") then error
                       end ignoring
                       set decimalString to it
                   end tell
               else
                   -- Else try processing the current html entity candidate as a character entity, and get its corresponding decimal value from the hashed html entities list
                   tell htmlEntityCandidate's length to if (it < 2) or (it > 8) then error -- flags as invalid a character entity if its alphanumeric component's string length is invalid
                   set hashedHtmlEntity to util's hashFunction(htmlEntityCandidate)'s htmlEntity
                   considering case
                       -- Confirm that the hash function returned the proper html entity
                       if htmlEntityCandidate ≠ hashedHtmlEntity's item 1 then error
                   end considering
                   set decimalString to hashedHtmlEntity's item 2
               end if
               -- Replace the html entity at the start of the current &-prefixed substring with its underlying Unicode character, and append the modified substring to the decoded string; if the replacement fails, the error will leave the current substring in its original form
               set end of decodedSubstrings to (character id decimalString) & (semicolonPrefixedSubstrings as text)
           on error
               -- If any error is encountered, thus signifying that the current &-prefixed substring does not start with a valid html entity, leave the current substring in its original form
               set end of decodedSubstrings to "&" & currAmpersandPrefixedSubstring
           end try
       end repeat
       -- Transform the list of substrings into a single decoded text string
       set AppleScript's text item delimiters to ""
       set decodedString to decodedSubstrings as text
       -- Condense consecutive whitespace characters if specified in the input argument condenseWhitespaces
       if condenseWhitespaces then set decodedString to util's reduceConsecutiveWhitespaces(decodedString)
   on error m number n
       set AppleScript's text item delimiters to tid
       if n = -128 then error number -128
       if n ≠ -2700 then set m to "(" & n & ") " & m
       error ("Problem with handler decodeHtml:" & return & return & m) number n
   end try
   set AppleScript's text item delimiters to tid
   -- Return the results
   return decodedString
end decodeHtml

Last edited by bmose (2017-06-10 07:10:09 am)


Filed under: HTML, html entity, decoder


#2 2017-06-08 08:33:03 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4334

Re: Native Applescript HTML entity decoder

Wow. That looks like a pretty comprehensive vanilla solution!

It's somewhat easier in ASObjC, which leaves all the decoding effort to methods built into Mac OS's Foundation framework:

Applescript:

use AppleScript version "2.4" -- Mac OS 10.10 (Yosemite) or later
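use framework "Foundation"
use framework "AppKit" -- NSAttributedString's initWithHTML:documentAttributes: is an AppKit addition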

on decodeHtml(handlerArgument)
   try
       -- Set variables to the arguments, using default values for any omitted. (Defaults for space condensation and angle bracket as per HTML5.)
       set {htmlString:str, condenseWhitespaces:condenseWhitespaces, decodeAngleBracketsPerHtml5:decodeAngleBracketsPerHtml5} to handlerArgument & {htmlString:missing value, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}
       if (str is missing value) then error
   on error
       error "The handler argument must be a record with the following required property label:" & return & return & tab & "htmlString" & return & return & "and the following optional property labels:" & return & return & tab & "condenseWhitespaces" & return & tab & "decodeAngleBracketsPerHtml5"
   end try
   
   set |⌘| to current application
   -- Get an NSMutableString version of the input string.
   set str to |⌘|'s class "NSMutableString"'s stringWithString:(str)
   -- If not condensing white spaces, replace them with HTML equivalents.
   if (not condenseWhitespaces) then
       -- (The concatenations shown here are only needed when displaying the script code on a Web site. The entities themselves can be used otherwise.)
       tell str to replaceOccurrencesOfString:(space) withString:("&" & "nbsp;") options:(0) range:({0, its |length|()})
       tell str to replaceOccurrencesOfString:(tab) withString:("&" & "#9;") options:(0) range:({0, its |length|()})
       tell str to replaceOccurrencesOfString:("\\R") withString:("<br />") options:(|⌘|'s NSRegularExpressionSearch) range:({0, its |length|()})
   end if
   -- Derive an NSData object from the HTML string and an NSAttributedString from that.
   set HTMLData to str's dataUsingEncoding:(|⌘|'s NSUTF8StringEncoding)
   set attributedStr to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(HTMLData) documentAttributes:(missing value)
   -- Read off the decoded string from the NSAttributedString.
   set decodedString to attributedStr's |string|()
   -- Any angle brackets in the result are HTML5 interpretations. Replace them with the other type if required.
   if (not decodeAngleBracketsPerHtml5) then
       set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10216) withString:(character id 9001)
       set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10217) withString:(character id 9002)
   end if
   
   -- Return the final result as AppleScript text.
   return decodedString as text
end decodeHtml
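
A minimal usage sketch for the handler above, assuming it is defined in the same script. The sample string and the variable names are made up for illustration, and the entities are concatenated only to follow this thread's display convention.

```applescript
-- Assumes the decodeHtml handler above is defined in this same script.
-- Illustrative input: two named entities separated by runs of spaces.
set sampleHtml to "caf" & "&" & "eacute;   " & "&" & "amp;   cr" & "&" & "egrave;me"

-- Only htmlString is required. Omitted properties fall back to the handler's defaults (both true).
set condensedResult to decodeHtml({htmlString:sampleHtml})

-- Ask the handler to preserve white space instead of condensing it.
set preservedResult to decodeHtml({htmlString:sampleHtml, condenseWhitespaces:false})

return {condensedResult, preservedResult}
```

With condenseWhitespaces set to false, the spaces in the result may well come back as non-breaking spaces (character id 160), since that is what the nbsp entity decodes to. Worth checking if the output is compared against ordinary spaces later.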


NG


#3 2017-06-08 08:52:22 am

bmose
Member
From:: Massachusetts
Registered: 2006-01-03
Posts: 195

Re: Native Applescript HTML entity decoder

Thanks, Nigel, and wow, what a creative way to preserve whitespace with the ASObjC decoder! That was one of the problems that prompted me to write an Applescript solution. The other side-effect of the ASObjC decoder is that it strips away HTML tags. I often use those tags as handles for regular expression searches of downloaded web pages. The only Cocoa solution I could find involves Core Foundation's CFXMLCreateStringByUnescapingEntities function. If that could be bridged to Applescript, that might be yet another good solution.


#4 2017-06-08 09:29:18 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2665

Re: Native Applescript HTML entity decoder

bmose wrote:

The only Cocoa solution I could find involves Core Foundation's CFXMLCreateStringByUnescapingEntities function. If that could be bridged to Applescript, that might be yet another good solution.


When using XML entities, you should use the HTML entities DTD list, which can be found here: https://www.w3.org/TR/xhtml-modularizat … r_entities


#5 2017-06-08 11:33:10 am

bmose
Member
From:: Massachusetts
Registered: 2006-01-03
Posts: 195

Re: Native Applescript HTML entity decoder

Thanks for the link, DJ. Is there any way of executing the CFXMLCreateStringByUnescapingEntities function from an Applescript script directly or indirectly that does not involve a full-fledged Cocoa application?


#6 2017-06-08 02:23:27 pm

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4334

Re: Native Applescript HTML entity decoder

bmose wrote:

The other side-effect of the ASObjC decoder is that it strips away HTML tags. I often use those tags as handles for regular expression searches of downloaded web pages.


I suppose it depends on what your ultimate aim is. In a script I use to get the content of MacScripter thread pages, formatted in a certain way as plain text, I use the tags to identify the sections I want to edit, do the edits, then delete all irrelevant tags and run whatever's left through an NSAttributedString. If your aim's just to convert HTML entities but leave the tags in place, you might get away with entitising (ouch!) the tag brackets too in the script above:

Applescript:

if (keepingTags) then
   tell str to replaceOccurrencesOfString:("<") withString:("&" & "lt;") options:(0) range:({0, its |length|()})
   tell str to replaceOccurrencesOfString:(">") withString:("&" & "gt;") options:(0) range:({0, its |length|()})
end if
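
As a rough, self-contained sketch of that idea (not the full handler from post #2), the bracket replacement can be combined with the same NSAttributedString decoding step. The handler name decodeEntitiesKeepingTags and the sample string are hypothetical, and the "AppKit" use line is an assumption about what the HTML initialiser needs.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "AppKit" -- assumed to be needed for NSAttributedString's HTML initialiser
use scripting additions

-- Hypothetical handler name, for illustration only.
on decodeEntitiesKeepingTags(htmlText)
	set |⌘| to current application
	set str to |⌘|'s class "NSMutableString"'s stringWithString:(htmlText)
	-- Entitise the tag brackets first so the HTML importer treats the tags as literal text.
	tell str to replaceOccurrencesOfString:("<") withString:("&" & "lt;") options:(0) range:({0, its |length|()})
	tell str to replaceOccurrencesOfString:(">") withString:("&" & "gt;") options:(0) range:({0, its |length|()})
	-- Decode via NSData and NSAttributedString, as in the script above.
	set HTMLData to str's dataUsingEncoding:(|⌘|'s NSUTF8StringEncoding)
	set attributedStr to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(HTMLData) documentAttributes:(missing value)
	return (attributedStr's |string|()) as text
end decodeEntitiesKeepingTags

-- Illustrative call: the paragraph tags should survive as text while the entity is decoded.
decodeEntitiesKeepingTags("<p>Copyright " & "&" & "copy; 2017</p>")
```

One caveat with this approach: any lt or gt entities already present in the source text decode to the same characters as the entitised tag brackets, so the two can no longer be told apart in the result. White space is also still condensed unless it is handled as in the main script.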


NG


#7 2017-06-08 06:47:25 pm

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 5046

Re: Native Applescript HTML entity decoder

bmose wrote:

Thanks for the link, DJ. Is there any way of executing the CFXMLCreateStringByUnescapingEntities function from an Applescript script directly or indirectly that does not involve a full-fledged Cocoa application?


You can wrap it in a framework. There are several open source third-party frameworks with categories on NSString to do what you want -- you could build your own framework from one of those.


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/


#8 2017-06-09 12:08:24 am

bmose
Member
From:: Massachusetts
Registered: 2006-01-03
Posts: 195

Re: Native Applescript HTML entity decoder

Nigel Garvey wrote:

If your aim's just to convert HTML entities but leave the tags in place, you might get away with entitising (ouch!) the tag brackets too


Thanks for another creative entitization (:lol:) suggestion. I tend to parse as you do: use tags as search handles -> extract the desired text -> decode HTML entities. My focus on preserving HTML tags is more for robustness, so that the option is available for some future need.

Shane Stanley wrote:

There are several open source third-party frameworks with categories on NSString


That sounds like a great solution for the current task and also a powerful tool in general. I don't have experience creating frameworks. Could you possibly suggest a specific link that might be particularly helpful for someone like myself climbing up the learning curve?

Last edited by bmose (2017-06-09 12:09:20 am)


#9 2017-06-09 01:04:50 am

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 5046

Re: Native Applescript HTML entity decoder

bmose wrote:

I don't have experience creating frameworks. Could you possibly suggest a specific link that might be particularly helpful for someone like myself climbing up the learning curve?


If you've used Xcode at all, it's pretty simple. Assuming you have tracked down suitable Objective-C files, you create a new project in Xcode, choose macOS and Cocoa Framework as the template, then add your Objective-C .h and .m files to the project. The only settings you may need to change are for deployment target, and what headers are exposed (Build Phases -> Headers, and make public what you want exposed).

So in theory you can go to something like this <https://github.com/mwaterfall/MWFeedParser>, download it, copy NSString+HTML.h, NSString+HTML.m, GTMNSString+HTML.h and GTMNSString+HTML.m to your project (plus the required copyright attributions), build it (Product > Build For > Profiling), put the resulting framework in ~/Library/Frameworks and use it like this:

Applescript:

use framework "Foundation"
use framework "NameOfFramework"
use scripting additions

on decodeHtml(handlerArgument)
   set str to htmlString of handlerArgument
   set str to current application's NSString's stringWithString:str
   return str's stringByDecodingHTMLEntities() as string
end decodeHtml

Because the code is a category that extends NSString rather than adding a new class, that's it.

However... that's not a totally good idea. The problem is that when you use that in scripts run from app menus, you're loading the framework into the host app. And it's bad form to add categories like that -- it's probably safe, but there's a small element of risk. I wouldn't distribute it that way. You're better off changing the categories to new classes with your own prefix. It's more work to call them that way, but it's safer. (This is what I do with SMSForder in BridgePlus).


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/


#10 2017-06-09 06:44:10 am

bmose
Member
From:: Massachusetts
Registered: 2006-01-03
Posts: 195

Re: Native Applescript HTML entity decoder

Thanks for the link and the very helpful instructions. It opens up so many possibilities. I appreciate your safety advice about categories and will try to get into the habit of making my own classes from the start.

