Native Applescript HTML entity decoder

bmose · June 8, 2017, 10:37am

NOTE: In the discussion section only (not the handler code), an ASCII 0 null character has been inserted between “&” and the remaining characters of HTML entities to prevent the HTML entities from being decoded into their original Unicode characters by the web browser. This won’t affect the “Open this Scriptlet in your editor” sections, whose Applescript code may be copied and used as is.

The current handler is an HTML decoder written in Applescript. It converts HTML entities in an HTML document to their underlying Unicode characters. HTML entities may take the form of character reference entities such as &copy;, decimal numeric character entities such as &#169;, and hexadecimal numeric character entities such as &#xA9;, all of which represent the copyright character © (Unicode code point 169). (Note that the current handler is not a URL decoder, with which HTML decoders are sometimes confused. URL decoders handle percent-encoded strings like http://ascii.cl?parameter="Click+on+'URL+Decode'!".)

The handler can decode the following:
- Any valid decimal or hexadecimal numeric character entity
- All 252 character reference entities in the HTML 4 entity set
- The five predefined XML entities &quot;, &amp;, &apos;, &lt;, and &gt;, representing the double quotation mark, ampersand, apostrophe, less-than sign, and greater-than sign (Unicode code points 34, 38, 39, 60, and 62, respectively), which apart from &apos; are a subset of the HTML 4 entity set

The handler can process input text strings up to at least 1,000,000 characters in length and containing up to at least 8000 HTML entities (the limits of testing thus far). It takes advantage of several features to maximize execution speed: Applescript’s text item delimiters for searching and replacing, the reconstruction of characters from their Unicode code points via the character id … specifier, and hashed access to a master property list of character reference entities.
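As a minimal sketch of the first two techniques (not the handler's own code), the snippet below uses text item delimiters to replace every occurrence of one entity and rebuilds the replacement character from its code point with character id. The handler name replaceText and the sample string are illustrative assumptions; "&" is kept separate from the rest of the entity for the same reason as in the examples further down.

```applescript
-- Hypothetical helper (not part of decodeHtml): replace every occurrence of searchText in sourceText
on replaceText(sourceText, searchText, replacementText)
	set savedDelimiters to AppleScript's text item delimiters
	try
		considering case
			-- Split the text at each occurrence of the search string...
			set AppleScript's text item delimiters to searchText
			set textParts to text items of sourceText
		end considering
		-- ...then rejoin the pieces with the replacement string
		set AppleScript's text item delimiters to replacementText
		set resultText to textParts as text
	on error errorMessage number errorNumber
		-- Restore the delimiters before passing the error along
		set AppleScript's text item delimiters to savedDelimiters
		error errorMessage number errorNumber
	end try
	set AppleScript's text item delimiters to savedDelimiters
	return resultText
end replaceText

-- Build the copyright sign from its Unicode code point (169)
set copyrightSign to character id 169
replaceText("&" & "copy; 2017 Example", "&" & "copy;", copyrightSign)
--> "© 2017 Example"
```

Splitting and rejoining handles all occurrences in one pass without a character-by-character loop, which is presumably why the handler favors it for speed.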

The handler takes an Applescript record as its input argument with one required and two optional boolean properties:
    Required property:
        htmlString - the text string to be decoded
    Optional boolean properties:
        condenseWhitespaces
            true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
            false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
        decodeAngleBracketsPerHtml5
            true (the default value if the property is omitted) - decodes the character reference entities &lang; and &rang; as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
            false - decodes &lang; and &rang; as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)

The handler’s return value is the decoded text string.
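The optional properties get their defaults through record concatenation, as the handler code does with handlerArgument. Here is a small sketch of that mechanism on its own; the variable names callerArgument and mergedArgument and the sample text are made up for illustration.

```applescript
-- Hypothetical caller-supplied record that sets one optional property and omits the other
set callerArgument to {htmlString:"Fish " & "&" & "amp; chips", condenseWhitespaces:true}

-- When two records are concatenated, a property present in both keeps the left-hand value,
-- so the right-hand record only fills in the properties the caller omitted
set mergedArgument to callerArgument & {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true}

{condenseWhitespaces of mergedArgument, decodeAngleBracketsPerHtml5 of mergedArgument, length of mergedArgument}
--> {true, true, 3}
```

A merged length other than 3 means the caller passed a property label the handler does not know, which is how the handler detects a malformed argument.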

Several HTML decoding solutions are available on the Mac OS X platform, including Cocoa’s NSAttributedString class used in association with the NSHTMLTextDocumentType attribute, the decode_entities function of Perl’s HTML::Entities module, the unescape function of Python’s HTMLParser module, and PHP’s html_entity_decode function. The current handler offers the following potential advantages:
(1) Offers more granular control over whitespace handling than the alternative solutions (to my knowledge, the Cocoa solution always implements HTML 5 behavior and condenses consecutive whitespace characters)
(2) Does not strip HTML tags, in contrast with the Cocoa solution which strips HTML tags
(3) For input text strings with fewer than about 250 HTML entities to be decoded:
- Executes faster than the Perl, Python, and PHP solutions for input text strings smaller than about 100,000 characters in length
- Executes at speeds similar to the Perl, Python, and PHP solutions for input text strings greater than about 100,000 characters in length
(4) For input text strings with fewer than about 25 HTML entities to be decoded:
- Executes faster than the Cocoa solution for input text strings smaller than about 10,000 characters in length
- Executes at speeds similar to the Cocoa solution for input text strings greater than about 10,000 characters in length

The following are known limitations of the current handler:
(1) Does not decode incompletely formed HTML entities, specifically those lacking the trailing semicolon character (this mimics the behavior of the PHP decoder)
(2) Does not recognize the over 2000 new named character references in the HTML 5 standard, which represent a superset of the HTML 4 entity set and which seem thus far at least to be largely unimplemented on web pages (the hashed property list required to handle this many items would exceed Applescript’s list size limit; NOTE: the current handler does recognize the newer entities - in fact, all HTML entities - if they are coded in decimal or hexadecimal numeric form)
(3) Executes more slowly than the alternative solutions for input text strings with more than about 250 HTML entities to be decoded (for example, execution times to decode a Wikipedia article ~900,000 characters in length and containing ~4550 HTML entities: Perl ~0.5 sec, Cocoa ~1.5 sec, current Applescript handler ~2.5 sec)
(4) Slows down significantly if the condenseWhitespaces input argument is set to true and the input text string contains a substantial number of long consecutive sequences of whitespace characters
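
Limitation (4) seems to follow from how the handler's reduceConsecutiveWhitespaces routine works: it repeatedly replaces each pair of spaces with a single space until no pairs remain, and every pass rescans the whole string. The sketch below isolates that loop and counts the passes for a run of 16 spaces; the variable names are illustrative and the pass counter is an addition for demonstration only.

```applescript
-- Build a sample string containing a run of 16 consecutive spaces
set spaceRun to ""
repeat 16 times
	set spaceRun to spaceRun & space
end repeat
set sampleText to "a" & spaceRun & "b"

set passCount to 0
set savedDelimiters to AppleScript's text item delimiters
repeat
	-- Split at each pair of spaces; a single resulting item means no pairs remain
	set AppleScript's text item delimiters to space & space
	set textParts to text items of sampleText
	if (count of textParts) = 1 then exit repeat
	-- Rejoin with single spaces, roughly halving each run of spaces
	set AppleScript's text item delimiters to space
	set sampleText to textParts as text
	set passCount to passCount + 1
end repeat
set AppleScript's text item delimiters to savedDelimiters

{sampleText, passCount}
--> {"a b", 4}
```

Each pass roughly halves the longest run, so long runs need several full-string passes, and many such runs make each pass more expensive.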

HANDLER EXAMPLES HIGHLIGHTING ALTERNATIVE HTML ENTITY FORMS AND INPUT ARGUMENT SETTINGS:


set leftAngleBracket to "&" & "lang;"
set rightAngleBracket to "&" & "rang;"
set rightArrow to "&" & "#x2794;"
set ellipsis1 to "&" & "hellip;"
set ellipsis2 to "&" & "#8230;"
set ellipsis3 to "&" & "#x2026;"
set spades to "&" & "#9824;"
set hearts to "&" & "hearts;"
set clubs to "&" & "clubs;"
set diamonds to "&" & "#x2666;"

set str to leftAngleBracket & "Suits of a deck of cards" & rightAngleBracket & " " & rightArrow & tab & tab & tab & space & space & space & ellipsis1 & " " & spades & " " & ellipsis2 & " " & hearts & " " & ellipsis3 & " " & clubs & " " & ellipsis2 & " " & diamonds & " " & ellipsis1

decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
-- or equivalently and taking advantage of default values for optional properties --
decodeHtml({htmlString:str}) -->
	"⟨Suits of a deck of cards⟩ ➔			   … ♠ … ♥ … ♣ … ♦ …"

decodeHtml({htmlString:str, condenseWhitespaces:false, decodeAngleBracketsPerHtml5:false}) -->
	"〈Suits of a deck of cards〉 ➔			   … ♠ … ♥ … ♣ … ♦ …"

decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}) -->
	"⟨Suits of a deck of cards⟩ ➔ … ♠ … ♥ … ♣ … ♦ …"

decodeHtml({htmlString:str, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:false}) -->
	"〈Suits of a deck of cards〉 ➔ … ♠ … ♥ … ♣ … ♦ …"

(*	Notes:
		- "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
		- ellipsis1, ellipsis2, and ellipsis3 are the character, decimal numeric, and hexadecimal numeric forms of the same ellipsis character (Unicode code point 8230)
		- Angle brackets are Unicode code points 10216 and 10217 in the first and third output strings, and Unicode code points 9001 and 9002 in the second and fourth output strings
*)

HANDLER EXAMPLE HIGHLIGHTING THE DETECTION OF INVALID HTML ENTITIES:


set validCharEntity to "&" & "copy;"
set invalidCharEntity to "&" & "copy123;"
set validDecEntity to "&" & "#169;"
set invalidDecEntity to "&" & "#A69;"
set validHexEntity to "&" & "#xA9;"
set invalidHexEntity to "&" & "#xZ9;"

set str to "" & ¬
	"Valid character entity: " & validCharEntity & return & ¬
	"Nonexistent character entity and thus not decoded: " & invalidCharEntity & return & ¬
	"Valid decimal numeric entity: " & validDecEntity & return & ¬
	"Invalid decimal numeric entity and thus not decoded: " & invalidDecEntity & return & ¬
	"Valid hexadecimal numeric entity: " & validHexEntity & return & ¬
	"Invalid hexadecimal numeric entity and thus not decoded: " & invalidHexEntity

decodeHtml({htmlString:str, condenseWhitespaces:false}) -->
	"Valid character entity: ©
	Nonexistent character entity and thus not decoded: &copy123;
	Valid decimal numeric entity: ©
	Invalid decimal numeric entity and thus not decoded: &#A69;
	Valid hexadecimal numeric entity: ©
	Invalid hexadecimal numeric entity and thus not decoded: &#xZ9;"

(*	Note:
		- "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
*)

HANDLER:


on decodeHtml(handlerArgument)
	(*
   - This handler decodes character, decimal numeric, and hexadecimal numeric html entities in an html string
   - It takes an Applescript record as its input argument with one required and two optional boolean properties:
       Required property:
           htmlString - the text string to be decoded
       Optional boolean properties:
           condenseWhitespaces
               true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
               false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
           decodeAngleBracketsPerHtml5
               true (the default value if the property is omitted) - decodes the character reference entities "lang" and "rang" as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
               false - decodes "lang" and "rang" as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)
   - It returns the decoded html string
   *)
	-- Utility script
	script util
		-- Comprehensive character-entity-based hashed list of HTML-4/XHTML entities, each sublist containing a given html entity's character and decimal numeric forms in positions 1 and 2, respectively
		-- An entity's index in this list is generated by the hashFunction handler with the entity's character form as the handler argument
		-- The decimal numeric forms for "lang" and "rang" are initialized to the null value and will be set during runtime to either their HTML 4 ("9001", "9002") or HTML 5 ("10216", "10217") values depending on the value of the input argument decodeAngleBracketsPerHtml5
		property hashedHtmlEntities : {"", "", "", "", "", "", "", "", {"and", "8743"}, "", "", "", "", {"int", "8747"}, "", "", "", "", {"Rho", "929"}, "", "", "", "", "", {"iota", "953"}, "", "", "", {"psi", "968"}, {"prod", "8719"}, "", "", "", {"not", "172"}, {"prop", "8733"}, "", "", "", {"phi", "966"}, {"sdot", "8901"}, {"theta", "952"}, {"Scaron", "352"}, "", {"amp", "38"}, {"ensp", "8194"}, {"Theta", "920"}, {"there4", "8756"}, "", "", {"isin", "8712"}, "", {"thinsp", "8201"}, "", "", "", {"omega", "969"}, {"scaron", "353"}, "", "", "", {"trade", "8482"}, "", "", {"Chi", "935"}, "", {"thorn", "254"}, "", "", {"sup", "8835"}, {"emsp", "8195"}, {"prime", "8242"}, "", "", "", {"sup1", "185"}, {"image", "8465"}, "", "", "", {"supe", "8839"}, {"pound", "163"}, "", "", {"chi", "967"}, {"sup3", "179"}, {"notin", "8713"}, "", "", "", "", {"kappa", "954"}, "", "", {"eta", "951"}, "", {"Kappa", "922"}, {"otilde", "245"}, "", {"cup", "8746"}, {"sup2", "178"}, "", {"atilde", "227"}, {"Mu", "924"}, {"rho", "961"}, {"nbsp", "160"}, {"acute", "180"}, "", "", "", "", "", {"Ntilde", "209"}, {"or", "8744"}, {"loz", "9674"}, "", {"ocirc", "244"}, {"otimes", "8855"}, {"Nu", "925"}, "", "", {"acirc", "226"}, {"ntilde", "241"}, "", {"cap", "8745"}, "", {"icirc", "238"}, "", {"nu", "957"}, "", "", {"ecirc", "234"}, {"oacute", "243"}, "", {"Psi", "936"}, {"sube", "8838"}, {"asymp", "8776"}, {"aacute", "225"}, "", "", "", "", {"iacute", "237"}, "", {"Phi", "934"}, {"euro", "8364"}, {"exist", "8707"}, {"eacute", "233"}, "", "", {"ordm", "186"}, {"alpha", "945"}, {"Yacute", "221"}, "", {"Zeta", "918"}, {"nsub", "8836"}, "", {"Ccedil", "199"}, {"omicron", "959"}, {"zeta", "950"}, {"part", "8706"}, {"nabla", "8711"}, "", {"lt", "60"}, {"thetasym", "977"}, {"para", "182"}, {"Omega", "937"}, {"rsaquo", "8250"}, "", {"ordf", "170"}, "", {"oline", "8254"}, {"lsaquo", "8249"}, "", {"Eta", "919"}, "", {"Prime", "8243"}, {"ccedil", "231"}, "", {"sub", "8834"}, {"copy", "169"}, {"ucirc", "251"}, 
{"lowast", "8727"}, {"gt", "62"}, "", "", {"frac14", "188"}, {"ne", "8800"}, "", "", "", "", {"iquest", "191"}, "", {"tau", "964"}, {"Iota", "921"}, {"frac34", "190"}, {"uacute", "250"}, "", {"Tau", "932"}, {"cong", "8773"}, {"Gamma", "915"}, {"Lambda", "923"}, "", "", "", "", {"Otilde", "213"}, "", {"ETH", "208"}, {"infin", "8734"}, {"Ecirc", "202"}, "", "", {"beta", "946"}, "", {"Ucirc", "219"}, {"brvbar", "166"}, "", {"sect", "167"}, "", {"frac12", "189"}, {"curren", "164"}, "", {"cent", "162"}, "", {"Ocirc", "212"}, {"Eacute", "201"}, {"mu", "956"}, "", "", "", {"Uacute", "218"}, "", "", "", "", "", {"Xi", "926"}, {"ang", "8736"}, "", "", {"Oacute", "211"}, {"pi", "960"}, "", {"darr", "8595"}, {"equiv", "8801"}, {"yacute", "253"}, {"apos", "39"}, {"perp", "8869"}, "", "", "", "", "", {"delta", "948"}, {"radic", "8730"}, {"le", "8804"}, {"quot", "34"}, "", {"ouml", "246"}, {"crarr", "8629"}, "", {"ni", "8715"}, {"shy", "173"}, {"auml", "228"}, "", "", {"Omicron", "927"}, "", {"iuml", "239"}, {"aring", "229"}, {"Atilde", "195"}, "", "", {"euml", "235"}, {"diams", "9830"}, {"ge", "8805"}, "", "", {"Yuml", "376"}, {"empty", "8709"}, {"divide", "247"}, {"xi", "958"}, {"uml", "168"}, {"spades", "9824"}, {"clubs", "9827"}, {"dagger", "8224"}, "", {"Beta", "914"}, {"bull", "8226"}, {"Acirc", "194"}, {"lambda", "955"}, {"fnof", "402"}, {"sbquo", "8218"}, {"rang", null, "9002", "10217"}, {"Icirc", "206"}, "", {"alefsym", "8501"}, {"bdquo", "8222"}, {"lang", null, "9001", "10216"}, {"rceil", "8969"}, "", "", {"piv", "982"}, {"zwnj", "8204"}, {"lceil", "8968"}, {"Aacute", "193"}, "", {"sum", "8721"}, {"uarr", "8593"}, {"weierp", "8472"}, {"Iacute", "205"}, {"yen", "165"}, {"rsquo", "8217"}, {"Delta", "916"}, {"gamma", "947"}, "", "", {"lsquo", "8216"}, {"dArr", "8659"}, {"Alpha", "913"}, "", "", "", {"uuml", "252"}, "", "", "", {"rdquo", "8221"}, {"macr", "175"}, {"THORN", "222"}, "", "", {"ldquo", "8220"}, {"rarr", "8594"}, "", {"oslash", "248"}, "", {"real", "8476"}, 
{"larr", "8592"}, "", "", "", "", "", "", {"Dagger", "8225"}, {"Pi", "928"}, "", "", {"permil", "8240"}, {"plusmn", "177"}, "", "", {"Euml", "203"}, {"tilde", "732"}, {"middot", "183"}, "", {"oplus", "8853"}, {"Uuml", "220"}, {"Sigma", "931"}, {"ograve", "242"}, "", "", {"frasl", "8260"}, {"szlig", "223"}, {"agrave", "224"}, "", {"lrm", "8206"}, {"Ouml", "214"}, "", {"igrave", "236"}, "", {"raquo", "187"}, {"yuml", "255"}, {"sigma", "963"}, {"egrave", "232"}, {"deg", "176"}, {"laquo", "171"}, "", {"epsilon", "949"}, "", "", "", {"uArr", "8657"}, "", "", "", "", {"cedil", "184"}, {"hearts", "9829"}, "", "", "", {"iexcl", "161"}, {"times", "215"}, {"rfloor", "8971"}, "", "", "", "", {"lfloor", "8970"}, "", "", "", {"micro", "181"}, "", "", "", {"rArr", "8658"}, "", "", "", "", {"lArr", "8656"}, "", "", "", "", {"circ", "710"}, {"minus", "8722"}, "", "", "", "", "", {"ugrave", "249"}, "", "", "", {"upsilon", "965"}, "", "", "", {"Auml", "196"}, {"forall", "8704"}, "", "", "", {"Iuml", "207"}, {"Aring", "197"}, "", "", "", "", {"OElig", "338"}, {"Oslash", "216"}, "", "", "", "", "", "", "", "", "", {"Egrave", "200"}, "", "", {"harr", "8596"}, {"Epsilon", "917"}, {"Ugrave", "217"}, "", "", "", {"Upsilon", "933"}, "", {"reg", "174"}, {"rlm", "8207"}, "", "", {"Ograve", "210"}, "", "", {"oelig", "339"}, {"hellip", "8230"}, "", "", "", {"aelig", "230"}, {"ndash", "8211"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"sim", "8764"}, "", "", "", "", {"zwj", "8205"}, "", "", "", "", "", "", {"AElig", "198"}, "", "", {"eth", "240"}, "", {"sigmaf", "962"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"hArr", "8660"}, "", {"Agrave", "192"}, "", "", "", "", {"Igrave", "204"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"mdash", "8212"}, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", {"upsih", "978"}}
		-- Terms used by the hashFunction handler in generating a character entity's index in the hashedHtmlEntities list
		property hashFunctionTerms : {739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 10, 35, 20, 0, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 175, 135, 30, 60, 95, 5, 0, 5, 180, 739, 15, 5, 0, 15, 110, 110, 739, 5, 5, 5, 100, 739, 739, 0, 20, 0, 739, 739, 739, 739, 739, 739, 5, 60, 50, 0, 15, 144, 115, 215, 10, 225, 10, 95, 125, 25, 0, 5, 218, 90, 20, 0, 65, 35, 55, 45, 115, 5, 15, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739, 739}
		-- Utility handlers
		on hashFunction(theKey)
			-- Hash function that takes as input a character reference entity and returns the hashedHtmlEntities list index and value for that key (or null values if the input key does not match a list item)
			-- The input key is the alphanumeric content of a character reference entity, e.g., "amp" for the ampersand character "&"
			-- The hash function was obtained by running the GNU gperf perfect hash function generator on the complete list of HTML-4/XHTML character reference entities and transforming the C code into equivalent Applescript code
			-- The hash function uses the hashFunctionTerms numeric list in the construction of the list index from the key
			try
				tell (get theKey's id)
					set keyLength to length
					set hashIndex to (my hashFunctionTerms's item ((item keyLength) + 1)) + keyLength + 1
					if keyLength > 0 then
						set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 1) + 1))
						if keyLength > 1 then
							set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 2) + 2))
							if keyLength > 2 then
								set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 3) + 1))
								if keyLength > 4 then
									set hashIndex to hashIndex + (my hashFunctionTerms's item ((item 5) + 1))
								end if
							end if
						end if
					end if
				end tell
				set htmlEntity to my hashedHtmlEntities's item hashIndex
			on error
				set {hashIndex, htmlEntity} to {null, null}
			end try
			return {hashIndex:hashIndex, htmlEntity:htmlEntity}
		end hashFunction
		on hexToDec(hexString)
			-- Hexadecimal-to-decimal string converter
			-- Adapted from Nigel Garvey's hex-to-dec converter (http://macscripter.net/viewtopic.php?pid=71052)
			set refString to "0123456789abcdef"
			set tid to AppleScript's text item delimiters
			try
				tell hexString
					if its class is not equal to text then error
					if it = "" then return ""
				end tell
				set decVal to 0
				ignoring case
					repeat with i in (get hexString's characters)
						set AppleScript's text item delimiters to i's contents
						tell refString's text item 1's length
							if it = 16 then error
							set decVal to decVal * 16 + it
						end tell
					end repeat
				end ignoring
				set decimalString to decVal as text
			on error
				set AppleScript's text item delimiters to tid
				error
			end try
			set AppleScript's text item delimiters to tid
			return decimalString
		end hexToDec
		on reduceConsecutiveWhitespaces(theText)
			set tid to AppleScript's text item delimiters
			try
				set AppleScript's text item delimiters to {tab, return, linefeed}
				tell theText's text items
					set AppleScript's text item delimiters to space
					set theText to it as text
				end tell
				repeat
					set AppleScript's text item delimiters to space & space
					tell theText's text items
						if length = 1 then exit repeat
						set AppleScript's text item delimiters to space
						set theText to it as text
					end tell
				end repeat
			end try
			set AppleScript's text item delimiters to tid
			return theText
		end reduceConsecutiveWhitespaces
	end script
	-- Wrap the code in a try block to capture any errors
	set tid to AppleScript's text item delimiters
	try
		-- Process the handler argument
		try
			tell (handlerArgument & {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true})
				if length is not equal to 3 then error
				set {htmlString, condenseWhitespaces, decodeAngleBracketsPerHtml5} to {its htmlString, its condenseWhitespaces, its decodeAngleBracketsPerHtml5}
			end tell
		on error
			error "The handler argument must be a record with the following required property label:" & return & return & tab & "htmlString" & return & return & "and the following optional property labels:" & return & return & tab & "condenseWhitespaces" & return & tab & "decodeAngleBracketsPerHtml5"
		end try
		if htmlString's class is not equal to text then error "The input argument htmlString is not a text string."
		if condenseWhitespaces's class is not equal to boolean then error "The input argument condenseWhitespaces must be either true or false."
		if decodeAngleBracketsPerHtml5's class is not equal to boolean then error "The input argument decodeAngleBracketsPerHtml5 must be either true or false."
		-- Tailor the hashed html entities list to conform to the decodeAngleBracketsPerHtml5 setting
		set {i1, i2} to {util's hashFunction("lang")'s hashIndex, util's hashFunction("rang")'s hashIndex}
		tell (3 + (decodeAngleBracketsPerHtml5 as integer)) to set {util's hashedHtmlEntities's item i1's item 2, util's hashedHtmlEntities's item i2's item 2} to {util's hashedHtmlEntities's item i1's item it, util's hashedHtmlEntities's item i2's item it}
		-- Split the string at each "&" character
		set AppleScript's text item delimiters to "&"
		tell (get htmlString's text items)
			-- Handle the case of an html string without "&" characters
			if length < 2 then return htmlString
			-- Begin the decoded string with the text preceding the first "&" character
			set {decodedSubstrings, ampersandPrefixedSubstrings} to {{item 1}, rest}
		end tell
		-- Process the &-prefixed substrings
		repeat with currAmpersandPrefixedSubstring in ampersandPrefixedSubstrings
			try
				-- Split the current &-prefixed substring at each ";" character
				set AppleScript's text item delimiters to ";"
				tell (get currAmpersandPrefixedSubstring's text items)
					-- Handle the case of an &-prefixed substring without ";" characters
					if length < 2 then error
					-- Get the text preceding the first ";" character, which is the html entity candidate
					set {htmlEntityCandidate, semicolonPrefixedSubstrings} to {item 1, rest}
				end tell
				-- Test if the current html entity candidate is valid
				if htmlEntityCandidate starts with "#x" then
					-- Try processing the current html entity candidate as a hexadecimal numeric entity, and get its equivalent decimal value
					set decimalString to util's hexToDec(htmlEntityCandidate's text 3 thru -1)
				else if htmlEntityCandidate starts with "#" then
					-- Else try processing the current html entity candidate as a decimal numeric entity, and get its decimal value
					tell htmlEntityCandidate's text 2 thru -1
						ignoring white space
							-- Flag as invalid a decimal numeric entity that starts or ends with a whitespace character
							if (text 1 = "") or (text -1 = "") then error
						end ignoring
						set decimalString to it
					end tell
				else
					-- Else try processing the current html entity candidate as a character entity, and get its corresponding decimal value from the hashed html entities list
					tell htmlEntityCandidate's length to if (it < 2) or (it > 8) then error -- flags as invalid a character entity if its alphanumeric component's string length is invalid
					set hashedHtmlEntity to util's hashFunction(htmlEntityCandidate)'s htmlEntity
					considering case
						-- Confirm that the hash function returned the proper html entity
						if htmlEntityCandidate is not equal to hashedHtmlEntity's item 1 then error
					end considering
					set decimalString to hashedHtmlEntity's item 2
				end if
				-- Replace the html entity at the start of the current &-prefixed substring with its underlying Unicode character, and append the modified substring to the decoded string; if the replacement fails, the error will leave the current substring in its original form
				set end of decodedSubstrings to (character id decimalString) & (semicolonPrefixedSubstrings as text)
			on error
				-- If any error is encountered, thus signifying that the current &-prefixed substring does not start with a valid html entity, leave the current substring in its original form
				set end of decodedSubstrings to "&" & currAmpersandPrefixedSubstring
			end try
		end repeat
		-- Transform the list of substrings into a single decoded text string
		set AppleScript's text item delimiters to ""
		set decodedString to decodedSubstrings as text
		-- Condense consecutive whitespace characters if specified in the input argument condenseWhitespaces
		if condenseWhitespaces then set decodedString to util's reduceConsecutiveWhitespaces(decodedString)
	on error m number n
		set AppleScript's text item delimiters to tid
		if n = -128 then error number -128
		if n ≠ -2700 then set m to "(" & n & ") " & m
		error ("Problem with handler decodeHtml:" & return & return & m) number n
	end try
	set AppleScript's text item delimiters to tid
	-- Return the results
	return decodedString
end decodeHtml

[Edit note: The entire post was just resubmitted without changes to take advantage of the fact that the MacScripter website now renders Unicode characters properly. Previously garbled areas of text should now display properly.]

Nigel_Garvey · June 8, 2017, 1:33pm

Wow. That looks like a pretty comprehensive vanilla solution!

It’s somewhat easier in ASObjC, which leaves all the decoding effort to methods built into Mac OS’s Foundation framework:

use AppleScript version "2.4" -- Mac OS 10.10 (Yosemite) or later
use framework "Foundation"

on decodeHtml(handlerArgument)
	try
		-- Set variables to the arguments, using default values for any omitted. (Defaults for space condensation and angle bracket as per HTML5.)
		set {htmlString:str, condenseWhitespaces:condenseWhitespaces, decodeAngleBracketsPerHtml5:decodeAngleBracketsPerHtml5} to handlerArgument & {htmlString:missing value, condenseWhitespaces:true, decodeAngleBracketsPerHtml5:true}
		if (str is missing value) then error
	on error
		error "The handler argument must be a record with the following required property label:" & return & return & tab & "htmlString" & return & return & "and the following optional property labels:" & return & return & tab & "condenseWhitespaces" & return & tab & "decodeAngleBracketsPerHtml5"
	end try
	
	set |⌘| to current application
	-- Get an NSMutableString version of the input string.
	set str to |⌘|'s class "NSMutableString"'s stringWithString:(str)
	-- If not condensing white spaces, replace them with HTML equivalents.
	if (not condenseWhitespaces) then
		-- (The concatenations shown here are only needed when displaying the script code on a Web site. The entities themselves can be used otherwise.)
		tell str to replaceOccurrencesOfString:(space) withString:("&" & "nbsp;") options:(0) range:({0, its |length|()})
		tell str to replaceOccurrencesOfString:(tab) withString:("&" & "#9;") options:(0) range:({0, its |length|()})
		tell str to replaceOccurrencesOfString:("\\R") withString:("<br />") options:(|⌘|'s NSRegularExpressionSearch) range:({0, its |length|()})
	end if
	-- Derive an NSData object from the HTML string and an NSAttributedString from that.
	set HTMLData to str's dataUsingEncoding:(|⌘|'s NSUTF8StringEncoding)
	set attributedStr to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(HTMLData) documentAttributes:(missing value)
	-- Read off the decoded string from the NSAttributedString.
	set decodedString to attributedStr's |string|()
	-- Any angle brackets in the result are HTML5 interpretations. Replace them with the other type if required.
	if (not decodeAngleBracketsPerHtml5) then
		set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10216) withString:(character id 9001)
		set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10217) withString:(character id 9002)
	end if
	
	-- Return the final result as AppleScript text.
	return decodedString as text
end decodeHtml

bmose · June 8, 2017, 1:52pm

Thanks, Nigel, and wow, what a creative way to preserve whitespaces with the ASObjC decoder! That was one of the problems that prompted me to write an Applescript solution. The other side-effect of the ASObjC decoder is that it strips away HTML tags. I often use those tags as handles for regular expression searches of downloaded web pages. The only Cocoa solution I could find involves Core Foundation’s CFXMLCreateStringByUnescapingEntities function. If that could be bridged to Applescript, that might be yet another good solution.

DJ_Bazzie_Wazzie · June 8, 2017, 2:29pm

When using xml entities you should use the HTML entities DTD list which can be found here: https://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_dtd_xhtml_character_entities

bmose · June 8, 2017, 4:33pm

Thanks for the link, DJ. Is there any way of executing the CFXMLCreateStringByUnescapingEntities function from an Applescript script directly or indirectly that does not involve a full-fledged Cocoa application?

Nigel_Garvey · June 8, 2017, 7:23pm

I suppose it depends on what your ultimate aim is. In a script I use to get the content of MacScripter thread pages, formatted in a certain way as plain text, I use the tags to identify the sections I want to edit, do the edits, then delete all irrelevant tags and run whatever’s left through an NSAttributedString. If your aim’s just to convert HTML entities but leave the tags in place, you might get away with entitising (ouch!) the tag brackets too in the script above:

if (keepingTags) then
	tell str to replaceOccurrencesOfString:("<") withString:("&" & "lt;") options:(0) range:({0, its |length|()})
	tell str to replaceOccurrencesOfString:(">") withString:("&" & "gt;") options:(0) range:({0, its |length|()})
end if

Shane_Stanley · June 8, 2017, 11:47pm

You can wrap it in a framework. There are several open source third-party frameworks with categories on NSString to do what you want – you could build your own framework from one of those.

bmose · June 9, 2017, 5:08am

Thanks for another creative entitization (:lol:) suggestion. I tend to parse as you do: use tags as search handles → extract the desired text → decode HTML entities. My focus on preserving HTML tags is more for robustness so that that option is available for some future need.

That sounds like a great solution for the current task and also a powerful tool in general. I don’t have experience creating frameworks. Could you possibly suggest a specific link that might be particularly helpful for someone like myself climbing up the learning curve?

Shane_Stanley · June 9, 2017, 6:04am

If you’ve used Xcode at all, it’s pretty simple. Assuming you have tracked down suitable Objective-C files, you create a new project in Xcode, choose macOS and Cocoa Framework as the template, then add your Objective-C .h and .m files to the project. The only settings you may need to change are for deployment target, and what headers are exposed (Build Phases → Headers, and make public what you want exposed).

So in theory you can go to something like this https://github.com/mwaterfall/MWFeedParser, download it, copy NSString+HTML.h, NSString+HTML.m, GTMNSString+HTML.h and GTMNSString+HTML.m to your project (plus the required copyright attributions), Build (For Profiling), put it in ~/Library/Frameworks and use it like this:

use framework "Foundation"
use framework "NameOfFramework"
use scripting additions

on decodeHtml(handlerArgument)
	set str to htmlString of handlerArgument
	set str to current application's NSString's stringWithString:str
	return str's stringByDecodingHTMLEntities() as string
end decodeHtml

Because the code is a category that extends NSString rather than adding a new class, that’s it.

However… that’s not a totally good idea. The problem is that when you use that in scripts run from app menus, you’re loading the framework into the host app. And it’s bad form to add categories like that – it’s probably safe, but there’s a small element of risk. I wouldn’t distribute it that way. You’re better off changing the categories to new classes with your own prefix. It’s more work to call them that way, but it’s safer. (This is what I do with SMSForder in BridgePlus).

bmose · June 9, 2017, 11:44am

Thanks for the link and the very helpful instructions. It opens up so many possibilities. I appreciate your safety advice about categories and will try to get into the habit of making my own classes from the start.