Removing or decoding double-encoded HTML character entities

While using cURL to download a string for another project, I ran into double-encoded characters in the downloaded text. Where there should have been an apostrophe, there was instead an encoded ampersand nestled inside the encoding for an apostrophe (a.k.a. the single quote character).

The whole mess looked like this: (I added spaces here so it would read easily.)
& a m p ; # 3 9 ;

Obviously, & a m p ; is supposed to be an ampersand, so that when joined to # 3 9 ; it becomes & # 3 9 ; which is the encoded HTML entity for an apostrophe (a.k.a. the single quote character).
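
To make the two layers concrete, here is a minimal sketch that peels them off one at a time with plain text replacement. The replaceText handler and the variable names are hypothetical and are not part of the script posted below; the entities are built by concatenation only so that a web page displaying the code cannot decode them.

```applescript
-- Hypothetical helper: replace every occurrence of findText in sourceText.
on replaceText(findText, replaceWith, sourceText)
	set savedTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set parts to text items of sourceText
	set AppleScript's text item delimiters to replaceWith
	set resultText to parts as text
	-- Restore whatever delimiters the caller was using.
	set AppleScript's text item delimiters to savedTIDs
	return resultText
end replaceText

-- A double-encoded apostrophe inside a short sample word.
set doubleEncoded to "man&" & "amp;#39;s"
-- First pass: the encoded ampersand becomes a real ampersand.
set singleEncoded to replaceText("&" & "amp;", "&", doubleEncoded)
-- Second pass: the numeric entity becomes a real apostrophe.
set decoded to replaceText("&" & "#39;", "'", singleEncoded)
return {singleEncoded, decoded}
```

The first item of the result still holds the single-encoded entity; only the second pass yields the apostrophe.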

My solution was to first use a PHP shell command I found in another thread to decode the ampersand, then step through the string a few characters at a time, looking for any remaining numeric HTML entities.

I thought I’d share this since it combines two solutions to two similar problems.
I hope it comes in handy for someone with a similar challenge!

 
set ORIGstrng to "Anxiety in a man&amp" & ";#39" & ";s heart weighs him down, but a good word makes him glad."
#NOTE: the entity is split across three concatenated pieces so that MacScripter's code display cannot decode it. The assembled string contains the apostrophe entity with its ampersand encoded a second time.

# THIS SOLVES THE  &__;  entity
set NEWstrng to do shell script "php -R 'echo html_entity_decode($argn).\"\\n\";' <<<" & quoted form of ORIGstrng

# THIS SOLVES THE  &#__; entity!
set rslt to ""
set skipchars to 0
set extndSTRNG to NEWstrng & "****"
repeat with chars from 1 to (count of extndSTRNG)
	if chars ≤ (count of NEWstrng) then
		set strng to (items chars thru (chars + 4) of extndSTRNG) as text
		set nxt to item 1 of strng
		if item 1 of strng contains "&" then
			set amp to offset of "&" in strng
			set sem to amp + (offset of ";" in (items (amp) thru (-1) of strng) as text) - 1
			try
				set HTMLentity to (items amp thru sem of strng) as text --"&"
				set ASKYnum to ((items 3 thru -2 of HTMLentity) as text) as number
				set rslt to rslt & (ASCII character ASKYnum)
				set skipchars to length of HTMLentity
			end try
		end if
		if skipchars = 0 then
			set rslt to rslt & nxt
		else
			set skipchars to skipchars - 1
		end if
	end if
end repeat
return rslt --"Anxiety in a man's heart weighs him down, but a good word makes him glad."

Hi Mr. Science.

Thanks for sharing this. I’d want to know why the downloaded text contained double-encoded entities, but it does make an interesting coding exercise. :slight_smile:

There are actually a couple of issues with your code. Firstly:

set strng to (items chars thru (chars + 4) of extndSTRNG) as text

… and a couple of similar lines. The basic elements of text are ‘characters’, not ‘items’, although the latter term does work. But in any case, extracting a list of characters and then coercing it to text isn’t efficient and the result depends on the current value of AppleScript’s text item delimiters. This version extracts the substring directly and is less to type:

set strng to (text chars thru (chars + 4) of extndSTRNG)
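
A minimal sketch of that delimiter dependence, using a made-up sample string and a deliberately unusual delimiter (both are assumptions for illustration only):

```applescript
set sampleText to "double-encoded"

-- Save the current delimiters, then set an unusual one to expose the difference.
set savedTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to "-"

-- A list of characters coerced to text is joined with the current delimiter.
set viaList to (characters 1 thru 6 of sampleText) as text
-- A direct substring does not involve the delimiters at all.
set direct to text 1 thru 6 of sampleText

set AppleScript's text item delimiters to savedTIDs
return {viaList, direct}
--> {"d-o-u-b-l-e", "double"}
```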

The other point is that ‘ASCII character’ and ‘ASCII number’ have been deprecated for a while and we’re now encouraged to use ‘character id’ and ‘id of’ instead. These are better suited for use with AppleScript’s now-Unicode text.
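
As a small sketch of the replacement terms, assuming a numeric entity has already been isolated (the variable names are hypothetical, and the entity is concatenated only so that a web page cannot decode it):

```applescript
-- The numeric entity for an apostrophe.
set entityText to "&" & "#39;"

-- Strip the leading "&#" and the trailing ";" to get the code point.
set codePoint to (text 3 thru -2 of entityText) as integer

-- 'character id' replaces 'ASCII character'.
set decodedChar to character id codePoint
-- 'id of' replaces 'ASCII number'.
set roundTrip to id of decodedChar

-- Code points above the old 0-255 range work too, e.g. the right single quotation mark.
set curlyQuote to character id 8217

return {decodedChar, roundTrip, curlyQuote}
```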

If you’re running macOS 10.10 or later, this ASObjC script may be of interest. Sorry about the name of the handler. :wink:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "AppKit" -- NSAttributedString's HTML initialiser is an AppKit addition
use scripting additions

on multiDeentitise(stringOrText, levels)
	set |⌘| to current application
	-- Get an NSString version of the original.
	set theString to |⌘|'s class "NSString"'s stringWithString:(stringOrText)
	
	repeat levels times
		-- Get an NSData version of the string and derive an NSAttributedString from the result, assuming it to be HTML data.
		set theData to theString's dataUsingEncoding:(|⌘|'s NSUnicodeStringEncoding)
		set attrString to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(theData) baseURL:(missing value) documentAttributes:(missing value)
		-- Extract a single-decoded string from the NSAttributedString.
		set theString to attrString's |string|()
	end repeat
	
	-- Return the final result as AS text.
	return theString as text
end multiDeentitise

set ORIGstrng to "Anxiety in a man&amp" & ";#39" & ";s heart weighs him down, but a good word makes him glad."
multiDeentitise(ORIGstrng, 2) -- Resolve two levels of encoding.

An alternative to specifying the number of levels would be simply to repeat until the string didn’t change.
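
A rough sketch of that alternative follows. The handler name fullyDeentitise is hypothetical, and the approach assumes the text should be decoded as far as it will go, so an entity that was meant to stay encoded would be decoded as well. It also inherits any other changes the HTML import makes to the text.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "AppKit" -- for NSAttributedString's HTML initialiser
use scripting additions

-- Hypothetical handler: keep decoding until a pass leaves the text unchanged.
on fullyDeentitise(stringOrText)
	set |⌘| to current application
	set theString to |⌘|'s class "NSString"'s stringWithString:(stringOrText)
	
	repeat
		-- Same single-level decoding step as in multiDeentitise.
		set theData to theString's dataUsingEncoding:(|⌘|'s NSUnicodeStringEncoding)
		set attrString to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(theData) baseURL:(missing value) documentAttributes:(missing value)
		set newString to attrString's |string|()
		-- Stop as soon as decoding no longer changes anything.
		if (newString's isEqualToString:(theString)) as boolean then exit repeat
		set theString to newString
	end repeat
	
	return theString as text
end fullyDeentitise

set ORIGstrng to "Anxiety in a man&amp" & ";#39" & ";s heart weighs him down, but a good word makes him glad."
fullyDeentitise(ORIGstrng)
```

This always costs one extra pass compared with a known level count, since the final pass is the one that confirms nothing changed.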

Thank you, Nigel. I always look forward to your commentary, but it appears I missed this one a few months back. I apologize for not responding sooner. I guess I should have subscribed to my own post! Tips on improving efficiency are a welcome salve to my coding “style”, so I’m definitely going to be reading up on the NSString class! I was a bit disappointed to see this warning about ASObjC not playing nice in Sierra and up (https://www.macosxautomation.com/applescript/apps/runner.html). But I’m still on Yosemite for now, so that’s OK. Also, I like your use of the command character (⌘) as a variable name. I have been using more Unicode characters myself, including the colored ‘icons’, which run just fine in both scripts and alerts.

Since you appreciate a good slapstick effort, I’ll leave you with another of my inefficient gems. This was all I could think of to get the parent folder in a one-liner the other night. I expect there’s a better way, but this got everything flowing at the time, so I ran with it!


set hom to getParentFLDR(path to desktop)
on getParentFLDR(pth)
	return alias ((items 1 thru (-1 - (offset of ":" in (reverse of (items 1 thru -2 of (pth as text))) as text)) of (pth as text)) as text)
end getParentFLDR

Model: Mac Pro, Yosemite
AppleScript: 2.7
Browser: Safari 601.2.7
Operating System: Mac OS X (10.13 Developer Beta 3)

That’s a warning about ASObjC Runner.app, not ASObjC itself. ASObjC is fine in Sierra and up.

Your one-liner works only as long as the TIDs (AppleScript’s text item delimiters) are set to their default value. For safety, you’d have to use a three-liner:


set AppleScript's text item delimiters to ":o)"
set hom to getParentFLDR(path to desktop)
on getParentFLDR(pth)
	set {astid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
	set {parentFLDR, AppleScript's text item delimiters} to {alias (text 1 thru (-1 - (offset of ":" in (reverse of (characters 1 thru -2 of (pth as text))) as text)) of (pth as text)), astid}
	return parentFLDR
end getParentFLDR

Or preferably:


set AppleScript's text item delimiters to ":o)"
set hom to getParentFLDR(path to desktop)
on getParentFLDR(pth)
	set {astid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ":"}
	tell (pth as text) to set {parentFLDR, AppleScript's text item delimiters} to {(text 1 thru text item (-2 - ((it ends with ":") as integer))) as alias, astid}
	return parentFLDR
end getParentFLDR

:wink:

Thanks Nigel! :smiley:

Shane, thanks for the correction. Looks like I have more reading to do!

Lest any future readers be led astray by posts #3 and #5 above, these are the more usual approaches:


set hom to getParentFLDR(path to desktop)

on getParentFLDR(pth)
	tell application "System Events" to return (path of container of (pth as alias)) as alias
end getParentFLDR

Or:


set hom to getParentFLDR(path to desktop)

on getParentFLDR(pth)
	tell application "Finder" to return (container of (pth as alias)) as alias
end getParentFLDR