#1 2018-02-05 06:01:23 pm

Mr. Science
Member
From: Satellite Beach, Florida
Registered: 2015-08-13
Posts: 8

Removing or decoding double-encoded HTML Character Entities

While using cURL to download a string for another project, I ran into double-encoded characters in the downloaded text. Where there should have been an apostrophe, there was instead an encoded ampersand nested inside the encoding for an apostrophe (a.k.a. the single quote character).

The whole mess looked like this: (I added spaces here so it would read easily.)
& a m p ; # 3 9 ;

Obviously,  & a m p ;   is supposed to be an ampersand, so that combined with   # 3 9 ;
it becomes   & # 3 9 ;   which is the encoded HTML entity for an apostrophe (a.k.a. the single quote character).
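
To make the two layers concrete, here is a minimal sketch that peels them off one at a time with plain text replacement. The handler name replaceText is hypothetical and is not part of the script posted below; it only handles the one entity discussed here.

```applescript
-- Hypothetical helper: replace every occurrence of one substring with another.
on replaceText(findText, replaceWith, sourceText)
	set savedTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set parts to text items of sourceText
	set AppleScript's text item delimiters to replaceWith
	set resultText to parts as text
	set AppleScript's text item delimiters to savedTIDs
	return resultText
end replaceText

set doubleEncoded to "man&amp;#39;s"
-- First layer: the encoded ampersand becomes a real ampersand.
set singleEncoded to replaceText("&amp;", "&", doubleEncoded) -- "man&#39;s"
-- Second layer: the numeric entity becomes the apostrophe itself.
set decoded to replaceText("&#39;", "'", singleEncoded) -- "man's"
```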

My solution was to first use a PHP shell script I found in another thread to decode the ampersand, then step through the string a few characters at a time, decoding any numeric HTML entities that remain.

I thought I'd share this since it combines two solutions to two similar problems.
I hope it comes in handy for someone with a similar challenge!

Applescript:


set ORIGstrng to "Anxiety in a man&amp;#39;s heart weighs him down, but a good word makes him glad."
#NOTE: the string above contains a double-encoded apostrophe (an encoded ampersand inside the encoded apostrophe). It opened correctly for me when MacScripter's "Open this Scriplet in your Editor:" link was clicked. (yay!) However, MacScripter's code display decodes the ampersand, so online it may appear as only a single encoding of the apostrophe character. Enjoy.

# THIS SOLVES THE &__; entity
set NEWstrng to do shell script "php -R 'echo html_entity_decode($argn).\"\\n\";' <<<" & quoted form of ORIGstrng

# THIS SOLVES THE &#__; entity!
set rslt to ""
set skipchars to 0
set extndSTRNG to NEWstrng & "****"
repeat with chars from 1 to (count of extndSTRNG)
   if chars ≤ (count of NEWstrng) then
       set strng to (items chars thru (chars + 4) of extndSTRNG) as text
       set nxt to item 1 of strng
       if item 1 of strng contains "&" then
           set amp to offset of "&" in strng
           set sem to amp + (offset of ";" in (items (amp) thru (-1) of strng) as text) - 1
           try
               set HTMLentity to (items amp thru sem of strng) as text --"&#39;"
               set ASKYnum to ((items 3 thru -2 of HTMLentity) as text) as number
               set rslt to rslt & (ASCII character ASKYnum)
               set skipchars to length of HTMLentity
           end try
       end if
       if skipchars = 0 then
           set rslt to rslt & nxt
       else
           set skipchars to skipchars - 1
       end if
   end if
end repeat
return rslt --"Anxiety in a man's heart weighs him down, but a good word makes him glad."


"Fail and fail until you fail to fail!"   ~   http://www.theMrScienceShow.com


#2 2018-02-06 05:05:33 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 4622

Re: Removing or decoding double-encoded HTML Character Entities

Hi Mr. Science.

Thanks for sharing this. I'd want to know why the downloaded text contained double-encoded entities, but it does make an interesting coding exercise.

There are actually a couple of issues with your code. Firstly:

Applescript:

set strng to (items chars thru (chars + 4) of extndSTRNG) as text

… and a couple of similar lines. The basic elements of text are 'characters', not 'items', although the latter term does work. But in any case, extracting a list of characters and then coercing it to text isn't efficient and the result depends on the current value of AppleScript's text item delimiters. This version extracts the substring directly and is less to type:

Applescript:

set strng to (text chars thru (chars + 4) of extndSTRNG)
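
As a small illustration of the delimiter dependence, the sketch below sets the delimiters to a hyphen on purpose. The variable names and the sample string are made up for the demonstration.

```applescript
set sample to "&#39;s heart"

set savedTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to "-"

-- List of characters coerced to text: the current delimiter is inserted between them.
set viaList to (items 1 thru 5 of sample) as text -- "&-#-3-9-;"

-- Direct substring: unaffected by the delimiters.
set direct to text 1 thru 5 of sample -- "&#39;"

set AppleScript's text item delimiters to savedTIDs
```

With the delimiters at their usual default of an empty string, both forms give the same text, which is why the problem is easy to miss.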

The other point is that 'ASCII character' and 'ASCII number' have been deprecated for a while and we're now encouraged to use 'character id' and 'id of' instead. These are better suited for use with AppleScript's now-Unicode text.
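
A brief sketch of the replacement forms, using the apostrophe from this thread. The variable names are illustrative only.

```applescript
-- Deprecated form used in the original script:
--   ASCII character 39

-- Current equivalents:
set apostrophe to character id 39 -- "'"
set codePoint to id of "'" -- 39

-- 'character id' also accepts Unicode code points beyond the old 0-255 range:
set rightQuote to character id 8217 -- "’" (right single quotation mark, U+2019)
```

In the original loop, this would mean writing (character id ASKYnum) in place of (ASCII character ASKYnum).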

If you're running macOS 10.10 or later, this ASObjC script may be of interest. Sorry about the name of the handler.

Applescript:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

on multiDeentitise(stringOrText, levels)
   set |⌘| to current application
   -- Get an NSString version of the original.
   set theString to |⌘|'s class "NSString"'s stringWithString:(stringOrText)
   
   repeat levels times
       -- Get an NSData version of the string and derive an NSAttributedString from the result, assuming it to be HTML data.
       set theData to theString's dataUsingEncoding:(|⌘|'s NSUnicodeStringEncoding)
       set attrString to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(theData) baseURL:(missing value) documentAttributes:(missing value)
       -- Extract a single-decoded string from the NSAttributedString.
       set theString to attrString's |string|()
   end repeat
   
   -- Return the final result as AS text.
   return theString as text
end multiDeentitise

set ORIGstrng to "Anxiety in a man&" & "amp;#39" & ";s heart weighs him down, but a good word makes him glad."
multiDeentitise(ORIGstrng, 2) -- Resolve two levels of encoding.

An alternative to specifying the number of levels would be simply to repeat until the string didn't change.
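
A minimal sketch of that alternative follows, reusing the same decoding step as the handler above. The handler name deentitiseUntilStable and the maxPasses safety cap are assumptions added for illustration, and the sketch has not been tested against varied input. Note that HTML parsing may also collapse whitespace or treat "<" as the start of a tag, and that text which legitimately contains entity-like sequences would be decoded further than intended.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "AppKit" -- Assumption: loaded explicitly because the HTML initialiser is an AppKit addition.
use scripting additions

-- Hypothetical handler: decode repeatedly until a pass leaves the string unchanged.
-- maxPasses is an assumed safety cap so the loop cannot run indefinitely.
on deentitiseUntilStable(stringOrText, maxPasses)
	set |⌘| to current application
	set theString to |⌘|'s class "NSString"'s stringWithString:(stringOrText)
	
	repeat maxPasses times
		-- Same single-level decode as in multiDeentitise.
		set theData to theString's dataUsingEncoding:(|⌘|'s NSUnicodeStringEncoding)
		set attrString to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(theData) baseURL:(missing value) documentAttributes:(missing value)
		set newString to attrString's |string|()
		-- Stop as soon as a pass makes no difference.
		if (newString's isEqualToString:(theString)) as boolean then exit repeat
		set theString to newString
	end repeat
	
	return theString as text
end deentitiseUntilStable

set ORIGstrng to "Anxiety in a man&" & "amp;#39" & ";s heart weighs him down, but a good word makes him glad."
deentitiseUntilStable(ORIGstrng, 5)
```

This costs one extra pass compared with a known level count, since the last pass only confirms that nothing changed.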


NG
