set myHtml to do shell script "curl http://www.wpr.org/book/lastweek.html"
set {myDels, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"</head>"}}
set myText to text item 2 of myHtml
set AppleScript's text item delimiters to myDels
set myText to remove_markup(myText)
on remove_markup(this_text)
set myAllowed to {"://", "@"}
set {myTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"<"}}
set these_items to text items of this_text
set AppleScript's text item delimiters to {">"}
repeat with my_item in these_items
if length of text items of my_item is greater than 1 then
if (text item 1 of my_item) is not in myAllowed then
set my_item's contents to (text item 2 of my_item)
else
set my_item's contents to ("<" & (text items of my_item)) as text
end if
end if
end repeat
set AppleScript's text item delimiters to myTID
set clean_text to these_items as text
return clean_text
end remove_markup
set newtext to remove_markup()
the text returns a lot of
and the newtext command gives the error
I don’t understand why this doesn’t work. Isn’t it declared as one variable on the call statement?
I have tried using this script to remove the text
set thisText1 to these_items as text
set AppleScript's text item delimiters to " "
set thisText1 to thisText1's text items
set AppleScript's text item delimiters to ""
set thisText1 to "" & thisText1
set AppleScript's text item delimiters to {""}
return thisText1
But this isn’t working inside the subroutine, and I can’t call it from outside the subroutine.
Why?
The first line of the “handler statement” (all the text from on remove_markup(this_text) to end remove_markup) shows that the handler expects to be passed one value when it’s called.
Your “call statement” to that handler (set newtext to remove_markup()) has nothing in its parameter list, so it’s not passing any values to the handler. Hence the error.
From the fact that you’ve put the handler statement in the middle of the running code, I’d guess you haven’t appreciated that a handler is a discrete block of code that’s only run when it’s called. The variable(s) mentioned in its first line (‘this_text’ in this case) are there to receive the values passed by the call and are set when the handler’s called. They’re local to the handler and can’t be seen in the script outside. More specifically, they’re local to that particular execution of the handler. If the handler’s called again, or if it calls itself recursively, the values of the parameter variables only apply to the current call. (Each execution has its own set of variables with those names.) The same goes for any other variables used in the handler that aren’t explicitly declared globals or properties.
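As a minimal sketch of that locality (the handler name shout and the variable suffix are made up for illustration and are not part of the original script), the parameter inside the handler and a variable with the same name in the run code are unrelated:

```applescript
on shout(this_text)
	-- this_text and suffix exist only for the duration of this call.
	set suffix to "!"
	return this_text & suffix
end shout

set this_text to "outside" -- a different variable from the handler's parameter
set greeting to shout("hello")
{greeting, this_text} -- greeting is "hello!"; this_text is still "outside"
```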
If you want a handler to return a value, use a ‘return’ statement as you’ve done above. Otherwise, the handler will return the result of the last statement executed in it.
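A small sketch of the two ways a handler can hand back a value (both handler names are hypothetical):

```applescript
on explicit_sum(a, b)
	return a + b -- explicit 'return'
end explicit_sum

on implicit_sum(a, b)
	a + b -- no 'return': the result of the last statement executed is returned
end implicit_sum

{explicit_sum(2, 3), implicit_sum(2, 3)} -- both items are 5
```

The explicit form is usually easier to read, because it doesn’t depend on which statement happens to run last.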
It’s neater and less confusing not to put handler statements in the middle of the run code, but to have them all at one end of the script, either at the beginning or at the end:
set myHtml to do shell script "curl http://www.wpr.org/book/lastweek.html"
set {myDels, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"</head>"}}
set myText to text item 2 of myHtml
set AppleScript's text item delimiters to myDels
set myText to remove_markup(myText)
set newtext to remove_markup("I can't think of anything to <say>")
-- Handler statement(s) at one end of the script.
on remove_markup(this_text)
set myAllowed to {"://", "@"}
set {myTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"<"}}
set these_items to text items of this_text
set AppleScript's text item delimiters to {">"}
repeat with my_item in these_items
if length of text items of my_item is greater than 1 then
if (text item 1 of my_item) is not in myAllowed then
set my_item's contents to (text item 2 of my_item)
else
set my_item's contents to ("<" & (text items of my_item)) as text
end if
end if
end repeat
set AppleScript's text item delimiters to myTID
set clean_text to these_items as text
return clean_text
end remove_markup
Just for fun, here is a different approach to extracting the text from the website:
property a1 : «data utxtFFFC» as Unicode text
property a2 : «data utxtFFFC2028» as Unicode text
property a3 : «data utxtFFFCFFFC» as Unicode text
property a4 : «data utxt00A0» as Unicode text
property a5 : «data utxt2028» as Unicode text
property a6 : «data utxtFFFC00202028» as Unicode text
property a7 : «data utxtFFFC2028FFFC2028» as Unicode text
set temp to ((path to temporary items as Unicode text) & "wpr.txt") -- define tempfile
set Ptemp to quoted form of POSIX path of temp
do shell script "curl http://www.wpr.org/book/lastweek.html -o " & Ptemp
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile
set newText to {}
repeat with i in theText
tell contents of i
if {it} is in {a1, a2, a3, a4} or it is "" or it begins with a7 then
-- do nothing
else if it ends with a1 then
set end of newText to text 1 thru -2 of it
else if it begins with a6 then
set end of newText to text 4 thru -1 of it
else if it begins with a2 or it begins with (a1 & space) then
set end of newText to text 3 thru -1 of it
else if it begins with a1 or it begins with space or it begins with a5 then
set end of newText to text 2 thru -1 of it
else
set end of newText to it
end if
end tell
end repeat
set {TID, text item delimiters} to {text item delimiters, return}
set newText to newText as Unicode text
set text item delimiters to TID
newText
Is there a definite reason for having a1 - a7 set as properties, as opposed to normal variables?
The reason I ask is to be able to make that code into a self-contained subroutine that I could reuse easily from one script to the next, without splitting it up by having separate properties. I tried it and it worked…but I didn’t know if there was a more long term reason that one attempt did not show.
The main reason is to save time: properties are initialized when the script is compiled, so the values are already defined when you run it.
But you can also use normal variables.
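As a rough sketch of the local-variable form inside a reusable handler (the handler name strip_leading_marker is hypothetical, and only the first of the seven filter values is shown), the value is simply set inside the handler, at the cost of redoing the coercion on every call:

```applescript
on strip_leading_marker(theLine)
	-- Local variable instead of a property: set each time the handler runs.
	set a1 to «data utxtFFFC» as Unicode text
	if theLine begins with a1 and (length of theLine) > 1 then
		return text 2 thru -1 of theLine
	end if
	return theLine
end strip_leading_marker

strip_leading_marker("plain line") -- returned unchanged, since it has no leading marker
```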
If you access sites other than www.wpr.org, you can probably skip the filter lines. For example, with apple.com:
set temp to ((path to temporary items as Unicode text) & "wpr.txt") -- define tempfile
set Ptemp to quoted form of POSIX path of temp
do shell script "curl http://www.apple.com/ -o " & Ptemp
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile
set {TID, text item delimiters} to {text item delimiters, return}
set theText to theText as Unicode text
set text item delimiters to TID
theText
Note: in Leopard you don’t even need the temp file, because textutil can convert stdin to stdout.
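A minimal sketch of that pipe-based variant, assuming Leopard or later and textutil’s -stdin and -stdout options. The switch to UTF-8 output is my assumption, made because do shell script reads the command’s output as UTF-8, and the variable name theURL is only for illustration:

```applescript
-- Sketch only: pipes curl's output straight into textutil, so no temp file is needed.
set theURL to "http://www.apple.com/"
set theText to do shell script "curl -s " & quoted form of theURL & ¬
	" | textutil -stdin -stdout -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-8"
theText
```

Since do shell script returns the whole output as one string, the paragraphs/rejoin step from the earlier versions isn’t needed unless you want to filter individual lines.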
If you happen to own a copy of DEVONagent or DEVONthink, then the following code is a very convenient way to extract (rich) text from a given website. I use this a lot in our company when it comes to automated patent analysis:
tell application "DEVONagent"
set htmlcode to download markup from "http://en.wikipedia.org/wiki/Steve_Wozniak"
set sitetext to get rich text of htmlcode
end tell
If you only own DEVONthink, then just replace DEVONagent with DEVONthink in the above code example.