What is the AppleScript equivalent of the Automator action “Get Text from Webpage”?
I was using curl, but it has recently started returning a “This Document has Moved to here” with a link to the original page. Maybe someone doesn’t like me requesting their HTML source?
Thanks StefanK! That totally worked for getting around the redirect.
I think I’m still looking for a text-based “print preview” version of the webpage, because I have a lot of predefined tags I’m using to parse data. With curl I end up having to strip a lot of HTML.
Is there really no AppleScript equivalent of “Get Text from Webpage”? Or is it easier to convert the HTML to text after I’ve done the curl?
You can get the source of a Safari webpage, but the result would be the same as the result of the curl command.
property newline : character id 10 -- tinkered to use id instead of deprecated ASCII character
tell application "Safari"
	-- The following line collects info used by the Terminal editors
	set myURL to the URL of document 1 -- "as string" removed; uses text as of AS 2.0
	-- Retrieve the source from the browser
	set mySource to the source of document 1 -- "as text" (was "as string") removed; unnecessary coercion as of AS 2.0
end tell
This is shamelessly stolen from Daniel S. Rubin’s get browser source scripts (dan@webgraph.com).
I’m sure there is somebody out there who has some great scripts for stripping away the header and such until you get to the body, which is what you are really interested in.
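One way to avoid most of the manual stripping might be to let the textutil command-line tool that ships with macOS convert the HTML to plain text right after curl has fetched it. The following is only a sketch under a few assumptions: pageURL is a placeholder address, the page is assumed to be UTF-8 encoded, and the -L flag is included so that curl follows a “document has moved” redirect on its own.

```applescript
-- Sketch: fetch a page with curl and let textutil turn the HTML into plain text.
-- pageURL is a placeholder; substitute the page you are parsing.
set pageURL to "https://www.example.com/"

-- -s hides the progress meter, -L follows redirects.
set fetchCommand to "curl -sL " & quoted form of pageURL

-- textutil reads HTML from stdin and writes plain text to stdout.
-- The UTF-8 input encoding is an assumption about the page.
set convertCommand to "textutil -convert txt -format html -stdin -stdout -inputencoding UTF-8 -encoding UTF-8"

set pageText to do shell script fetchCommand & " | " & convertCommand

-- Split into lines so predefined tags can be searched line by line.
set pageLines to paragraphs of pageText
```

The resulting text will not necessarily match what the Automator action returns line for line, so any predefined tags should be checked against a sample page first.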
I believe you would really need a “possessive” version of the extractBetween script from this post, one that considers nested tags between the tags, so that you could parse the source hierarchically. By possessive in this context I mean that it would return all the text between the start tag and its end tag, exclude the tags searched for,
and take “malformed” pages into consideration, for example that several tags can appear on the same line.
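To make the idea concrete, here is a minimal sketch of such a handler. It is not the extractBetween script from the linked post: the handler name textBetweenOutermost and its parameters are made up for illustration. It approximates “possessive” by returning everything after the first start tag and before the last end tag, so it copes with the same tag nested inside itself, but not with sibling tags or with start tags that carry attributes.

```applescript
-- Hypothetical handler, not the extractBetween script from the linked post.
-- Returns the text after the first startTag and before the last endTag,
-- excluding both tags. Returns "" when either tag is missing.
on textBetweenOutermost(sourceText, startTag, endTag)
	set savedDelimiters to AppleScript's text item delimiters
	set foundText to ""
	try
		-- Drop everything up to and including the first start tag.
		set AppleScript's text item delimiters to startTag
		set parts to text items of sourceText
		if (count of parts) > 1 then
			set remainder to (items 2 thru -1 of parts) as text
			-- Drop the last end tag and everything after it.
			set AppleScript's text item delimiters to endTag
			set parts to text items of remainder
			if (count of parts) > 1 then
				set foundText to (items 1 thru -2 of parts) as text
			end if
		end if
	end try
	-- Always restore the delimiters, even after an error.
	set AppleScript's text item delimiters to savedDelimiters
	return foundText
end textBetweenOutermost

-- Nested example: the inner <div> pair is kept, the outer pair is removed.
textBetweenOutermost("<div>a<div>b</div>c</div>", "<div>", "</div>")
-- result: "a<div>b</div>c"
```

Because the tags do not have to sit on separate lines, several tags on the same line are not a problem for this approach. Truly hierarchical parsing would still need a handler that counts opening and closing tags.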
This is a topic I really know far too little about, so for the fun of it I tried apropos html in a Terminal window
and voilà: I found a command named htmlparse, but this package is for people using Tcl, and I’m not one of those. Maybe some others can help you with this issue.
I’m sure there are some handlers for parsing HTML with AppleScript here that could suit you; running this query should turn them up even if you don’t know exactly what you are looking for.
From the dictionary:
URL (get/set, Unicode text): The current URL of the document.
source (get, Unicode text): The HTML source of the web page currently loaded in the document.
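Alongside URL and source, Safari’s dictionary also lists a text property on the document class, which holds the text of the loaded page rather than its HTML. That may be the closest thing to “Get Text from Webpage”. A minimal sketch, assuming a page is already loaded in the front document and that your Safari version exposes the property (check it in Script Editor’s dictionary viewer):

```applescript
tell application "Safari"
	-- Assumes a page is already loaded in the front document.
	set pageURL to URL of document 1
	-- The page's text, without the HTML markup.
	set pageText to text of document 1
end tell

-- Split into lines so predefined tags can be searched line by line.
set pageLines to paragraphs of pageText
```

Unlike curl, this needs Safari to open and finish loading the page first, so it suits interactive use better than batch fetching.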
By the way: since AppleScript 2.0 there is a constant, linefeed, for character id 10.