Get Text from Webpage - Applescript

newhero · June 11, 2010, 8:13am

What is the AppleScript equivalent of the Automator action “Get Text from Webpage”?

I was using curl, but it has recently started returning a “This Document has Moved to here” with a link to the original page. Maybe someone doesn’t like me requesting their HTML source?

Thank you.

StefanK · June 11, 2010, 8:16am

Hi,

Try curl with the -L switch, which follows redirects.
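
For reference, a minimal sketch of how the -L switch might be used from AppleScript. The handler name fetchSource and the example URL are placeholders for illustration, not part of anyone's actual script. The -s flag (an addition here) only silences curl's progress meter, and quoted form of protects URLs containing characters such as & or ? from the shell.

```applescript
-- Hypothetical handler: fetch the HTML source of a page, following redirects.
on fetchSource(theURL)
	-- -L follows redirects; -s suppresses the progress meter
	return do shell script "curl -sL " & quoted form of theURL
end fetchSource

-- Placeholder URL; substitute the page you are actually fetching
set pageSource to fetchSource("http://www.example.com/")
```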

newhero · June 12, 2010, 1:54am

Thanks StefanK! That totally worked for getting around the redirect.

I think I’m still looking for a text-based “print preview” version of the webpage, because I have a lot of predefined tags I’m using to parse data. With curl I end up having to strip a lot of HTML.

Is there really no AppleScript equivalent of “Get Text from Webpage”? Is it easier to convert the HTML to text after I’ve run curl?

Dylan_Weber · June 12, 2010, 3:52am

Here is what I use:

do shell script "curl " & quoted form of site & " | textutil -stdin -stdout -format html -convert txt -encoding UTF-8"

McUsr · June 12, 2010, 8:52am

Hello

You can get the source of a Safari webpage, but the result would be the same as the result of the curl command.


property newline : character id 10 -- tinkered to use id instead of deprecated ASCII character

tell application "Safari"
	-- The following line collects info used by the Terminal editors
	set myURL to the URL of document 1 -- "as string" removed; the URL is already text as of AS 2.0
	
	-- Retrieve the source from the browser
	set mySource to the source of document 1 -- "as text" removed (was "as string"); unnecessary coercion as of AS 2.0
end tell

This is shamelessly stolen from Daniel S. Rubin’s “get browser source” scripts (dan@webgraph.com).
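
As a side note, Safari's scripting dictionary also lists a text property on documents, next to source and URL. A minimal sketch, assuming Safari is running with a page loaded in its front document (the variable name pageText is just illustrative):

```applescript
tell application "Safari"
	-- "text" is the rendered text of the page, without the HTML markup
	set pageText to text of document 1
end tell
```

This may be the closest thing to the Automator action, though the result depends on how Safari renders the page, so it is worth checking against your own tags before relying on it.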

I’m sure there is somebody out there who has some great scripts for stripping away the header and such until you get to the body, which is what you are really interested in.

I believe that you would really need a “possessive” version of the extractBetween script from this post, one that takes nested tags into account, so that you could parse the source hierarchically. By possessive in this context I mean that it would return all the text between the start tag and its matching end tag, excluding the searched-for tags themselves. It should also take “malformed” pages into consideration, for example several tags appearing on the same line.
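
To make the idea concrete, here is a rough sketch of what such a handler could look like. It is not the extractBetween script from the linked post; the name extractBetweenNested and its parameters are made up for illustration. It assumes the start and end tags are given as literal strings (so a tag with attributes such as <div class="x"> would need a smarter match) and it simply counts nesting depth.

```applescript
-- Hypothetical handler: return the text between the first startTag and its
-- matching endTag, counting nested occurrences of the same tag pair.
-- Returns "" if the start tag is missing or the tags are unbalanced.
on extractBetweenNested(theText, startTag, endTag)
	set startPos to offset of startTag in theText
	if startPos is 0 then return ""
	set contentStart to startPos + (length of startTag)
	set scanPos to contentStart
	set depth to 1
	repeat while depth > 0
		if scanPos > (length of theText) then return "" -- ran off the end
		set remainingText to text scanPos thru -1 of theText
		set nextOpen to offset of startTag in remainingText
		set nextClose to offset of endTag in remainingText
		if nextClose is 0 then return "" -- no matching end tag
		if nextOpen is not 0 and nextOpen < nextClose then
			-- a nested start tag comes first: go one level deeper
			set depth to depth + 1
			set scanPos to scanPos + nextOpen - 1 + (length of startTag)
		else
			-- an end tag comes first: close one level
			set depth to depth - 1
			if depth is 0 then
				set contentEnd to scanPos + nextClose - 2
				if contentEnd < contentStart then return "" -- empty element
				return text contentStart thru contentEnd of theText
			end if
			set scanPos to scanPos + nextClose - 1 + (length of endTag)
		end if
	end repeat
end extractBetweenNested

extractBetweenNested("<div>a<div>b</div>c</div>d", "<div>", "</div>")
-- returns "a<div>b</div>c"
```

Because it works on character offsets rather than lines, several tags on the same line are not a problem; genuinely malformed pages (unclosed tags) just yield an empty string here.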

This is a topic I really know far too little about, so for the fun of it I tried apropos html in a Terminal window, and voilà: I found a command named htmlparse. But this package is for people using Tcl, and I’m not one of those. Maybe some others can help you with this issue.

I don’t know exactly what you are looking for, but I’m sure there are some handlers for parsing HTML with AppleScript here that could suit you if you run this query.

Best Regards

McUsr

StefanK · June 12, 2010, 10:40am

two unnecessary coercions

From the dictionary:
URL get/set unicode text The current URL of the document.
source get unicode text The HTML source of the web page currently loaded in the document.

By the way: Since AppleScript 2 there is a constant linefeed for character id 10
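
A small sketch of what that looks like in practice (the variable names are only for illustration):

```applescript
-- linefeed is a built-in constant in AppleScript 2.0 and later,
-- so no property is needed for character id 10
set sample to "first" & linefeed & "second"
set sameCharacter to (linefeed is (character id 10))
set lineCount to count of paragraphs of sample
```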

McUsr · June 12, 2010, 1:16pm

Thanks Stefan

I’ll correct it in the post above, that is, omit the coercions in both cases.

It’s so useful when you remind me of this, making the matter stick.
Hopefully you won’t have to do this all of the time. But please don’t stop.

Best Regards

McUsr