Help extracting text from webpages to MS Word required.

elloydowen · May 31, 2011, 2:46pm

Hoping that someone is able to help me save hours in my day…

Every day I produce a report for my boss containing a number of news articles. I select the articles in Google Reader by marking them as favourites. I then open each article in turn in my browser, select the title, article text, publication and URL, and copy it all into Microsoft Word for Mac 2011. I then format it and email it to the boss.

I’m trying to make Automator handle the opening and extraction of each article: I give it the URL of my Google Reader favourites list and get back a Word doc with one article per page.

I’m stuck.

I can make it extract the RSS feed summaries (easy). I can make it find the URLs of the articles and open them (easy). But I can’t make Automator open the URLs, extract only the relevant text and put it into a Word doc… I know that a script is required but I can’t write it, and I’m sure it’s a pretty simple task.

PLEASE can anyone help me and provide the very small link that I’m missing…? It will save me HOURS.

Thank you in advance.

PS. I’m now teaching myself AppleScript. Any advice on the best way of doing it would be much appreciated!!

Model: Macbook Pro 2.66 GHz Intel Core 2 Duo
AppleScript: 2.3
Browser: Firefox 4.0.1
Operating System: Mac OS X (10.6)

Adam_Bell · May 31, 2011, 9:14pm

You’re stuck for a good reason – the articles’ HTML will, in general, all be different, so identifying content will not be a trivial pursuit. In my weather tutorial I was going to a single site; you want to go to several. “…extract only the relevant text and put it into a word doc…” requires knowing how to identify the relevant text.

I think I’d be inclined to write a script that opened a new Word doc, then put the contents of the clipboard into it sequentially, i.e. you select the bits you want and invoke the script repeatedly. I don’t have Word for Mac 2011 so I can’t help you there.
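For anyone who wants to try the clipboard idea without Word, here is a minimal sketch in Python rather than AppleScript. It assumes a plain-text report file is acceptable and that a form feed character will show up as a page break when the file is opened in Word, which is worth checking on your own setup. The names `read_clipboard`, `append_clipping` and the report path are made up for illustration; `pbpaste` is the standard macOS command-line tool for reading the clipboard.

```python
import subprocess
from pathlib import Path

# Form feed; assumed (not verified here) to become a page break in Word.
PAGE_BREAK = "\f"


def read_clipboard() -> str:
    """Return the current clipboard text via macOS's pbpaste command."""
    result = subprocess.run(
        ["pbpaste"], capture_output=True, text=True, check=True
    )
    return result.stdout


def append_clipping(report: Path, clipping: str) -> None:
    """Append one clipping to the report, one clipping per page."""
    clipping = clipping.strip()
    if not clipping:
        return  # nothing was selected, so leave the report untouched
    existing = report.read_text(encoding="utf-8") if report.exists() else ""
    # Only put a page break between clippings, not before the first one.
    separator = PAGE_BREAK if existing else ""
    report.write_text(existing + separator + clipping + "\n", encoding="utf-8")
```

The intended use would be: select and copy an article in the browser, then run something like `append_clipping(Path("report.txt"), read_clipboard())`, and repeat for each article.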

elloydowen · June 1, 2011, 6:58am

Ha ha! At least it’s not just me! And I thought it would be simple…

Thank you for your reply though, Adam. The problem is that I’m brand new to AppleScript and can’t yet write it, so this is way beyond my very meagre capabilities.

To be honest, the Word doc bit is slightly academic: just generating a text file would be fine, as I can then cut and paste and format it afterwards. I’ve managed to achieve it with a single URL, but I end up with strings and strings of text from elsewhere on the webpage that are other links or simply irrelevant; all I want is the text of the actual article and its header (without images). I can’t, however, do it with the RSS feed of my Starred Items from Google Reader.

However, it occurred to me that I’m only really pulling articles from the same handful of news sources (Reuters, Bloomberg, WSJ, FT, NYT, The Times etc.). Maybe the script could look for the link that takes the reader to the print-friendly version (all of the articles have one, and they all seem to share the word ‘print’ in the URL) and then take all the text from that page. That would be perfectly acceptable, as that’s generally how I do it anyway. Do you think that would work?
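To make the print-link idea concrete, here is a rough sketch in Python (not AppleScript), using only the standard library. It assumes the print-friendly link is an ordinary `<a href>` whose URL contains "print", which is a guess based on the description above and will also match unrelated URLs such as ones containing "blueprint". The names `find_print_link` and `page_text` are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PrintLinkFinder(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains 'print'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Heuristic only: any href mentioning "print" counts as a match.
        if href and "print" in href.lower():
            self.links.append(urljoin(self.base_url, href))


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def find_print_link(html, base_url):
    """Return the first print-style link on the page, or None."""
    finder = PrintLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.links[0] if finder.links else None


def page_text(html):
    """Return the page's visible text, one chunk per line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)
```

In use, you would fetch each article's HTML (for example with `urllib.request.urlopen`), pass it to `find_print_link` along with the article URL, fetch the returned link, and run `page_text` on that. Even a print page may still carry a few stray links or captions, so some tidying by hand is likely to remain.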

But yes, as you say, not a trivial pursuit. So, if anyone out there has the capability, inclination and time to help me, I would be extraordinarily grateful…

Many thanks.

Ed
