Get web page data and work with results

Hi all,

What would be the best way to go about doing the following:

Every day at a certain time, I want to capture the source data of a given web page. I then want to do some work on that data using AppleScript and Tex-Edit Plus 4.9.7.

I prefer not to use iCal, since I don’t use that program and I’d rather not get it going just for this one thing. I would prefer to use Firefox to get the source data, but using iCab, OmniWeb, or Safari is perfectly fine. If there are other options available, no problem (as long as I don’t need to go installing extra software).

The source data is just text on a page. The only HTML tags used are a handful of basic ones (along with their closing tags). There is no fancy formatting, and there are no non-English characters in the data.

I can handle writing the script to work with the data, but I don’t know how to capture it in the first place. Ideally, I will be able to save the source data and call a second script to work on it (which would be done in Tex-Edit Plus).

Thanks for any help!

Model: PowerBook PowerPC G4
AppleScript: 1.9.3
Browser: Firefox 2.0.0.7
Operating System: Mac OS X (10.3.9)

Hi,

It depends on the structure of the source data, but the most elegant way to get it is with the shell command curl. It doesn't require any browser and looks like this:

set sourcetext to do shell script "curl http://www.mypage.com"
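
If you also want to save the result to a file so a second script (or Tex-Edit Plus) can pick it up later, something along these lines should work. This is only an untested sketch; the desktop location and file name are just placeholders:

set sourceText to do shell script "curl http://www.mypage.com"
-- build an HFS path to a plain text file on the desktop (placeholder location)
set theFile to (path to desktop as text) & "source.txt"
-- write the raw source to the file, replacing any previous contents
set fileRef to open for access file theFile with write permission
set eof of fileRef to 0
write sourceText to fileRef
close access fileRef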

Perfect! That gets me exactly what I want. It includes the HTML tags as well, but removing them is very easy now that I have the source data.

What would be a good way to launch this script on a timed basis (once a day)?

Take a look at my launchd installer script. That's a great solution because it works completely in the background.
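
For reference, a launchd job that runs a script once a day is just a small property list in ~/Library/LaunchAgents, roughly like the sketch below (the label and script path here are placeholders); note that launchd requires Mac OS X 10.4 or later.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<!-- placeholder label and script path; adjust to your setup -->
	<key>Label</key>
	<string>com.example.fetchpage</string>
	<key>ProgramArguments</key>
	<array>
		<string>/usr/bin/osascript</string>
		<string>/Users/yourname/Scripts/fetchpage.scpt</string>
	</array>
	<!-- run every day at 9:00 AM -->
	<key>StartCalendarInterval</key>
	<dict>
		<key>Hour</key>
		<integer>9</integer>
		<key>Minute</key>
		<integer>0</integer>
	</dict>
</dict>
</plist>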

Hmm, I'm still running Mac OS X 10.3.9. I probably should update, but I was hoping to put it off a little longer (at least until Leopard comes out). Are there any other options, or should I consider using iCal after all (which I'd rather avoid)?

You can use Cronnix, which is a front end to cron that runs in OS X 10.3.
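
Under the hood, Cronnix just edits your crontab; a daily run at, say, 9:00 AM comes down to a single line like the one below (the script path is only an example, and Cronnix lets you set this up through its interface rather than editing the file by hand):

# minute  hour  day-of-month  month  day-of-week  command
# the script path below is just an example
0 9 * * * /usr/bin/osascript /Users/yourname/Scripts/fetchpage.scpt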

I also tried this out; a very elegant solution.

I couldn't get it to work with a URL that has a query string, though (like www.mydomain.com/page.php?name=Bob). With this script I got the same value every time, as if I hadn't passed any query at all.

Is there any workaround, or am I doing something wrong?

I searched a bit and found this: http://bbs.applescript.net/viewtopic.php?pid=72647

So it works fine when done like this (now with variables, which is what I wanted in the first place):

set nam to "Bob"
set pho to "12345"
-- quoted form puts the URL in single quotes, so the shell doesn't treat ? and & as special characters
set theUrl to quoted form of ("http://www.mydomain.com/page.php?name=" & nam & "&phone=" & pho)
do shell script "curl " & theUrl

This forum is so awesome, thanks a lot!

Stefan and Adam, thank you both so much for your help. I’ve been playing around with Cronnix and my script, and things are almost perfect. If you have time for another question I would appreciate your help again.

I’ve found that when I launch my script via Cronnix, it runs extremely slowly. If I run it from Tex-Edit itself, it finishes almost instantaneously. Running it from ScriptDebugger is slower than Tex-Edit (but I expect this).

Why does the script run so slowly from the cron job?

I found that if I wanted the script to launch at all, I needed to save it as an Application (Carbon). Is this what causes the slowness? Is there any way to speed the script up?