I have PDF files which are print-outs of web pages (made with cups-pdf). (If that means nothing to you, just assume I have saved my web pages as PDF files.)
In the footer (bottom) of each page of such a PDF, I can see the URL of the source web page (e.g. http://macscripter.net/viewtopic.php?id=16145).
Now I want to extract the URL from each such PDF and open it in Firefox. (I am prepared to use UI scripting to deal with Firefox.)
I can’t work out how to refer to the footer of a page to get the URL. Though I use Skim for PDFs, I don’t mind using Preview for this task; I just want the URLs. Since there are a large number of PDF files (about 50), a script that processes the files selected in Finder would be better. Please get me started with the URL part; I can probably figure out the rest from there.
set thefile to choose file
tell application "Finder" to set filename to name of thefile

tell application "Adobe Acrobat Professional"
	activate
	open thefile
	-- select all the text in the document and copy it to the clipboard
	execute menu item "SelectAll" of menu "Edit"
	if enabled of menu item "Copy" of menu "Edit" then
		execute menu item "Copy" of menu "Edit"
		set theText to the clipboard
	else
		set theText to "No text found in PDF"
	end if
	close document 1 saving no
end tell

-- scan every paragraph of the copied text for a URL
set url_list to {}
repeat with i from 1 to the count of paragraphs of theText
	set this_para to paragraph i of theText
	if this_para contains "http://" then
		set oldDels to AppleScript's text item delimiters
		-- everything after the first "http://" ...
		set AppleScript's text item delimiters to "http://"
		set a to text item 2 of this_para
		-- ... up to the next space is the URL
		set AppleScript's text item delimiters to " "
		set the_url to text item 1 of a
		set AppleScript's text item delimiters to oldDels
		copy "http://" & the_url to end of url_list
	end if
end repeat

choose from list url_list
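Since you mentioned wanting a script that processes the files selected in Finder, here is a rough, untested sketch of how that outer loop might look (extract_url is just a hypothetical stand-in for whichever extraction method you end up using):

set url_list to {}
tell application "Finder" to set theSelection to selection as alias list
repeat with thefile in theSelection
	-- extract_url is a placeholder: put the PDF-to-URL logic above in here
	set end of url_list to my extract_url(thefile)
end repeat

on extract_url(thefile)
	-- hypothetical handler: return the URL found in thefile's footer
	return "http://..."
end extract_url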
Thanks a lot, Blend3
I will modify it for the Skim PDF reader. I get the idea of what you are doing. I guess, if we searched paragraph beginnings (instead of using paragraph “contains”), it would make sense to check only the last paragraph, and only on page 1 of each PDF file, because all pages carry the same link. Would it be possible to do so?
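Something along these lines is what I have in mind (untested, and assuming Skim’s “get text for” keeps the paragraph layout intact):

tell application "Skim"
	open thefile
	-- thefile chosen earlier, as in your script; look at page 1 only
	set pageText to get text for page 1 of document 1
end tell
-- the footer URL should be the last paragraph of the page
set lastPara to last paragraph of pageText
if lastPara starts with "http://" then set the_url to lastPara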
Hi Chris,
I totally agree with your logic, but Skim doesn’t appear to retain the formatting of the PDF when you get every paragraph or text, so I came up with this:
set thefile to choose file

tell application "Skim"
	open thefile
	-- grab the text of the first page only
	set theText to get text for page 1 of document 1
	set oldDels to AppleScript's text item delimiters
	-- everything after the first "http://" ...
	set AppleScript's text item delimiters to "http://"
	set a to text item 2 of theText
	-- ... up to the next space is the URL
	set AppleScript's text item delimiters to " "
	set b to text item 1 of a
	set AppleScript's text item delimiters to oldDels
	set the_url to "http://" & b
end tell

the_url
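And to cover the Firefox part of your first post, something like this should work (untested; “open location” is standard AppleScript, so UI scripting may not even be needed):

-- hand the extracted URL to Firefox; dropping the tell block
-- would open it in your default browser instead
tell application "Firefox" to open location the_url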
To blend3:
Thanks, blend3, for writing the script. (I had some problems using it, which I will be able to fix if I have to use your script.)
To Jacques:
Many thanks once again. Your script is exactly what I had asked for when I made my first post in this thread.
I used Safari for six months and had problems with the way it saved web pages, so I took to saving web pages as PDFs instead. But that had several drawbacks which I could no longer live with. It took me quite some time to find the perfect way to save web pages; I won’t name the Firefox plugin I settled on, since I don’t want anyone to think that I am unduly publicizing it.
I can’t tell you how astonished I am with the way the script worked.
I set up a hotkey for your script and converted all 150+ of my PDF files (scattered across several different places) into web pages within 3 minutes, because the script also works in a Finder window showing Spotlight results.
I still can’t believe how quickly it all happened.
I wish I knew what this code meant:
Can you (or anyone) tell me what in bash I should refer to in order to understand it? (I know nothing about bash.)