I prefer to listen to my newspapers rather than read them, and the files also serve as better storage, so I download an electronic copy. I want to automate this task since I do it every day and it is very time-consuming: I have to click on every article manually to download it.
Here is the sample URL:
“2008/10/02/20081002” to the current date in “yyyy/mm/dd/yyyymmdd” format
a A 001 1010 20
here, “a” never changes
“A” changes to any random alphabet in capital (e.g. “B”, “J” etc)
“001” is the page number…so if the page number is 12 then it would be “012”
“1010” never changes
“20” is the article number on that page …so if the article number is 6 then it is “06”
I tried this:
set the target_URL to "http://epaper.business-standard.com/bsepaper/pdf/2008/10/02/20081002aA001101020.pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
set the destination_file to ((path to desktop as string) & "a.pdf")
tell application "URL Access Scripting"
download target_URL to destination_file
end tell
and it works for a single file. (I think a login is compulsory before downloading, but once I log in from Safari I can use this script.)
Can someone please help me set up the target_URL correctly?
Yes. I have often written such scripts for my former employer to automatically download electroplating patents from the internet into a large database containing acid copper plating patent information.
For example, the first part, the date string, could be generated with code like the following:
set command to "date \"+%Y/%m/%d/%Y%m%d\""
set datestring to do shell script command
-- "2008/10/03/20081003"
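If the string is needed for a day other than today (for instance yesterday's issue, or an older one from the archive), a pure AppleScript variant might look like the sketch below. The handler name paperdatestring is made up for illustration, and the sketch assumes AppleScript 2.0 or later (Leopard), where a month can be coerced to an integer.

```applescript
-- Hypothetical helper: build "yyyy/mm/dd/yyyymmdd" for any AppleScript date.
-- Assumes AppleScript 2.0+ (month as integer).
on paperdatestring(thedate)
	set y to (year of thedate) as text
	-- prepend "0" and keep the last two characters to get two digits
	set m to text -2 thru -1 of ("0" & ((month of thedate) as integer))
	set d to text -2 thru -1 of ("0" & (day of thedate))
	return y & "/" & m & "/" & d & "/" & y & m & d
end paperdatestring

-- today's issue
set todaysdatestring to paperdatestring(current date)
-- yesterday's issue
set yesterdaysdatestring to paperdatestring((current date) - 1 * days)
```

This avoids the shell call and makes it easy to loop over a range of dates, at the cost of a few more lines than the date command.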
The basic questions are how to find out the random letter and what article and page numbers can be used. Of course you can just try to download a PDF file at a certain generated URL and if it fails (try-block), well, then there is no article or page. But that is not ideal. It works, but is not efficient.
So it might be better to write a script that downloads the article while you are reading it in Safari. Then you could start it, the script reads the URL in the frontmost Safari window and generates the necessary URLs from this source URL. Then the script would already «know» about the article number and the random letter used.
Other questions are how and where to save the PDF files (single PDF files per page, or also combine all pages of an article into one PDF file; location?).
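The idea of reading the source URL from Safari could be sketched roughly as follows. The handler names and record labels are invented for illustration, and the sketch assumes that the file name always follows the fixed-width layout described in the question, which I could not verify against the website.

```applescript
-- Hypothetical helper: split an e-paper PDF URL into its parts.
-- Assumes the fixed-width file name layout described in this thread:
-- yyyymmdd (8) + "a" (1) + random letter (1) + page (3) + "1010" (4) + article (2)
on parsepaperurl(theurl)
	-- drop the "#zoom=..." fragment, if there is one
	set hashoffset to offset of "#" in theurl
	if hashoffset > 1 then set theurl to text 1 thru (hashoffset - 1) of theurl
	-- keep only the last path component, e.g. "20081002aA001101020.pdf"
	set olddelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "/"
	set filename to last text item of theurl
	set AppleScript's text item delimiters to olddelims
	return {datepart:text 1 thru 8 of filename, randomchar:character 10 of filename, pagenumber:text 11 thru 13 of filename, articlenumber:text 18 thru 19 of filename}
end parsepaperurl

-- Read the URL of the frontmost Safari document and parse it
on frontpaperparts()
	tell application "Safari" to set currenturl to URL of front document
	return parsepaperurl(currenturl)
end frontpaperparts
```

With an article open in Safari, calling frontpaperparts() should return a record from which the random letter and the page number can be reused to generate the remaining URLs.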
set base_URL to "http://epaper.business-standard.com/bsepaper/pdf/2008/10/03/20081003a_01210100"
set URL_options to ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
tell application "URL Access Scripting"
repeat with x from 1 to 9
-- build the URL and the file name inside the loop so that each article number gets its own download
set the target_URL to base_URL & x & URL_options
set the destination_file to ((path to desktop as string) & "b" & x & ".pdf")
download target_URL to destination_file
end repeat
end tell
I don't know why the random letter was just “_” for all pages today.
I set the date once manually and kept changing the page number; as you can see, I have a counter for the article number.
I would not mind having this…because I wrote a script to delete such files.
set ifolder to alias "Leopard:Users:lance:Desktop"
set dfolder to alias "Leopard:Users:lance:Desktop:for deleting"
tell application "Finder"
set allitems to every item of ifolder
repeat with aItem in allitems
get size of aItem
set z to the result
if z is less than 4096 and name extension of aItem is "pdf" then
move aItem to dfolder
end if
end repeat
end tell
Please help me set up a script so that I don't have to change the page number manually. Let us say all I want is the first 12 articles of each page.
I just wrote you some sample code, but I could not test it, as I don’t have access to this newspaper website. It first asks you for today's random article character and then tries to download the articles to your desktop. I have written the code in a way that it is (hopefully) easy to understand AND modify. I know that it is not perfect yet, but maybe it is a step in the right direction.
property mytitle : "Paper-O-Mat"
property maxarticles : 20
property maxpages : 20
property destfolderpath : "Leopard:Users:lance:Desktop:"
property starturl : "http://epaper.business-standard.com/bsepaper/pdf/"
property endurl : ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
on run
set todaysrandomchar to my askfortodaysrandomchar()
if todaysrandomchar is missing value then
return
end if
set todaysdatestring to my gettodaysdatestring()
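repeat with articleindex from 1 to maxarticles
set articlenumber to my createarticlenumber(articleindex)
-- use a separate loop variable so the page loop does not shadow the article loop
repeat with pageindex from 1 to maxpages
set pagenumber to my createpagenumber(pageindex)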
set pageurl to starturl & todaysdatestring & "a" & todaysrandomchar & pagenumber & "1010" & articlenumber & endurl
set filename to my createfilename(articlenumber, pagenumber)
set filepath to destfolderpath & filename
try
tell application "URL Access Scripting"
download pageurl to filepath
end tell
end try
end repeat
end repeat
end run
on askfortodaysrandomchar()
try
tell me
activate
display dialog "Please enter today's random character:" default answer "" buttons {"Cancel", "Enter"} default button 2 with title mytitle
set dlgresult to result
end tell
set usrinput to text returned of dlgresult
return usrinput
on error
return missing value
end try
end askfortodaysrandomchar
on gettodaysdatestring()
set command to "date \"+%Y/%m/%d/%Y%m%d\""
set todaysdatestring to do shell script command
return todaysdatestring
end gettodaysdatestring
on createarticlenumber(i)
if i < 10 then
return ("0" & i) as Unicode text
else
return (i as Unicode text)
end if
end createarticlenumber
on createpagenumber(i)
if i < 10 then
return ("00" & i) as Unicode text
else if i = 10 or (i > 10 and i < 100) then
return ("0" & i)
else if i = 100 or i > 100 then
return (i as Unicode text)
end if
end createpagenumber
on createfilename(articlenum, pagenum)
set command to "date \"+%Y%m%d\""
set datestring to do shell script command
set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"
return filename
end createfilename
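The two padding handlers above do the same job for different widths, so they could probably be folded into one generic helper. The following is only a sketch; the name zeropad is made up and is not part of the script above.

```applescript
-- Hypothetical helper: left-pad a non-negative integer with zeros to a given width.
on zeropad(n, width)
	set s to n as text
	repeat while (length of s) < width
		set s to "0" & s
	end repeat
	return s
end zeropad

-- article numbers use two digits, page numbers three
set articlenumber to zeropad(6, 2)
set pagenumber to zeropad(12, 3)
```

In the script, createarticlenumber(i) would then correspond to zeropad(i, 2) and createpagenumber(i) to zeropad(i, 3).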
Thank you, Martin Michel, for your efforts…you gave me what I asked for, but sorry, I did not ask for the right thing.
The images (i.e. advertisements) take up a lot of file size, so my newspaper turns out to be some 100 MB. (I downloaded just the first 5 items with your script and stopped it, and it was already 10 MB.) Is there a way to get the size of the file first and then reject the download if it exceeds 300 KB?
Whenever I download manually using Safari, I see the file size before the download finishes (e.g. “downloading 1 KB of 125 KB”), so would this be possible with AppleScript too?
Thanks again for your efforts.
(If you want, I can send you the username and password through a private message.)
I am currently in a hurry, so unfortunately I don’t have time to post code, but maybe I can give you some hints in the right direction. If you switch from downloading files with URL Access Scripting to «curl», then you could also use its arguments to get control over the file size:
Or
The forums contain a lot of examples of how to easily use curl in AppleScript to download files.
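One possible way to learn the size before downloading is to ask the server for the headers only and read the Content-Length value. The sketch below is untested against the newspaper website; the handler names are invented, and it assumes that the server answers a HEAD request with a Content-Length header and that no login cookie is needed for the request (curl does not share Safari's session).

```applescript
-- Hypothetical helper: extract the Content-Length value (in bytes) from raw HTTP headers.
-- Returns missing value if the header is absent.
on contentlengthfrom(headertext)
	repeat with headerline in paragraphs of headertext
		set headerline to contents of headerline
		-- AppleScript text comparisons ignore case by default
		if headerline starts with "Content-Length:" then
			-- keep only the digits of the header line
			set digits to ""
			repeat with c in characters of headerline
				set ch to contents of c
				if ch is in "0123456789" then set digits to digits & ch
			end repeat
			if digits is not "" then return digits as integer
		end if
	end repeat
	return missing value
end contentlengthfrom

-- Hypothetical helper: ask the server for the headers only and return the announced size
on remotefilesize(theurl)
	-- -s: silent, -I: fetch the headers only (HEAD request)
	set headertext to do shell script "curl -sI " & quoted form of theurl
	return contentlengthfrom(headertext)
end remotefilesize
```

A script could then skip a URL when remotefilesize(pageurl) returns a number above 307200 (300 KB), and fall back to downloading when it returns missing value.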
I was just testing whether I could download with curl, so I modified the above script like this:
property maxarticles : 20
property maxpages : 10
.....
set articlenumber to my createarticlenumber(i)
set errorcnt to 0
......
try
do shell script "curl " & quoted form of pageurl & " -o " & quoted form of destfolderpath
on error
set errorcnt to errorcnt + 1
end try
......
if errorcnt > 0 then display dialog (errorcnt as text) & " errors while downloading files" buttons {"Done"} default button 1
end run
I get an error like “10 errors while downloading files” and nothing gets downloaded. Any ideas why?
Also, please help me set up the “max-filesize” option.
Stop!!
I removed everything related to the “error” display to see the actual error.
The problem is with the website…it is still showing the newspaper dated 4th October instead of 5th October. I mean, it has not uploaded 5th October’s newspaper yet.
If you are using the curl command to download files, then you need to use (quoted) POSIX paths instead of Mac paths:
Mac path: “Macintosh HD:Users:lance:Desktop:article.pdf”
POSIX path: “/Users/lance/Desktop/article.pdf”
You can create a POSIX path from a Mac path like this:
set macpath to "Macintosh HD:Users:lance:Desktop:article.pdf"
set posixpath to POSIX path of macpath
-- quoting the path
set posixpath to quoted form of posixpath
If you want to control the maximum file size of a download, you can use the following curl syntax:
set command to "curl http://www.irs.gov/pub/irs-pdf/fw4.pdf -o /Users/lance/Desktop/test.pdf --max-filesize 20000"
try
do shell script command
end try
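When curl fails, do shell script raises an error whose number is curl's exit status, so a script can tell a rejected oversized file apart from other failures. The sketch below relies on two documented curl exit codes (63 for an exceeded maximum file size, 22 for an HTTP error when -f is given); the handler names and the returned strings are made up for illustration. Depending on the curl version, the limit may only be enforced when the server announces the size in advance.

```applescript
-- Hypothetical helper: translate a curl exit status into a short description
on describecurlerror(errnum)
	if errnum is 63 then return "too large"
	-- with -f, curl exits with 22 on HTTP errors such as 404
	if errnum is 22 then return "not found"
	return "failed (" & errnum & ")"
end describecurlerror

-- Hypothetical helper: download theurl to a POSIX path, giving up above maxbytes
on downloadwithlimit(theurl, posixdest, maxbytes)
	-- -f: fail on HTTP errors instead of saving the error page, -s: silent
	set command to "curl -f -s " & quoted form of theurl & " -o " & quoted form of posixdest & " --max-filesize " & maxbytes
	try
		do shell script command
		return "ok"
	on error errmsg number errnum
		return describecurlerror(errnum)
	end try
end downloadwithlimit
```

For example, downloadwithlimit("http://www.irs.gov/pub/irs-pdf/fw4.pdf", "/Users/lance/Desktop/test.pdf", 20000) would return one of the strings above instead of silently swallowing the error.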
Hi Martin! Good to see you back so soon…hope you had a great time!!
What about the URL? Do I have to change it to POSIX?
I get a weird, really long error (which I have uploaded here to save forum space: http://www.mediafire.com/?hcmmwj9htdw) when I did this:
…
on run
set destfolderpath to "Leopard:Users:lance:Desktop:"
set posixpath to POSIX path of destfolderpath
set posixpathq to quoted form of posixpath
…
set filepath to posixpathq & filename
try
do shell script "curl " & quoted form of pageurl & posixpath
Also, will this have to be changed?
set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"
No, only the Mac paths have to be modified in order to work with curl (or any other command line tool).
I have modified the initial script to use curl instead of URL Access Scripting, have a look at how it works.
property mytitle : "Paper-O-Mat"
-- 307200 bytes = 300 KB
property maxfilesize : "307200"
property maxarticles : 20
property maxpages : 20
property destfolderpath : "Leopard:Users:lance:Desktop:"
property starturl : "http://epaper.business-standard.com/bsepaper/pdf/"
property endurl : ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
on run
set todaysrandomchar to my askfortodaysrandomchar()
if todaysrandomchar is missing value then
return
end if
set todaysdatestring to my gettodaysdatestring()
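repeat with articleindex from 1 to maxarticles
set articlenumber to my createarticlenumber(articleindex)
-- use a separate loop variable so the page loop does not shadow the article loop
repeat with pageindex from 1 to maxpages
set pagenumber to my createpagenumber(pageindex)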
set pageurl to starturl & todaysdatestring & "a" & todaysrandomchar & pagenumber & "1010" & articlenumber & endurl
set filename to my createfilename(articlenumber, pagenumber)
set filepath to destfolderpath & filename
set qtdposixfilepath to quoted form of POSIX path of filepath
set command to "curl " & pageurl & " -o " & qtdposixfilepath & " --max-filesize " & maxfilesize
return
-- alternative command (also quoted form of pageurl):
--set command to "curl " & quoted form of pageurl & " -o " & qtdposixfilepath & " --max-filesize 307200"
try
do shell script command
end try
end repeat
end repeat
end run
on askfortodaysrandomchar()
try
tell me
activate
display dialog "Please enter today's random character:" default answer "" buttons {"Cancel", "Enter"} default button 2 with title mytitle
set dlgresult to result
end tell
set usrinput to text returned of dlgresult
return usrinput
on error
return missing value
end try
end askfortodaysrandomchar
on gettodaysdatestring()
set command to "date \"+%Y/%m/%d/%Y%m%d\""
set todaysdatestring to do shell script command
return todaysdatestring
end gettodaysdatestring
on createarticlenumber(i)
if i < 10 then
return ("0" & i) as Unicode text
else
return (i as Unicode text)
end if
end createarticlenumber
on createpagenumber(i)
if i < 10 then
return ("00" & i) as Unicode text
else if i = 10 or (i > 10 and i < 100) then
return ("0" & i)
else if i = 100 or i > 100 then
return (i as Unicode text)
end if
end createpagenumber
on createfilename(articlenum, pagenum)
set command to "date \"+%Y%m%d\""
set datestring to do shell script command
set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"
return filename
end createfilename
Thanks Martin.
I hope I only had to run it and not modify it.
It does nothing except ask for the random character and accept it.
No errors, nothing in the result.
Please try to run it yourself if possible.
I have sent you a PM.
Yes, the script code contained an error. But using the following code worked like a charm on my Mac. I also added a «log» statement for any errors that occur, which you can see when you activate the «Event Log» in Script Editor.
property mytitle : "Paper-O-Mat"
-- 307200 bytes = 300 KB
property maxfilesize : "307200"
property maxarticles : 20
property maxpages : 20
property destfolderpath : "Leopard:Users:lance:Desktop:"
property starturl : "http://epaper.business-standard.com/bsepaper/pdf/"
property endurl : ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
on run
set todaysrandomchar to my askfortodaysrandomchar()
if todaysrandomchar is missing value then
return
end if
set todaysdatestring to my gettodaysdatestring()
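repeat with articleindex from 1 to maxarticles
set articlenumber to my createarticlenumber(articleindex)
-- use a separate loop variable so the page loop does not shadow the article loop
repeat with pageindex from 1 to maxpages
set pagenumber to my createpagenumber(pageindex)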
set pageurl to starturl & todaysdatestring & "a" & todaysrandomchar & pagenumber & "1010" & articlenumber & endurl
set filename to my createfilename(articlenumber, pagenumber)
set filepath to destfolderpath & filename
set qtdposixfilepath to quoted form of POSIX path of filepath
set command to "curl " & quoted form of pageurl & " -o " & qtdposixfilepath & " --max-filesize " & maxfilesize
try
do shell script command
on error e
log e
end try
end repeat
end repeat
end run
on askfortodaysrandomchar()
try
tell me
activate
display dialog "Please enter today's random character:" default answer "" buttons {"Cancel", "Enter"} default button 2 with title mytitle
set dlgresult to result
end tell
set usrinput to text returned of dlgresult
return usrinput
on error
return missing value
end try
end askfortodaysrandomchar
on gettodaysdatestring()
set command to "date \"+%Y/%m/%d/%Y%m%d\""
set todaysdatestring to do shell script command
return todaysdatestring
end gettodaysdatestring
on createarticlenumber(i)
if i < 10 then
return ("0" & i) as Unicode text
else
return (i as Unicode text)
end if
end createarticlenumber
on createpagenumber(i)
if i < 10 then
return ("00" & i) as Unicode text
else if i = 10 or (i > 10 and i < 100) then
return ("0" & i)
else if i = 100 or i > 100 then
return (i as Unicode text)
end if
end createpagenumber
on createfilename(articlenum, pagenum)
set command to "date \"+%Y%m%d\""
set datestring to do shell script command
set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"
return filename
end createfilename