OK. Using bits of everyone’s contributions so far, here are a couple of efforts which only keep the latest details in the database. They read all the existing Date, Description, Model, and Price details from the database beforehand and do running checks against these while they’re updating the entries. Any price drops found are saved up and displayed in TextEdit at the end. I’ve left out the “Page” entry in each record, as it was only recording the number of the first page in each batch, not the number of each page. (Edit: Now reinstated, with individual page numbers, in the vanilla script.) The ‘testing’ boolean at the top of the init() handler allows just a subset of the pages to be download for testing purposes.
The first script is entirely vanilla because I know how to do that. The second requires the Satimage OSAX and was put together around Chocoholic’s regex code to achieve the same end. I think they’re OK, but would be grateful to hear of any problems.
Vanilla script:
(* Vanilla version. (Requires Mac OS 10.6 or later and a machine set up to recognise full stops as decimal points and commas as thousands separators.) Get the relevant lines from the current page batch as individual lines. With each in turn, insert the required details into an sqlite3 command template and append the template to a list of such commands. Locate the equivalent details in a previously extracted dump of the database's current contents and compare the prices. If the new price is less than the old, add an entry to a "notifications" list. When all lines have been processed, coerce the command-batch list to text with a suitable sqlite3 head and tail and execute as a shell script to update the database. When all the batches have been processed, coerce the notifications list to text and display it in TextEdit. *)
on init() -- Initial set-up.
set testing to false -- Limits the number of pages fetched. Set to false for normal use.
set startPage to 1 -- Only heeded if testing is true, otherwise the script starts at page 1.
set endPage to 100 -- Only heeded if testing is true, otherwise the script continues to the last page. (Hopefully!)
-- How many pages will be downloaded at once? (Not too many. There's a limit to the length of text that can be used in a shell script. )
set batchSize to 50
script o -- For speed of access to items in these lists.
property dataLines : missing value
property activeTemplate : missing value
property commandBatch : missing value
property notifications : {"Home Depot Price Reductions Noted on " & date string of (current date), "Last updated | Description | Model number | Original Page Number | Previous price"}
end script
set dbPath to (path to documents folder as text) & "Databases:Home Depot.db"
set head to "sqlite3 " & quoted form of POSIX path of dbPath & " \""
set tail to "\""
try
dbPath as alias
set newDB to false
on error number -43
do shell script (head & "CREATE TABLE depot (id INTEGER PRIMARY KEY, Date DATETIME, Description TEXT, Model TEXT, Internet TEXT, Store_SKU TEXT, Store_SO_SKU TEXT, Page INTEGER, Price DOUBLE); CREATE UNIQUE INDEX Model_Description on depot (Model, Description); \"")
set newDB to true
end try
if (not testing) then
set startPage to 1
set endPage to batchSize
end if
return {testing, startPage, endPage, batchSize, o, head, tail, newDB}
end init
on main()
set {testing, startPage, targetPage, batchSize, o, head, tail, newDB} to init()
if (testing) then
set spokenTarget to targetPage
else
set spokenTarget to "unknown number"
end if
set astid to AppleScript's text item delimiters
repeat until (startPage > targetPage)
set endPage to startPage + batchSize - 1
if (endPage comes after targetPage) then set endPage to targetPage
say ("Pages " & startPage & (" to " & endPage) & (" of " & spokenTarget))
-- Get the next range of pages as a list of the relevant parts of the relevant lines. The relevant lines contain "Email Print" or are the "Next" and "Last" page links. The relevant parts are everything after "Email Print " up to and including the price, or the next or last page number. Single and double quotes (', `, ") are appropriately escaped for insertion into sqlite3 shell script strings.
set pageRange to "[" & startPage & "-" & endPage & "]"
set o's dataLines to paragraphs of (do shell script ("curl [url=http://www.homedepot.com/buy/?page=]http://www.homedepot.com/buy/?page="[/url] & pageRange & " | grep -Eo 'Email Print.*|<a href=\"\\?page=[0-9]*\">(Next|Last)' | sed -e 's/Email Print //' -e 's| / [A-Z][A-Z]-.*||' -e 's/<a href=\"\\?page=\\([0-9]*\\)\".*/\\1/' -e \"s/'/''/g\" -e 's/\\([\"`]\\)/\\\\\\1/g'"))
-- Parse the lines, compare old and new prices, and prepare a batch of sqlite commands
set o's commandBatch to {}
set totalPages to parseBatch(o, startPage, newDB, head)
-- The parsing handler returns the "Last page" number it's found in o's dataLines.
if (not testing) then
set targetPage to totalPages
set spokenTarget to totalPages
end if
-- Execute the sqlite commands to update the database.
if (o's commandBatch is not {}) then
set end of o's commandBatch to tail
set AppleScript's text item delimiters to ""
do shell script (head & o's commandBatch)
end if
set startPage to endPage + 1
end repeat
-- Display any lowered prices in TextEdit.
if (not newDB) then showPriceDrops(o)
set AppleScript's text item delimiters to astid
beep 2
end main
-- Parse each line in the data and insert the details into a command template
on parseBatch(o, thePage, newDB, head)
set blank to ""
-- Flag indicating how to treat a suspected number-only line. true = "Next" page number; false = "Last" page number.
set NextPageNumber to true
considering case
repeat with i from 1 to (count o's dataLines)
set thisLine to item i of o's dataLines
if ((count thisLine) < 6) then -- Page number only?
try
if (NextPageNumber) then
set thePage to thisLine as integer
set NextPageNumber to false
else
set totalPages to thisLine as integer
set NextPageNumber to true
end if
end try
else if (thisLine contains "error has occurred") then
-- Skip it.
else
try
-- Everything before the (last) " Model # " in the line is the description.
set AppleScript's text item delimiters to " Model # "
set theDescription to text 1 thru text item -2 of thisLine
set otherStuff to text item -1 of thisLine
-- The price is whatever comes after the last " $" in whatever we have left.
set AppleScript's text item delimiters to " $"
set thePrice to (word 1 of text item -1 of otherStuff) as real -- 'as real' gets rid of any thousands commas on machines set up to recognise them.
set otherStuff to text 1 thru text item -2 of otherStuff
-- The model number comes before the first " Internet # " or " Store S" (if either's present) .
set AppleScript's text item delimiters to {" Internet # ", " Store S"} -- Requires Snow Leopard or later.
set theModel to text item 1 of otherStuff
if (newDB) then
-- Assume that checking for an existing record with this Model/Description would return a blank.
set thisRecord to blank
else
-- Read any existing record with this Model/Description from the database.
set thisRecord to (do shell script (head & "SELECT Date, Description, Model, Page, Price FROM depot WHERE Model = '" & theModel & "' AND Description = '" & theDescription & "'; \""))
end if
if ((count thisRecord) > 0) then
-- Details found. What was the previous price?
set oldPrice to (word -1 of thisRecord) as real
if (thePrice = oldPrice) then
-- If the price hasn't changed, do nothing.
else
-- Otherwise insert an update command into the command batch.
if (thePrice < oldPrice) then set end of o's notifications to thisRecord & (" (Today's price = $" & thePrice & ")")
set end of o's commandBatch to {"UPDATE depot SET Price = ", thePrice, ", Date = datetime() WHERE Model = '", theModel, "' AND Description = '", theDescription, "'; "}
end if
else
-- No equivalent record exists. Construct a command to insert one.
set o's activeTemplate to {"INSERT OR IGNORE INTO depot (Date, Description, Model, Internet, Store_SKU, Store_SO_SKU, Page, Price) VALUES (datetime(), '", theDescription, "', '", theModel, "', '", blank, "', '", blank, "', '", blank, "', ", thePage, ", ", thePrice, "); "}
-- Internet, Store SKU, and Store SO SKU numbers are blocks of digits and will be the first 'word' after their respective labels.
dealWith("Internet # ", 6, otherStuff, o)
dealWith("Store SKU # ", 8, otherStuff, o)
dealWith("Store SO SKU # ", 10, otherStuff, o)
-- Append the filled-in template to the list of commands.
set end of o's commandBatch to o's activeTemplate
end if
end try
end if
end repeat
end considering
return totalPages
end parseBatch
-- Deal with an Internet, Store SKU, or Store SO SKU entry, all of which require smilar code.
on dealWith(theKey, x, otherStuff, o)
if (otherStuff contains theKey) then
set AppleScript's text item delimiters to theKey
try -- Any problems and this entry's simply abandoned to the template default.
set theValue to word 1 of text item 2 of otherStuff
theValue as number
set item x of o's activeTemplate to theValue
end try
end if
end dealWith
-- Coerce the notifications list to text and display the result in TextEdit.
on showPriceDrops(o)
if ((count o's notifications) is 2) then set end of o's notifications to "No prices lower than at last check."
set AppleScript's text item delimiters to return & return
set notifications to o's notifications as text
tell application "TextEdit"
activate
make new document at front with properties {text:notifications}
end tell
end showPriceDrops
main()
Regex script:
[Scrapped]
Edit: Modifications to vanilla script: Put the line-parsing code in a ‘try’ block so that any which are too horrendously malformed are simply skipped. Reinstated the Page field, having worked out how to get the number for each individual page in a batch. The “Last” page number is got at the same time, eliminating the need for a separate download of page 1 for this purpose. Also added a spoken report of what batch the script’s doing and made the testing set-up more flexible.
Edit 2: (10:02:2012) Replaced the vanilla script with the version which successfully completed the course in post #38 below. It skips price update checking when initially populating the database and doesn’t crash when it encounters “`”. Scrapped the Regex version which was here as I won’t be doing a parallel revision of it.