Ideas to clean & make faster

Well. From what I’ve learned this morning from the sed tutorial, my shell script, which fetches and trims the data from the Web…

set editedText to (do shell script "curl http://www.homedepot.com/buy/?page=" & pageRange & " | grep -o 'Email Print .*' | sed -e s/'Email Print '/''/g -e s/\"'\"/\"''\"/g")

… could be rendered …

set editedText to (do shell script "curl http://www.homedepot.com/buy/?page=" & pageRange & " | grep -o 'Email Print .*' | sed -e 's/Email Print //' -e 's| / [A-Z][A-Z]-.*||' -e \"s/'/''/g\"")

… which also strips everything after the price in each line, leaving only the data actually required and saving having to get ‘text 1 thru text item -3’ of each line a bit further down. So far, my attempts to combine the stripping of the beginnings and ends of the lines in one operation have resulted in a v-e-r-y much slower parsing process. I’m sure that’s almost entirely due to my still limited knowledge of regular expressions.
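For anyone who wants to experiment with the trimming logic outside the shell, here is a small Python sketch of the same two substitutions plus the single-quote escaping (the sample line is invented to match the shape the grep stage produces; the patterns mirror the sed expressions above):

```python
import re

# Hypothetical line in the shape the grep stage produces.
line = "Email Print Some Widget Model # AB-123 Internet # 202645850 $59.19 / EA-Each Ships FREE"

line = re.sub(r"Email Print ", "", line)      # strip the leading label
line = re.sub(r" / [A-Z][A-Z]-.*", "", line)  # drop everything after the price
line = line.replace("'", "''")                # escape single quotes for sqlite3
print(line)  # -> Some Widget Model # AB-123 Internet # 202645850 $59.19
```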

Although this could not be applied to a page range, YQL can also be helpful in extracting text from pages.

For instance,
You can extract items from this page:
http://www.homedepot.com/webapp/wcs/stores/servlet/Bopis2OverLay?langId=-1&storeId=10051&catalogId=10053&R=202368271&storeSkuNum=682101&Overlay_Type=add_to_cart&locStoreNum=1248&pageNum=0&mode=localstore&basePage=PIP&quantity=1

by using this query:
http://y.ahoo.it/e+U7l

and then copying the REST QUERY into your script

Here’s my attempt:

set loc to "~/Documents/Databases/homedepot.db"
set head to "sqlite3" & space & loc & space & quote
set tail to quote
set newTable to "CREATE TABLE depot (id INTEGER PRIMARY KEY, Date DATETIME, Description TEXT, Model TEXT, Internet TEXT, Store_Sku TEXT, Store_SO_Sku TEXT, Old_price DOUBLE, New_price DOUBLE, page INTEGER); CREATE UNIQUE INDEX Model_Description on depot (Model, Description)"

tell application "Finder"
	if not (exists (path to documents folder as text) & "Databases:homedepot.db") then
		do shell script head & newTable & tail
	end if
end tell

-- Find total number of pages
set pageSource to (do shell script "curl http://www.homedepot.com/buy/?page=1")
set AppleScript's text item delimiters to {"<a href=\"?page="}
set totalPages to first word of (text item -1 of pageSource)

set batch to 10 -- How many pages will be downloaded at once
set firstpage to 1 -- First page to be downloaded

repeat with startPage from firstpage to totalPages by batch
	try
		if not totalPages - startPage is less than batch then
			set batchList to do shell script "curl http://www.homedepot.com/buy/?page=[" & startPage & "-" & startPage + batch - 1 & "]  | grep -o 'Email Print.*'"
		else
			set batchList to do shell script "curl http://www.homedepot.com/buy/?page=[" & startPage & "-" & totalPages & "]  | grep -o 'Email Print.*'"
		end if
		tell application "TextWrangler"
			set batchList to replace "^.*An error has occurred.*\\r" using "" searchingString batchList options {search mode:grep}
			set batchList to replace "'" using "''" searchingString batchList options {search mode:grep}
			set batchList to replace "\"" using "\\\\\"" searchingString batchList options {search mode:grep}
			set batchList to replace "^[^$\\r]*Email Print" using "" searchingString batchList options {search mode:grep}
			set batchList to replace "^[ ]*(.*) (Model # [^$\\r]*) \\$(([0-9]*),*([0-9.]*)).*" using "\\1	\\4\\5	\\2" searchingString batchList options {search mode:grep}
			-- description <tab> price <tab> "Model #".
			set batchList to replace "^(.*)	(.*)	Model # (.*) (Internet # .*)" using "\\1	\\2	\\3	\\4" searchingString batchList options {search mode:grep}
			set batchList to replace "^(.*)	(.*)	Model # (.*) (Store SKU.*)" using "\\1	\\2	\\3	\\4" searchingString batchList options {search mode:grep}
			set batchList to replace "^(.*)	(.*)	Model # (.*) (Store SO SKU.*)" using "\\1	\\2	\\3	\\4" searchingString batchList options {search mode:grep}
			set batchList to replace "^(.*)	(.*)	Model # (.*)" using "\\1	\\2	\\3	" searchingString batchList options {search mode:grep}
			-- description <tab> price <tab> Model <tab> .
			set batchList to replace "^(.*)	(.*)	(.*)	(Internet # ([0-9]{9}))*[ ]*(Store SKU # ([0-9]{6}))*[ ]*(Store SO SKU # ([0-9]{6}))*" using "\\1	\\2	\\3	\\5	\\7	\\9" searchingString batchList options {search mode:grep}
			-- description <tab> price <tab> Model <tab> Internet <tab> Store SKU <tab> Store SO SKU
			set batchList to replace "^(.*)	(.*)	(.*)	(.*)	(.*)	(.*)" using "UPDATE depot SET New_price = \\2, Date = datetime(), Page = " & startPage & " WHERE Model = '\\3' AND Description = '\\1'; INSERT OR IGNORE INTO depot (Description, New_price, Model, Internet, Store_SKU, Store_SO_SKU, Date, Page) VALUES ('\\1', \\2, '\\3', '\\4', '\\5', '\\6', datetime(), " & startPage & ");" searchingString batchList options {search mode:grep}
		end tell
		do shell script head & batchList & tail -- Enter the batch into the database
	on error
		log startPage & " error"
	end try
end repeat

The script uses grep and TextWrangler to extract the data; sed is too slow.
The database holds two prices, the new price and the old price. Before each run, the new price can be copied to the old price with the SQL statement “UPDATE depot SET Old_price = New_price, New_Price = NULL”. The SQL for finding the price drops is “SELECT * FROM depot WHERE new_price < old_price”.
I only tested 100 pages; my internet connection isn’t very fast, and downloading all the pages takes 5 hours.
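The two-price rotation and the price-drop query can be exercised against an in-memory database with Python’s built-in sqlite3 module (the columns are abbreviated from the CREATE TABLE statement in the script, and the sample rows are made up from items mentioned in this thread):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Columns abbreviated from the depot table in the script above.
db.execute("CREATE TABLE depot (Model TEXT, Description TEXT, Old_price DOUBLE, New_price DOUBLE)")
db.execute("INSERT INTO depot VALUES ('HD-397', 'Glomar 1-Light Pendant', NULL, 59.19)")
db.execute("INSERT INTO depot VALUES ('WS-MWB-3D', 'Rolling work bench', NULL, 299.0)")

# Before each run: copy the new price to the old price.
db.execute("UPDATE depot SET Old_price = New_price, New_price = NULL")

# A fresh run then refills New_price; simulate one price drop.
db.execute("UPDATE depot SET New_price = 49.99 WHERE Model = 'HD-397'")
db.execute("UPDATE depot SET New_price = 299.0 WHERE Model = 'WS-MWB-3D'")

# The price-drop query from the post.
drops = db.execute("SELECT Model, Old_price, New_price FROM depot WHERE New_price < Old_price").fetchall()
print(drops)  # -> [('HD-397', 59.19, 49.99)]
```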

Great! :slight_smile: Keeping just one record for each item, with fields for the two most recent prices, saves database bloat and makes subsequent use of the data easier. (That’s probably foolishly obvious to someone who handles databases regularly!) The script’s also a good education in the use of regular expressions. Thank you!

I didn’t realise TextWrangler could manipulate text without necessarily having it in one of its windows. But if you’re going to rely on third-party software to handle the regex, the Satimage OSAX is considerably faster and doesn’t require another application to be open. For comparison, the syntax is:

-- TextWrangler.
tell application "TextWrangler"
	set batchList to (replace "^.*An error has occurred.*\\r" using "" searchingString batchList options {search mode:grep})
	set batchList to (replace "'" using "''" searchingString batchList options {search mode:grep})
end tell

-- Satimage OSAX.
set batchList to (change "^.*An error has occurred.*\\r" into "" in batchList with regexp)
set batchList to (change "'" into "''" in batchList with regexp) -- or, in this case, just (change "'" into "''" in batchList)

A single ‘change’ command can batch-process several substitutions, which is slightly faster than doing them individually, but obviously not as easy to edit or read:

set batchList to (change {"^.*An error has occurred.*\\r", "'"} into {"", "''"} in batchList with regexp)
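For readers without the OSAX, the batched idea can be sketched in plain Python as a loop over (pattern, replacement) pairs applied in order, which matches how the examples above are chained (the sample text is invented):

```python
import re

# Patterns taken from the examples above; applied in order, like the batched 'change'.
pairs = [
    (r"^.*An error has occurred.*\r", ""),  # drop an error line
    (r"'", "''"),                           # escape single quotes for sqlite3
]

text = "An error has occurred\rGlomar 1-Light Pendant 8'"
for pattern, replacement in pairs:
    text = re.sub(pattern, replacement, text)
print(text)  # -> Glomar 1-Light Pendant 8''
```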

So a Satimage version might look something like this:

on main()
	set loc to quoted form of POSIX path of ((path to documents folder as text) & "Databases:Home Depot.db")
	set head to "sqlite3" & space & loc & space & quote
	set tail to quote
	
	set newTable to "CREATE TABLE IF NOT EXISTS depot (id INTEGER PRIMARY KEY, Date DATETIME, Description TEXT, Model TEXT, Internet TEXT, Store_Sku TEXT, Store_SO_Sku TEXT, Old_price DOUBLE, New_price DOUBLE, page INTEGER); CREATE UNIQUE INDEX IF NOT EXISTS Model_Description on depot (Model, Description); "
	do shell script (head & newTable & tail)
	
	-- Find total number of pages
	set pageSource to (do shell script "curl http://www.homedepot.com/buy/?page=1")
	set AppleScript's text item delimiters to {"<a href=\"?page="}
	set totalPages to (first word of text item -1 of pageSource) as integer
	-- set totalPages to 100
	
	set batch to 10 -- How many pages will be downloaded at once
	set firstpage to 1 -- First page to be downloaded
	
	repeat with startPage from firstpage to totalPages by batch
		try
			set endPage to startPage + batch - 1
			if (endPage > totalPages) then set endPage to totalPages
			
			set batchList to (do shell script ("curl http://www.homedepot.com/buy/?page=[" & startPage & "-" & endPage & "]  | grep -o 'Email Print.*'"))
			
			-- This command requires the Satimage OSAX.
			set batchList to (change {"^.*An error has occurred.*\\r", "'", "\"", "^[^$\\r]*Email Print", "^[ ]*(.*) (Model # [^$\\r]*) \\$(([0-9]*),*([0-9.]*)).*", "^(.*)	(.*)	Model # (.*) (Internet # .*)", "^(.*)	(.*)	Model # (.*) (Store SKU.*)", "^(.*)	(.*)	Model # (.*) (Store SO SKU.*)", "^(.*)	(.*)	Model # (.*)", "^(.*)	(.*)	(.*)	(Internet # ([0-9]{9}))*[ ]*(Store SKU # ([0-9]{6}))*[ ]*(Store SO SKU # ([0-9]{6}))*", "^(.*)	(.*)	(.*)	(.*)	(.*)	(.*)"} into {"", "''", "\\\\\"", "", "\\1	\\4\\5	\\2", "\\1	\\2	\\3	\\4", "\\1	\\2	\\3	\\4", "\\1	\\2	\\3	\\4", "\\1	\\2	\\3	", "\\1	\\2	\\3	\\5	\\7	\\9", "UPDATE depot SET New_price = \\2, Date = datetime(), Page = " & startPage & " WHERE Model = '\\3' AND Description = '\\1'; INSERT OR IGNORE INTO depot (Description, New_price, Model, Internet, Store_SKU, Store_SO_SKU, Date, Page) VALUES ('\\1', \\2, '\\3', '\\4', '\\5', '\\6', datetime(), " & startPage & ");"} in batchList with regexp)
			
			do shell script (head & batchList & tail) -- Enter the batch into the database
		on error
			log (startPage & " error")
		end try
	end repeat
end main

main()

Edit: Added “IF NOT EXISTS” to the second part of the ‘newTable’ shell script too. Deleted inadvertently duplicated string pair in the ‘change’ command.

I have learned a lot from this thread, thanks guys. Is there an advantage to using a 3rd party app for replacements? Is it faster, or are the scripts just easier to read & write?

I was experimenting. I expected sed to be faster than AppleScript, but it was 10 times slower. TextWrangler is faster than sed and a bit slower than AppleScript. I think replacing is easier to maintain than splitting on characters. After downloading all the pages I did some searching; here’s what I found:

There are articles without a price and without a $ in their line. If I look them up on the website there is a price. For example
GE Profile Built-In Tall Tub Dishwasher in Black (PDWT300VBB)
UL120-17 Terra Cotta Urn Interior Eggshell Gallon Paint (275301)

Some articles have a price of $0.00. If I look them up on the website there is a price. For example
DANCO Lavatory Sink Basin Hangers (2-Pack) (88735)
RIDGID 3-Amp JobMax Starter Kit SELECT FREE BLADE KIT IN STORE (R28600)

Many articles have two prices: a regular price and an action price, which are usually the same. There can also be some text like “Ships FREE with $45.00 Order”.
For example
Glomar 1-Light Pendant Alabaster Glass Bell Textured White Model # HD-397 Internet # 202645850 $59.19 / EA-Each $59.19 / EA-Each Ships FREE with $45.00 Order
Rolling work bench Model # WS-MWB-3D Internet # 203006472 $299.00 / BX-Box $199.99 / BX-Box This item does not qualify for free shipping
Merola Tile Padova 12-1/2 in. x 12-1/2 in. Almond Ceramic Floor and Wall Tile Model # FAZ12PA8 Internet # 202632869 Store SO SKU # 467127 $34.97 / CA-Case Ships FREE with $45.00 Order

Some articles have a line with Email after $. For example
E L Mustee Molded Faucet Block Model # 20.600 Internet # 202041461 $9.99
Panasonic Silver Mini 1.5-cup Mini Rice Cooker with Glass Lid Model # SR-3NAS Internet # 100672284 $49.99

I didn’t search for them but I accidentally found an article that is on two pages
Leviton 1-Gang White Switch Wallplate Model # R52-88001-00W
On page 2044 the price is $0.25, on page 4553 the price is $0.22, on the website the price is $0.27.

I like data mining.

And the Norcal 8-1/2 in. Terra-Cotta Fern Pot Model # 100043027 Store SKU # 208426 Store SO SKU # 10220 $3.98 only has five digits in its Store SO SKU.

That’s easily catered for by changing “{6}” (and possibly “{9}”) to “*” in the regex.
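A quick Python check of that quantifier change, using the Fern Pot line quoted above: {6} demands exactly six digits, so the five-digit SKU is missed, while * accepts any run of digits.

```python
import re

line = "Norcal 8-1/2 in. Terra-Cotta Fern Pot Model # 100043027 Store SKU # 208426 Store SO SKU # 10220 $3.98"

# {6} demands exactly six digits, so the five-digit SKU is missed.
print(re.search(r"Store SO SKU # ([0-9]{6})", line))             # -> None

# * accepts any run of digits, including five.
print(re.search(r"Store SO SKU # ([0-9]*)", line).group(1))      # -> 10220
```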

The script doesn’t update Old_price! It would probably be useful to have Old_ and New_ dates too.

Or of course just keep the latest date and price and generate output when the new price is less than the one it’s replacing.

OK. Using bits of everyone’s contributions so far, here are a couple of efforts which only keep the latest details in the database. They read all the existing Date, Description, Model, and Price details from the database beforehand and do running checks against these while they’re updating the entries. Any price drops found are saved up and displayed in TextEdit at the end. I’ve left out the “Page” entry in each record, as it was only recording the number of the first page in each batch, not the number of each page. (Edit: Now reinstated, with individual page numbers, in the vanilla script.) The ‘testing’ boolean at the top of the init() handler allows just a subset of the pages to be downloaded for testing purposes.

The first script is entirely vanilla because I know how to do that. The second requires the Satimage OSAX and was put together around Chocoholic’s regex code to achieve the same end. I think they’re OK, but would be grateful to hear of any problems.

Vanilla script:

(* Vanilla version. (Requires Mac OS 10.6 or later and a machine set up to recognise full stops as decimal points and commas as thousands separators.) Get the relevant lines from the current page batch as individual lines. With each in turn, insert the required details into an sqlite3 command template and append the template to a list of such commands. Locate the equivalent details in a previously extracted dump of the database's current contents and compare the prices. If the new price is less than the old, add an entry to a "notifications" list. When all lines have been processed, coerce the command-batch list to text with a suitable sqlite3 head and tail and execute as a shell script to update the database. When all the batches have been processed, coerce the notifications list to text and display it in TextEdit. *)

on init() -- Initial set-up.
	set testing to false -- Limits the number of pages fetched. Set to false for normal use.
	set startPage to 1 -- Only heeded if testing is true, otherwise the script starts at page 1.
	set endPage to 100 -- Only heeded if testing is true, otherwise the script continues to the last page. (Hopefully!)
	
	-- How many pages will be downloaded at once? (Not too many. There's a limit to the length of text that can be used in a shell script. )
	set batchSize to 50
	
	script o -- For speed of access to items in these lists.
		property dataLines : missing value
		property activeTemplate : missing value
		property commandBatch : missing value
		property notifications : {"Home Depot Price Reductions Noted on " & date string of (current date), "Last updated | Description | Model number | Original Page Number | Previous price"}
	end script
	
	set dbPath to (path to documents folder as text) & "Databases:Home Depot.db"
	set head to "sqlite3 " & quoted form of POSIX path of dbPath & " \""
	set tail to "\""
	try
		dbPath as alias
		set newDB to false
	on error number -43
		do shell script (head & "CREATE TABLE depot (id INTEGER PRIMARY KEY, Date DATETIME, Description TEXT, Model TEXT, Internet TEXT, Store_SKU TEXT, Store_SO_SKU TEXT, Page INTEGER, Price DOUBLE); CREATE UNIQUE INDEX Model_Description on depot (Model, Description); \"")
		set newDB to true
	end try
	
	if (not testing) then
		set startPage to 1
		set endPage to batchSize
	end if
	
	return {testing, startPage, endPage, batchSize, o, head, tail, newDB}
end init

on main()
	set {testing, startPage, targetPage, batchSize, o, head, tail, newDB} to init()
	
	if (testing) then
		set spokenTarget to targetPage
	else
		set spokenTarget to "unknown number"
	end if
	
	set astid to AppleScript's text item delimiters
	
	repeat until (startPage > targetPage)
		set endPage to startPage + batchSize - 1
		if (endPage comes after targetPage) then set endPage to targetPage
		
		say ("Pages " & startPage & (" to " & endPage) & (" of " & spokenTarget))
		
		-- Get the next range of pages as a list of the relevant parts of the relevant lines. The relevant lines contain "Email Print" or are the "Next" and "Last" page links. The relevant parts are everything after "Email Print " up to and including the price, or the next or last page number. Single and double quotes (', `, ") are appropriately escaped for insertion into sqlite3 shell script strings.
		set pageRange to "[" & startPage & "-" & endPage & "]"
		set o's dataLines to paragraphs of (do shell script ("curl http://www.homedepot.com/buy/?page=" & pageRange & " | grep -Eo 'Email Print.*|<a href=\"\\?page=[0-9]*\">(Next|Last)' | sed -e 's/Email Print //' -e 's| / [A-Z][A-Z]-.*||' -e 's/<a href=\"\\?page=\\([0-9]*\\)\".*/\\1/' -e \"s/'/''/g\" -e 's/\\([\"`]\\)/\\\\\\1/g'"))
		
		-- Parse the lines, compare old and new prices, and prepare a batch of sqlite commands
		set o's commandBatch to {}
		set totalPages to parseBatch(o, startPage, newDB, head)
		
		-- The parsing handler returns the "Last page" number it's found in o's dataLines.
		if (not testing) then
			set targetPage to totalPages
			set spokenTarget to totalPages
		end if
		
		-- Execute the sqlite commands to update the database.
		if (o's commandBatch is not {}) then
			set end of o's commandBatch to tail
			set AppleScript's text item delimiters to ""
			do shell script (head & o's commandBatch)
		end if
		
		set startPage to endPage + 1
	end repeat
	
	-- Display any lowered prices in TextEdit.
	if (not newDB) then showPriceDrops(o)
	
	set AppleScript's text item delimiters to astid
	beep 2
end main

-- Parse each line in the data and insert the details into a command template 
on parseBatch(o, thePage, newDB, head)
	set blank to ""
	
	-- Flag indicating how to treat a suspected number-only line. true = "Next" page number; false = "Last" page number.
	set NextPageNumber to true
	
	considering case
		repeat with i from 1 to (count o's dataLines)
			set thisLine to item i of o's dataLines
			if ((count thisLine) < 6) then -- Page number only?
				try
					if (NextPageNumber) then
						set thePage to thisLine as integer
						set NextPageNumber to false
					else
						set totalPages to thisLine as integer
						set NextPageNumber to true
					end if
				end try
			else if (thisLine contains "error has occurred") then
				-- Skip it.
			else
				try
					-- Everything before the (last) " Model # " in the line is the description.
					set AppleScript's text item delimiters to " Model # "
					set theDescription to text 1 thru text item -2 of thisLine
					set otherStuff to text item -1 of thisLine
					
					-- The price is whatever comes after the last " $" in whatever we have left.
					set AppleScript's text item delimiters to " $"
					set thePrice to (word 1 of text item -1 of otherStuff) as real -- 'as real' gets rid of any thousands commas on machines set up to recognise them.
					set otherStuff to text 1 thru text item -2 of otherStuff
					
					-- The model number comes before the first " Internet # " or " Store S" (if either's present).
					set AppleScript's text item delimiters to {" Internet # ", " Store S"} -- Requires Snow Leopard or later.
					set theModel to text item 1 of otherStuff
					
					if (newDB) then
						-- Assume that checking for an existing record with this Model/Description would return a blank.
						set thisRecord to blank
					else
						-- Read any existing record with this Model/Description from the database.
						set thisRecord to (do shell script (head & "SELECT Date, Description, Model, Page, Price FROM depot WHERE Model = '" & theModel & "' AND Description = '" & theDescription & "'; \""))
					end if
					if ((count thisRecord) > 0) then
						-- Details found. What was the previous price?
						set oldPrice to (word -1 of thisRecord) as real
						
						if (thePrice = oldPrice) then
							-- If the price hasn't changed, do nothing.
						else
							-- Otherwise insert an update command into the command batch.
							if (thePrice < oldPrice) then set end of o's notifications to thisRecord & (" (Today's price = $" & thePrice & ")")
							set end of o's commandBatch to {"UPDATE depot SET Price = ", thePrice, ", Date = datetime() WHERE Model = '", theModel, "' AND Description = '", theDescription, "'; "}
						end if
					else
						-- No equivalent record exists. Construct a command to insert one.
						set o's activeTemplate to {"INSERT OR IGNORE INTO depot (Date, Description, Model, Internet, Store_SKU, Store_SO_SKU, Page, Price) VALUES (datetime(), '", theDescription, "', '", theModel, "', '", blank, "', '", blank, "', '", blank, "', ", thePage, ", ", thePrice, "); "}
						
						-- Internet, Store SKU, and Store SO SKU numbers are blocks of digits and will be the first 'word' after their respective labels.
						dealWith("Internet # ", 6, otherStuff, o)
						dealWith("Store SKU # ", 8, otherStuff, o)
						dealWith("Store SO SKU # ", 10, otherStuff, o)
						
						-- Append the filled-in template to the list of commands.
						set end of o's commandBatch to o's activeTemplate
					end if
				end try
			end if
		end repeat
	end considering
	
	return totalPages
end parseBatch

-- Deal with an Internet, Store SKU, or Store SO SKU entry, all of which require similar code.
on dealWith(theKey, x, otherStuff, o)
	if (otherStuff contains theKey) then
		set AppleScript's text item delimiters to theKey
		try -- Any problems and this entry's simply abandoned to the template default.
			set theValue to word 1 of text item 2 of otherStuff
			theValue as number
			set item x of o's activeTemplate to theValue
		end try
	end if
end dealWith

-- Coerce the notifications list to text and display the result in TextEdit.
on showPriceDrops(o)
	if ((count o's notifications) is 2) then set end of o's notifications to "No prices lower than at last check."
	
	set AppleScript's text item delimiters to return & return
	set notifications to o's notifications as text
	
	tell application "TextEdit"
		activate
		make new document at front with properties {text:notifications}
	end tell
end showPriceDrops

main()
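The delimiter-based parse in parseBatch() translates almost line for line into Python string splitting, which may help readers follow it (the sample line is invented to match the format described earlier in the thread):

```python
import re

# Hypothetical line in the shape produced by the curl/grep/sed stage.
line = "Merola Tile Padova Floor Tile Model # FAZ12PA8 Internet # 202632869 Store SO SKU # 467127 $34.97"

# Everything before the last " Model # " is the description.
description, other_stuff = line.rsplit(" Model # ", 1)

# The price is whatever follows the last " $"; strip any thousands commas.
other_stuff, price_text = other_stuff.rsplit(" $", 1)
price = float(price_text.split()[0].replace(",", ""))

# The model number comes before the first " Internet # " or " Store S".
model = re.split(r" Internet # | Store S", other_stuff)[0]

print((description, model, price))
# -> ('Merola Tile Padova Floor Tile', 'FAZ12PA8', 34.97)
```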

Regex script:

[Scrapped]

Edit: Modifications to vanilla script: Put the line-parsing code in a ‘try’ block so that any which are too horrendously malformed are simply skipped. Reinstated the Page field, having worked out how to get the number for each individual page in a batch. The “Last” page number is got at the same time, eliminating the need for a separate download of page 1 for this purpose. Also added a spoken report of what batch the script’s doing and made the testing set-up more flexible.
Edit 2: (10:02:2012) Replaced the vanilla script with the version which successfully completed the course in post #38 below. It skips price update checking when initially populating the database and doesn’t crash when it encounters “`”. Scrapped the Regex version which was here as I won’t be doing a parallel revision of it.

Took me a while to get through that. You are the man.

I’ve made a few modifications to the vanilla script a couple of posts up in the light of my experience trying to run it to completion last night. The line-parsing code is now in a ‘try’ block so that the worst-formed lines are simply skipped. The Page field is reinstated, as I’ve worked out how to get the individual page numbers. The “Last” page number is got in the same way and, since we don’t actually need to know it before treating the first batch (unless we’re only attempting a batch near the end), page 1 doesn’t need to be downloaded separately beforehand. There’s also a spoken commentary on the script’s progress and the test-run parameters are more flexible. I haven’t given it a full run yet, but am just about to do so.

I haven’t worked out how to effect these modifications in the regex script yet.

I know it’s not a competition but after a slow sed I was curious which solution is faster and I ran some test:

  1. run the script without an existing database
  2. run the script again
  3. change all dates to two days ago, change the price of articles whose model # starts with X and run the script again

Total number of pages is 1000, batch is 10, 1 article whose model # starts with X, the pages are not downloaded but read from files.

Vanilla script

  1. 50 sec. 13738 articles
  2. 552 sec. 2 price changes
  3. 554 sec. 3 price changes

Regex script

  1. 54 sec. 17008 articles
  2. 423 sec.
  3. 423 sec. no price changes

My script

  1. 53 sec. 17008 articles
  2. 45 sec.
  3. 49 sec. 1 price change

I ran the same test with the Vanilla script with 500 and 750 pages.

  1. 500 pages: 25 sec., 750 pages: 36 sec., 1000 pages: 50 sec.
  2. 500 pages: 148 sec., 750 pages: 277 sec., 1000 pages: 552 sec.
    All the pages would take a while; my rough estimate is 38 hours, so I’m not going to test it.

My script with all 16635 pages:

  1. 1695 sec. 290321 articles
  2. 1002 sec.
  3. 969 sec. 358 price changes

I’m still working on my script. In the future there will be lines in a new format. Lines that can’t be parsed are skipped and should be logged. This will make the script a bit slower.

:lol:

Yesterday afternoon, I attempted to run the vanilla script (the version in post #31) with all the pages on the site, starting with a fresh database. After three hours, it had completed less than 5000 pages.

Then I tried another variation which only put the appropriate INSERT or UPDATE commands into the sqlite3 chain instead of both every time. This would hopefully reduce execution time and allow larger batches. I ran it for three hours last night with a fresh database and it cleared 7000 pages. I don’t know if this was entirely due to faster code or perhaps to better Internet conditions.

When I ran it again this morning with the database it created last night, it took forever to get through just one batch! This appears to be because it created a much larger database dump when it started, which the vanilla code took correspondingly longer to parse. It’s actually faster to query the database with a shell script every time at this stage.

Personally, I’m not sure there is a sensible practical solution for adayzdone’s purposes. As you said earlier, there’s a huge number of pages to monitor and quite a lot of junk in them about which special decisions have to be made. I have however been learning a lot fooling around with this. :slight_smile:

I look forward to seeing your script when it’s ready.

Post #14 populated the database in 3 hrs 10 min. I think any improvements may have to come from better database design, maybe multiple tables, and then the right query after new data has been populated. I am going to do some reading and then come back to it. I have learned a lot from the different approaches posted here. Thanks.

It is a bit unpolished, but here is my script; it requires the Satimage OSAX:

set loc to "~/Documents/Databases/homedepot.db"
set head to "sqlite3" & space & loc & space & quote
set tail to quote
set newTable to "CREATE TABLE depot (id INTEGER PRIMARY KEY, Date DATETIME, Description TEXT, Model TEXT, Internet TEXT, Store_Sku TEXT, Store_SO_Sku TEXT, Old_price DOUBLE, New_price DOUBLE); CREATE UNIQUE INDEX Model_Description on depot (Model, Description)"

tell application "Finder"
	if not (exists (path to documents folder as text) & "Databases:homedepot.db") then
		do shell script head & newTable & tail
	end if
end tell

-- Find total number of pages
set pageSource to (do shell script "curl http://www.homedepot.com/buy/?page=1")
set AppleScript's text item delimiters to {"<a href=\"?page="}
set totalPages to first word of (text item -1 of pageSource)
set totalPages to 50 -- Testing: limit the number of pages

set batch to 10 -- How many pages will be downloaded at once
set firstpage to 1 -- First page to be downloaded

-- Move new price to old price if the new price is older than 1 day
do shell script head & "UPDATE depot SET Old_price = New_price, New_Price = NULL WHERE date < datetime('now', '-1 day')" & tail

set findList to {}
set replacementList to {}

-- Remove lines with errors
copy ".*An error has occurred.*" to the end of findList
copy "" to the end of replacementList
copy ".*Not available in your currently localized store.*" to the end of findList
copy "" to the end of replacementList

-- Remove lines without a $
copy "^[^$]*$" to the end of findList
copy "" to the end of replacementList

-- Replacements for do shell script
copy "'" to the end of findList
copy "''" to the end of replacementList
copy "`" to the end of findList
copy "\\\\`" to the end of replacementList
copy "\"" to the end of findList
copy "\\\\\"" to the end of replacementList

-- Remove . Email Print at the beginning of the line
copy "^[^$\\r]*Email Print *" to the end of findList
copy "" to the end of replacementList

-- Description and Model
copy "^(.*) Model # *([^\\t\\r]*) (Internet {0,1}#)" to the end of findList
copy "\\1\\t\\2\\t\\3" to the end of replacementList
copy "^(.*) Model # *([^\\t\\r]*) (Store SKU {0,1}#)" to the end of findList
copy "\\1\\t\\2\\t\\3" to the end of replacementList
copy "^(.*) Model # *([^\\t\\r]*) (Store SO SKU {0,1}#)" to the end of findList
copy "\\1\\t\\2\\t\\3" to the end of replacementList
copy "^(.*) Model # *([^\\t\\r]*?) (\\$[0-9])" to the end of findList
copy "\\1\\t\\2\\t\\3" to the end of replacementList
-- description <tab> model <tab> .

-- Internet, Store SKU and Store SO SKU
copy "Store SKU # Store SO SKU" to the end of findList
copy "Store SO SKU" to the end of replacementList
copy "Store SKU # \\$" to the end of findList
copy "$" to the end of replacementList
copy "^(.*?\\t.*?)\\t(Internet {0,1}# (.+?) )*(Store SKU {0,1}# (.*?) )*(Store SO SKU {0,1}# (.+?) )*(Reviews )*" to the end of findList
copy "\\1\\t\\3\\t\\5\\t\\7\\t" to the end of replacementList
-- description <tab> model <tab> Internet <tab> Store SKU <tab> Store SO SKU <tab> $.

-- Prices
copy "^(.*)\\t\\$([0-9]*),*([0-9.]*) [^$\\r]*?\\$([0-9]*),*([0-9.]*).*" to the end of findList
copy "\\1\\t\\2\\3\\t\\4\\5" to the end of replacementList
copy "^(.*)\\t\\$(([0-9]*),)*([0-9.]*).*" to the end of findList
copy "\\1\\t\\3\\4\\t\\3\\4" to the end of replacementList
-- description <tab> model <tab>  Internet <tab> Store SKU <tab> Store SO SKU <tab> regular price <tab> action price

-- SQL
copy "^(.*)\\t(.*)\\t(.*)\\t(.*)\\t(.*)\\t(.*)\\t(.*)$" to the end of findList
copy "UPDATE depot SET New_price = \\7, Date = datetime() WHERE Model = '\\2' AND Description = '\\1'; INSERT OR IGNORE INTO depot (Description, Model, Internet, Store_SKU, Store_SO_SKU, New_price, Date) VALUES ('\\1', '\\2', '\\3', '\\4', '\\5', \\7, datetime());" to the end of replacementList

set unparsedList to ""
repeat with startPage from firstpage to totalPages by batch
	try
		set endPage to startPage + batch - 1
		if endPage > totalPages then set endPage to totalPages
		try
			set batchList to missing value
			set batchList to (do shell script ("curl http://www.homedepot.com/buy/?page=" & "[" & startPage & "-" & endPage & "]" & " | grep -o 'Email Print.*'"))
		end try
		if batchList ≠ missing value then
			set batchList to change findList into replacementList in batchList with regexp
			if (batchList contains tab) then
				set unparsedList to unparsedList & (matchResult of (find text "^.*\\t.*$" in batchList with regexp)) & return
				set batchList to matchResult of (find text "^UPDATE.*" in batchList with regexp)
			end if
			do shell script head & batchList & tail -- Enter the batch into the database
		end if
	on error error_message number error_number
		if error_number = -128 then
			error error_message number error_number
		end if
		set unparsedList to unparsedList & startPage & (" Error: " & error_number & ". " & error_message) as text
	end try
end repeat
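The batch windowing in the repeat loop, and curl's own `[N-M]` URL range globbing, can be sketched in plain shell; the page counts here are arbitrary and the curl line is left as a comment so the sketch runs offline:

```shell
firstpage=1; totalPages=23; batch=10
startPage=$firstpage
out=""
while [ "$startPage" -le "$totalPages" ]; do
	endPage=$((startPage + batch - 1))
	[ "$endPage" -gt "$totalPages" ] && endPage=$totalPages
	# In the real script, curl expands the range itself in one invocation:
	#   curl "http://www.homedepot.com/buy/?page=[$startPage-$endPage]" | grep -o 'Email Print.*'
	out="${out}${startPage}-${endPage} "
	startPage=$((startPage + batch))
done
echo "$out"
```

Letting curl expand the range is what makes the batching pay off: one process and one connection-reuse opportunity per batch instead of per page.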

-- find dropped prices
set prices to "Prices on " & (current date) & return & return & (do shell script "echo '.mode tabs \\n SELECT old_price, new_price, model, description FROM depot WHERE new_price < old_price;' | sqlite3" & space & loc)
tell application "TextEdit"
	activate
	if unparsedList ≠ "" then
		make new document at front with properties {text:"Unparsed" & return & return & unparsedList}
	end if
	make new document at front with properties {text:prices}
end tell
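The price-drop report works by piping dot-commands and SQL to sqlite3 on stdin, just as the `do shell script` line does. The same thing from a terminal, with made-up sample rows in a throwaway database:

```shell
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE depot (Model TEXT, Description TEXT, Old_price DOUBLE, New_price DOUBLE);
INSERT INTO depot VALUES ('A1', 'Hammer', 10.0, 8.5);
INSERT INTO depot VALUES ('B2', 'Saw', 20.0, 20.0);"
# .mode tabs must arrive on its own line before the SQL
out=$(printf '.mode tabs\nSELECT old_price, new_price, model, description FROM depot WHERE new_price < old_price;\n' | sqlite3 "$db")
echo "$out"
rm -f "$db"
```

Only the hammer appears, because its new price undercuts the old one.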

After a bit of practice you can do a lot with one SQL statement. Examples used for testing:
Change all dates to 2 days ago: UPDATE depot SET date = datetime('now', '-2 day')
Raise the price of articles whose model # starts with X: UPDATE depot SET new_price = (new_price + 1) WHERE model LIKE 'X%'
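Both test statements can be tried safely against a scratch database (the models and prices below are invented):

```shell
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE depot (model TEXT, new_price DOUBLE, date DATETIME);
INSERT INTO depot VALUES ('X100', 5.0, datetime());
INSERT INTO depot VALUES ('Y200', 5.0, datetime());"
# Back-date every record two days
sqlite3 "$db" "UPDATE depot SET date = datetime('now', '-2 day');"
# Raise only the models starting with X
sqlite3 "$db" "UPDATE depot SET new_price = new_price + 1 WHERE model LIKE 'X%';"
out=$(sqlite3 "$db" "SELECT model, new_price FROM depot ORDER BY model;")
echo "$out"
rm -f "$db"
```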

Me too, and I’m still amazed by the variety of items.

Edit: script escapes ` and removed log

Well that’s certainly not possible with my broadband connection! :wink: But in any case, post #31 (vanilla) also checks the previous prices while it’s working and (if I’ve understood Chocoholic’s sqlite3 code correctly) ensures that only one entry appears in the database for each item. And it spends at least 38 minutes just telling you what it’s doing! :lol:

I tried out yet another version of it last night which crashed after four hours while handling page batch 8551-8600. The cause turned out to be a “`” (character id 96) on page 8584 which was upsetting the quote matching in the sqlite3 command. The character’s obviously not meant to be there, but I’ve patched the script to escape it in case there are instances elsewhere where it is meant.

After nursemaiding the script through the problem batch this morning, I left it to continue by itself this afternoon and it finished a few seconds under four hours from the point at which it resumed. So just over 8 hours in all. It’ll take longer on subsequent runs.

As Chocoholic pointed out earlier, some of the items appear to be duplicated, so it’ll be necessary to decide whether to keep all or just one of these, and if just one, which one. In either case, the updating code will have to ensure that the right update is matched to the right entry.

I’ve a busy day tomorrow, but I hope to be able to study Chocoholic’s new script at some point.

I came across the same thing … from #14. The lack of consistency across the data is amazing.

if theDescription contains "`" then
	set AppleScript's text item delimiters to {"`"}
	set these_items to the text items of theDescription
	set AppleScript's text item delimiters to ""
	set theDescription to these_items as string
end if
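The same clean-up could instead be folded into the shell stage: double the single quotes for sqlite3 and delete stray backticks in one sed pass (a sketch with an invented sample string, not the posted pipeline):

```shell
# Double ' for SQL string literals, then strip backticks entirely
out=$(echo "O'Brien's \`special\` blade" | sed -e "s/'/''/g" -e 's/`//g')
echo "$out"
```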

Sorry. I missed that. It looks like a “'” on casual inspection. :rolleyes:

I haven’t been able to study the minutiae of Chocoholic’s script today, but I did give it a couple of comparative test runs against the latest version of my vanilla script (now the only script in post #31). I sandwiched them in some timing code and set them up to start new databases and populate them with 400 pages’ worth of data.

On their initial runs, they both finished within a few microseconds of 10 minutes and 3 seconds.

On an immediately following update run, the vanilla one predictably took several seconds longer. Chocoholic’s script actually took less time on the update run. (I’m afraid I didn’t keep these times.)

I ran them both again earlier this evening and they both took longer, no doubt because of the Net traffic at that time of day. The vanilla took 12:07 and reported two price drops; Chocoholic’s took 15:02 and didn’t report any drops. I don’t know which was right. :confused:

Ah. Both are probably right. :slight_smile: Chocoholic’s script doesn’t update any records modified less than twenty-four hours previously, so it’s not due to notice any changes for a couple of hours yet (as I write). I’ll try it again later.

I like the fact that sqlite3 can bulk-transfer values from one set of fields to another and can bulk-compare them to pick out any changes. That’s got to be better than the script checking records one-by-one.

An interesting read. :slight_smile:

No. It didn’t notice any price drops this time either, unless the prices have gone up again!

I think the problem this time is that while “new” prices are only transferred to the Old_price fields when they’re over a day old, they’re overwritten anyway by the data from the Web pages. So if the script’s run less than a day after the previous time (as when I ran it last night) the old “new” prices are lost.
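If I’ve read the problem right, one possible fix is to make the rotation a guarded UPDATE that runs *before* the web data is applied, so a same-day re-run can’t clobber yesterday’s price. A sketch with a cut-down schema ('fresh' and 'stale' are made-up rows, not real data):

```shell
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE depot (model TEXT, Old_price DOUBLE, New_price DOUBLE, Date DATETIME);
INSERT INTO depot VALUES ('fresh', 10.0, 8.0, datetime());
INSERT INTO depot VALUES ('stale', 10.0, 8.0, datetime('now', '-2 day'));"
# Rotate only rows last touched more than a day ago
sqlite3 "$db" "UPDATE depot SET Old_price = New_price WHERE Date <= datetime('now', '-1 day');"
out=$(sqlite3 "$db" "SELECT model, Old_price FROM depot ORDER BY model;")
echo "$out"
rm -f "$db"
```

Only the stale row is rotated; the fresh one keeps its Old_price until a full day has passed.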