Extracting different number values from HTML text

Hi everybody

I have a task here, and I just can’t figure out how to extract certain values from a huge HTML string that I fetch from a website through curl.

So here is my basic setup:

I use curl to sign in to the website with a PHP session cookie, then use that cookie to open the secured site and perform a search on it. I can’t show you the site due to business regulations, but you can think of it as a database system where you can search for a number in the format 1.123.123, and it then shows search results of assets with that number. If you click on the asset (or on one of the assets, if it finds several), you can then edit that asset. The problem is that the number I am searching for is not the number the tool uses to change asset options; it uses a unique “id” for the asset. Quick example:
Search URL: http://www.domain.com/something/search/index/query/1.123.123
This shows the search results, and once I click on the asset it loads the URL to change options:
Asset URL: http://www.domain.com/something/search/update/id/25756
So 25756 is the unique ID number, which I want to open directly, without having to search for the asset first. I need a script that signs in to the database through curl, opens the search URL (1.123.123), fetches the ID (25756) from the search results and then opens a link with that fetched ID (25756).

(25756 is just an example number, of course.)

I already have the part that runs the search on the website, and I have the result as HTML, so I need a script that looks through that ton of code and extracts only that “ID” value!

This is an example of the code fetched from the search results page:

"<table cellspacing="0" cellpadding="0"><thead><tr><td style="width: 90px;"><a href="/something/search/index/query/1.123.123/sort-search/%7B%22col%22%3A%22sap-number%22%2C%22dir%22%3A%22asc%22%7D" class="sort-desc" title="Sort ascending">SAP number</a></td><td style="width: 60px;"><a href="/something/search/index/query/1.123.123/sort-search/%7B%22col%22%3A%22week%22%2C%22dir%22%3A%22asc%22%7D" class="sort-desc" title="Sort ascending">Week</a></td><td style="width: 100px;"><a href="/something/search/index/query/1.123.123/sort-search/%7B%22col%22%3A%22category%22%2C%22dir%22%3A%22asc%22%7D" class="sort-desc" title="Sort ascending">Category</a></td><td style="width: 70px;"><a href="/something/search/index/query/1.123.123/sort-search/%7B%22col%22%3A%22hc%22%2C%22dir%22%3A%22asc%22%7D" class="sort-desc" title="Sort ascending">HC</a></td><td style="width: 269px;"><a href="/something/search/index/query/1.123.123/sort-search/%7B%22col%22%3A%22name%22%2C%22dir%22%3A%22asc%22%7D" class="sort-desc" title="Sort ascending">Name</a></td><td style="width: 90px;"><a href="/something/search/index/query/1.123.123/sort-search/%7B%22col%22%3A%22added-on%22%2C%22dir%22%3A%22desc%22%7D" class="sort-active sort-asc" title="Sort descending">Added on</a></td><td style="width: 150px;"><a href="/something/search/index/query/1.123.123/sort-search/%7B%22col%22%3A%22state%22%2C%22dir%22%3A%22asc%22%7D" class="sort-desc" title="Sort ascending">State</a></td><td style="width: 240px;"><div class="header-actions">View:  <a href="/something/search/index/query/1.123.123/view-search/list" title="Show list view"><img style="margin-right: 3px;" src="/something/public/images/icons/list.png" alt=""/></a><a href="/something/search/index/query/1.123.123/view-search/detailed" title="Show detailed view"><img style="margin-right: 3px;" src="/something/public/images/icons/detailed-inactive.png" alt=""/></a>    Export:  <a class="newWindow" href="/something/table/print/tableId/search/query/1.123.123" 
title="Print complete table"><img style="margin-right: 3px;" src="/something/public/images/icons/print.png" alt=""/></a><a href="/something/table/mail/tableId/search/query/1.123.123" title="Mail me complete table"><img style="margin-right: 3px;" src="/something/public/images/icons/mail.png" alt=""/></a></div>Actions</td></tr></thead><tbody><tr class="tr-odd"><td style="width: 90px;">1.123.123</td><td style="width: 60px;">28/2014</td><td style="width: 100px;">Some Category</td><td style="width: 70px;">59</td><td style="width: 269px;"><a href="/something/search/update/id/25756">Description</a><ul class="images images-icon"><li><a href="#"><img style="margin-right: 3px;" src="/something/public/images/icons/image.png" alt=""/><span class="preview"><img src="/something/image/image/id/25756/name/53a0083ac7103/mode/preview" alt="" /></span></a></li></ul><br style="clear: both" /><p>Description</p><div id="remarks-25756" class="remarks remarks-empty"><a href="javascript:;" title="Click to add">Add remarks</a><div class="loading"><div class="loading-background"></div><div class="loading-activity"></div></div></div></td><td style="width: 90px;">09.05.2014, 11:29</td><td style="width: 150px;"><span class="state state-6"><img style="margin-right: 3px;" src="/something/public/images/icons/state-6-dark.png" alt=""/>Status</span></td><td style="width: 240px;"><ul class="actions"><li><a href="javascript:;" onclick="App.util.confirmForm('deleteJob25756','Really delete job \'Description\'?');"><img style="margin-right: 3px;" src="/something/public/images/icons/delete.png" alt=""/>delete</a><form method="post" id="deleteJob25756" action="/something/search/delete"><div><input type="hidden" name="id" value="25756" /></div></form></li></ul></td></tr></tbody></table>"

You can see that the “ID” shows up a few times in the code, so it doesn’t actually matter where I extract it from. The most interesting part is this one:

<a href="/something/search/update/id/25756">Description</a>

because that is pretty much exactly the link I then want to open in Safari (but I don’t need the direct link; if I could just extract the ID, 25756 in this case, I can do all sorts of fancy stuff with that asset through AppleScript).

So the question is: how can I get AppleScript, or a “do shell script” call, to extract and isolate that ID number from the code above? And if there are multiple search results, there will be multiple IDs, and I would need to isolate the last one, at the bottom of the code, because the website sorts the newest one to the bottom.

Thank you for your answers and help!

Regards
Ultra

AppleScript is not the best HTML parser. Ruby has Nokogiri, which is written in C, so it is very fast and easy to use.
You might also want to look into Mechanize → https://github.com/sparklemotion/mechanize
It allows you to log in and navigate a site, pull down HTML, etc.

Passing the html you have posted above into this will return “25756”

You will have to install Nokogiri first, typically with: gem install nokogiri

If you are using your system Ruby (you are, unless you know you are using something else), the install usually needs sudo: sudo gem install nokogiri

Save this to a file “parse_id.rb”

Call it from AppleScript


set html_content to do shell script "curl http://your.site.com"
-- "quoted form of" keeps the shell from interpreting the quotes, < and > in the HTML
set theId to do shell script "ruby parse_id.rb " & quoted form of html_content

As an old-fashioned AppleScript user, I did the job with vanilla AppleScript.

# For testing, the grabbed data is stored in a text file
set theDatas to read file ((path to desktop as text) & "forsee.txt")

# Split the data on "/id/" and keep the second item,
# i.e. the text that follows the first occurrence of "/id/"
set temp to item 2 of my decoupe(theDatas, "/id/")
(*25756">Description</a><ul class="images images-icon"><li><a href="#"><img style="margin-right: 3px;" src="/something/public/images/icons/image.png" alt=""/><span class="preview"><img src="/something/image/image*)
# Split the temp string on the quote character and keep what precedes the first quote
item 1 of my decoupe(temp, quote) --"\"")
--> "25756"


#=====

on decoupe(t, d)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

#=====

It’s also possible to use egrep to get all instances of “/id/” & the number and sed to return just the number from the first instance:

-- Using this MacScripter topic page for testing:
do shell script "curl 'http://macscripter.net/viewtopic.php?id=42712' | egrep -o '/id/[0-9]+' | sed -n '1 s|/id/||p'"
--> "25756"
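Since the original question asks for the last ID when there are several results, here is a minimal sketch of the same pipeline with the first-match sed step swapped for tail. The function name last_asset_id is made up for illustration, and the pattern assumes every result link contains /search/update/id/ as in the sample HTML, which also keeps the image and remarks IDs out of the match.

```shell
# Hypothetical helper: reads HTML on stdin and prints the ID of the last
# /search/update/id/<digits> link (the site lists the newest result last).
last_asset_id() {
    grep -oE '/search/update/id/[0-9]+' | tail -n 1 | sed 's|.*/||'
}

# Two results on a single line, like the one-line table the site returns.
html='<a href="/something/search/update/id/25001">Old</a><a href="/something/search/update/id/25756">New</a>'
printf '%s\n' "$html" | last_asset_id
# prints: 25756
```

From AppleScript, the same three commands could presumably be appended to the curl call inside a single do shell script string, so that only the ID comes back.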

Thank you all for your great replies! I found a little feature in AppleScript called “offset”, which lets me extract the ID no matter where it is in the text, using the delete form of the website:



set thetext to "<table cellspacing=\"0ml;ssen, 3 Paar</p><div id=\"remarks-25756\" class=\"remarks remarks-empty\"><a href=\"javascript:;\" title=\"Click to add\">Add remarks</a><div class=\"loading\"><div class=\"loading-background\"></div><div class=\"loading-activilic/images/icons/delete.png\" alt=\"\"/>delete</a><form method=\"post\" id=\"deleteJob25756\" action=\"/random/search/delete\"><div><input type=\"hidden\" name=\"id\" value=\"25756\" /></div></form></li></ul></td></tr></tbody></table>"

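-- "offset of" returns the position of the first occurrence of "deleteJob"
set theoffset to offset of "deleteJob" in thetext
-- The ID starts right after the 9 characters of "deleteJob"
set cutPosition to theoffset + 9
-- Assumes the ID is exactly 5 digits long
set theaddoffset to theoffset + 13
return text cutPosition thru theaddoffset of thetext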


This works great so far. Now I need a way to make curl strip all the " characters from the HTML code, because the script doesn’t work with them unless you backslash them as in the example above. Does anyone know how to strip the " from the text?
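One note on the quotes: they only need backslashes when the HTML is pasted into the script as a literal string. Text that comes back from do shell script is already an ordinary AppleScript string and needs no escaping. If the quotes should be removed anyway, a small sketch using tr could look like this (strip_quotes is a made-up name for illustration):

```shell
# Hypothetical helper: copies stdin to stdout with every straight
# double quote removed.
strip_quotes() {
    tr -d '"'
}

printf '%s\n' '<form method="post" id="deleteJob25756">' | strip_quotes
# prints: <form method=post id=deleteJob25756>
```

In practice the tr step would sit right after curl in the same pipeline.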

Try saving the data in a text file and then reading it as I did.
Doing that, I had no problem with the quotes.
Embedded quotes are only annoying when we need to set a variable to a large block of literal text.

Here is a script that replaces straight quotes with curly ones, which don’t require backslashing.

# For testing, the grabbed data is stored in a text file
set theDatas to read file ((path to desktop as text) & "forsee.txt")
# Replace the straight quotes with curly ones, which don't need to be backslashed
set theDatas to my remplace(theDatas, quote, "”")

#=====
(*
replaces every occurrence of d1 with d2 in the text t
*)
on remplace(t, d1, d2)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d1}
	set l to text items of t
	set AppleScript's text item delimiters to d2
	set t to "" & l
	set AppleScript's text item delimiters to oTIDs
	return t
end remplace

#=====

Your sample data becomes:

“”

SAP number Week Category HC Name Added on State
View: Export:
Actions
1.123.123 28/2014 Some Category 59 Description

Description

09.05.2014, 11:29 Status
“”

Yvan KOENIG (VALLAURIS, France) Saturday, 21 June 2014 19:20:56

How fast is this approach? What I am doing is selecting 1-40 files in the Finder and running the script through Spark.app. It isolates the number (1.123.123) from the file name and then runs the script above in this thread. So my question is how long it would take, and how complex it would get, if I have to write the HTML to a file and read it back on every single run, because I use this script around 100-150 times a day.

Just so you know what the script does so far:


tell application "Finder" to set selectedItems to selection as list
repeat with item1 in selectedItems
	tell application "Finder" to set Thename to name of item1
	-- The first 9 characters of the file name are the search number, e.g. 1.123.123
	set thechars to text 1 thru 9 of Thename
	set the clipboard to thechars
	tell application "Safari"
		activate
		open location "http://www.mydomain.abc/random/search/index/query/" & thechars
	end tell
end repeat

So Safari opens 1-40 tabs (depending on how many files I have selected) and shows search results. Instead of having to click manually on a search result, I want it to go directly to the corresponding entry by using the isolated ID number. I have said that before, but I am summarizing it here again.

So I think I need an easier way to get rid of all the special characters on the fly in AppleScript…
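For what it’s worth, the per-file step might not need a temporary file or any quote handling at all if the extraction runs in the shell and only the finished URL comes back to AppleScript. Below is a rough sketch under stated assumptions: BASE_URL and both function names are placeholders, the search number is taken as the first 9 characters of the file name as in the script above, and the signed-in curl fetch with the session cookie is left out because its details are not shown in this thread.

```shell
# Placeholder base URL, mirroring the anonymized one used in the script above.
BASE_URL='http://www.mydomain.abc/random/search'

# First 9 characters of a file name, e.g. "1.123.123"
# (same rule as taking characters 1 thru 9 in the AppleScript).
query_from_filename() {
    printf '%s\n' "$1" | cut -c1-9
}

# Reads search-result HTML on stdin and prints the update URL of the
# last result; prints nothing when no result link is found.
update_url_from_html() {
    id=$(grep -oE '/search/update/id/[0-9]+' | tail -n 1 | sed 's|.*/||')
    [ -n "$id" ] && printf '%s/update/id/%s\n' "$BASE_URL" "$id"
}

query_from_filename '1.123.123_front.jpg'
# prints: 1.123.123
printf '%s\n' '<a href="/random/search/update/id/25756">Description</a>' | update_url_from_html
# prints: http://www.mydomain.abc/random/search/update/id/25756
```

The printed URL could then be handed to Safari’s open location in place of the search URL.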