Parsing HTML from a Web Site

Nova Scotians are intensely interested in the weather, of which we have an abundance. “If you don’t like the weather, wait 5 minutes” is our mantra. This extreme variability is caused by the jet stream, which meanders over our heads on its way out over the Atlantic Ocean from North America. When it is south of us, we freeze in Arctic air; when it is north of us, the weather moderates. Further, the jet stream drags the weather from much of North America to us, so we experience what others have been getting a few days after they do.

Wanting a quick summary of American weather, I wrote a script to present the conditions and forecasts for a number of cities in the USA. Once it was running, it occurred to me that it was a good exercise in parsing data from the source text of a web site: an educational subject for a tutorial here, with much broader implications than just grabbing the weather. Many of us return regularly to web sites to follow the market, get exchange rates, check the weather, check on flights, check on deals, etc., so this example is just a framework for doing that with an AppleScript.

The first step in choosing a site for yourself, however, is to make sure that all the information you want is available in the source text of the site. In the browser of your choice, view the source, and in that window search for some key words that you see on the page near what you want to know. Make sure that the data you want is actually visible in that text, and is not fetched separately after the page loads (for example, by a script on the page that queries a database behind the scenes). For this example, there was another requirement: I wanted a site from which I could get the weather in a lot of cities, always in the same format. I discovered that the Boston Globe is such a site and, even better, weather.boston.com had a Weather Finder section in which a simple code revealed the weather in all the cities I cared about. What’s more, it was “plain vanilla” CSS and HTML. Current conditions for all cities are in the same format as shown in Figure 1 for Boston, MA.

Figure 1: Typical Current Conditions Report at weather.boston.com

Searching in the source text for that page, I found the section shown in Figure 2. Notice that the line for the div id “currentCond” would be a good place to start looking for the results, and a quick search (using your browser’s Find) will reveal that it’s unique. Further note that each of the current-conditions figures we want always begins with “wNumbers”>, and if you check other locations (cities) you’ll see that the starting point and the order of the data are constant.

Figure 2: HTML corresponding to Current Conditions.
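The uniqueness check can also be scripted rather than done by eye. The sketch below counts how often a marker occurs by splitting the text on it, using the text item delimiters that are explained further on. The handler name countOccurrences and the sample fragment are made up for illustration and are not taken from the real page.

```applescript
-- Count how many times a marker occurs in some text.
-- Splitting on the marker yields one more piece than there are markers.
on countOccurrences(marker, sourceText)
	set savedDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to marker
	set pieceCount to count of text items of sourceText
	set AppleScript's text item delimiters to savedDelims -- always put them back
	return pieceCount - 1
end countOccurrences

-- Hypothetical fragment standing in for the downloaded page source.
set sampleSource to "<div id=\"header\"></div><div id=\"currentCond\"><span class=\"wNumbers\">41</span><span class=\"wNumbers\">12</span></div>"

countOccurrences("div id=\"currentCond\"", sampleSource) -- 1, so this marker is unique in the sample
```

A count of 1 means the marker is safe to use as a single cut point; a larger count, as for “wNumbers” here, means the marker repeats and the pieces have to be taken in order.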

Further down the page for any city’s weather, a simple Five-Day Forecast is shown, starting with the forecast for “today”, and using both images and words to describe the day’s conditions and the temperature highs and lows. Figure 3 below is a sample.

Figure 3: Typical 5-Day forecast section at weather.boston.com.

With a little more poking around in the HTML for the site, we find two relevant sections: one that lists the forecast conditions (the text that appears just below the images for each day), and one that lists the temperature hi/lo values for each day. Figures 4 and 5 show them. The first set, in Figure 4, occurs shortly after a line in the code that reads class="forecast"> (the word forecast is quoted in the source).

Figure 4: Typical Code for 5-Day Conditions (under weather icons).

Figure 5: Typical Code for 5-Day Hi/Lo Temperatures.

At this point, we are ready to use AppleScript’s Text Item Delimiters to parse the HTML for the parts we want. Parsing, for those not familiar with the term, is the process of analyzing a sequence of tokens in text to determine its structure with respect to a given grammar. A parser is the component of a compiler that carries out this task. Parsing transforms input text into a data structure; in AppleScript, that is a set of lists of the text items separated by the text item delimiters. If all of this is still fuzzy, stop here and read the article in MacScripter.net/unScripted: A Tutorial for Using AppleScript’s Text Item Delimiters
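As a minimal sketch of the idea, here is the two-cut pattern on a short, made-up line of HTML: cut once at a marker that precedes the value and keep what follows, then cut again at a marker that follows the value and keep what precedes it. The handler name textBetween and the sample string are assumptions for illustration only.

```applescript
-- Return the text found between startMarker and endMarker,
-- or missing value if the start marker is not in the text.
on textBetween(sourceText, startMarker, endMarker)
	set savedDelims to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to startMarker
		set afterStart to text item 2 of sourceText -- everything after the first start marker
		set AppleScript's text item delimiters to endMarker
		set found to text item 1 of afterStart -- everything before the next end marker
	on error
		set found to missing value -- there was no text item 2, so the marker is absent
	end try
	set AppleScript's text item delimiters to savedDelims
	return found
end textBetween

-- Hypothetical one-line sample, not copied from the real site.
set sampleHTML to "<div class=\"degrees\">41&deg;F</div>"
textBetween(sampleHTML, "class=\"degrees\">", "&deg;") -- "41"
```

Returning missing value when the marker is absent is one possible design choice; it lets a calling script notice that a site has changed its layout instead of stopping with an error.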

If you are comfortable with text item delimiters, then the best way to proceed is with a heavily commented AppleScript to parse the HTML for 10 cities in the United States, all reached through weather.boston.com so they will have exactly the same format. Rather than trying to read the source text in a browser, we will use the Unix command-line tool curl to grab it. curl is very powerful, and rather complex. You can see its man page (if you care to) by running this script.

do shell script "man -t curl | open -f -a /Applications/Preview.app"
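A few of curl’s many options are worth knowing for this kind of job. The sketch below only builds the command text so that it can be inspected before it is run. The handler name curlCommand and the 15-second limit are my own illustrative choices; the options themselves (-s, -L and --max-time) are standard curl options.

```applescript
-- Build a curl command for a page address.
-- -s hides the progress meter, -L follows redirects, and
-- --max-time gives up after the stated number of seconds.
on curlCommand(pageURL, timeoutSeconds)
	-- "quoted form of" protects characters such as "?" and "&" from the shell.
	return "curl -s -L --max-time " & timeoutSeconds & " " & quoted form of pageURL
end curlCommand

set cmd to curlCommand("http://weather.boston.com/?code=BOS", 15)
-- cmd is now: curl -s -L --max-time 15 'http://weather.boston.com/?code=BOS'

-- To download the page, pass the command to the shell:
-- set pageSource to do shell script cmd
```

Quoting the address matters because a URL with a query string contains characters that the shell would otherwise try to interpret.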

And now… Here’s the main event; a heavily commented script.

-- Begin --
set astid to AppleScript's text item delimiters -- for later reference by variable. It is always good practice to store these and finally set them back. AppleScript's text item delimiters are a global property of AppleScript so if you leave them in a strange setting, other scripts open at the same time will respond to them.
set WkDays to getWeekdays(current date, 5, true) -- a handler (at the bottom) to compute the days of the week for the forecast, rather than finding them from the site source.
set Weather to "American Cities Weather Summary" & return -- this is the string that will become the complete summary dialog text.
set Forecast to "5-Day Forecasts for American Cities" & return & "---" & return -- this is the string that will start the 5-day forecast dialog text -- but it's too long for 10 cities to fit on one dialog, so:
set FCPage2 to "5-Day Forecasts Continued," & return & "---" & return -- the dialog for forecasts for the last five cities.
set Cities to {"Boston", "New York City", "Wash. DC", "Miami", "Chicago", "Denver", "Houston", "San Francisco", "San Diego", "Seattle"} -- these are spread out all over the USA
-- City codes in the same order as the list of cities above -- modify to suit yourself. They appear in the URL to get the weather for each city.
set Codes to {"BOS", "NYC", "DCA", "MIA", "ORD", "DEN", "IAH", "SFO", "SAN", "SEA"}
-- Offer a choice of cities using the handler to get back the positions of the choices in the lists.
set ChosenCities to getChosen(Cities)
(*
You can look up more codes by going to http://weather.boston.com/
and using the search boxes on the right. When you choose one, the
code will appear at the end of the URL for the page. A more complete
list is at: http://stuff.mit.edu/doc/weather-cities.txt Searching for
codes by city in that list is a fast way to find lesser known Cities in
the USA. (Canadian codes are not included in either of these, but
Google is your friend if you want to try some of them. I didn't
explore codes for cities elsewhere in the world if they exist.)
*)
-- Start of code to parse the weather.boston.com site.
-- set some text item delimiter definitions we'll need.
set cond_start to "div id=\"currentCond\"" -- chop off the top of the HTML
set Parts to "<span class=\"wNumbers\">" -- general Partlocator for data
set Fore to "id=\"fiveForecast\"" -- place to start looking for 5-day forecast.
set pages to 1 -- used for displaying forecasts when the number of choices is 5 or more.
-- Start of the main loop to cycle through the cities one by one.
set Ccount to count ChosenCities
-- Check that something was chosen, then extract the data
if Ccount is not 0 then
	repeat with k from 1 to Ccount -- the main loop for dealing with chosen cities 
		set CityChoice to item k of ChosenCities
		-- placeholders inside the loop so they'll be reset for each city.
		set WC to {} -- a placeholder for weather conditions in a city.
		set WF to {} -- a placeholder for forecast conditions in a city.
		set HiLo to {} -- a placeholder for daily high/low temperatures
		set CtyCode to item CityChoice of Codes
		set City to item CityChoice of Cities
		-- Use cURL to download the HTML for the page. This is the major time
		-- consumer because we are looking up 10 cities in this example.
		set T to (do shell script "curl " & quoted form of ("http://weather.boston.com/?code=" & CtyCode))
		
		-- CURRENT CONDITIONS EXTRACTION for EACH PAGE
		-- Now parse the data for each. The advantage of using a single site
		-- for all of them becomes clear: the format is the same for all of them.
		set AppleScript's text item delimiters to cond_start -- to chop off most of the front end we'll keep everything after this.
		set lastPart to text item 2 of T -- keep the last part, where the action is.
		set AppleScript's text item delimiters to "alt=" -- look for the conditions image's alternative text
		set Y to text item 2 of lastPart -- grab what's after alt=
		set AppleScript's text item delimiters to " height" -- look for what follows the conditions
		set Conds to text item 1 of Y -- the text for current conditions, eg: "Mostly Cloudy"
		set AppleScript's text item delimiters to "class=\"degrees\">" -- look for temperature next
		-- NOTE: when the astid contains quotes, they must be escaped as they are above.
		set X to text item 2 of lastPart -- grab all of it
		set AppleScript's text item delimiters to "&deg;" -- look for what follows the number
		-- NOTE: the delimiter above must be the HTML entity for a degree symbol: an ampersand, "&", followed by the letters "deg", and then by a semicolon, ";". A browser may display the entity as the degree symbol itself; if the script you download shows the symbol, replace it with the entity.
		set tTemp to text item 1 of X -- the first part is the temperature number we want.
		set AppleScript's text item delimiters to Parts -- now move down to the numbers for conditions
		set tItems to (text items 2 thru -1 of lastPart) -- we want the first five of them and ignore the rest -- we start with item 2 because text item 1 is the "stuff" before the first delimiter, which we don't want.
		-- grab the pieces from the data set
		repeat with anItem in tItems -- cycle through conditions and move data to our storage.
			set end of WC to first word of contents of anItem -- individual figures
		end repeat -- end of grabbing data for conditions.
		set AppleScript's text item delimiters to astid
		set {feel, tWind, Dir, tHum, Bar} to WC -- give the five items variable names for easy reference in building the dialog text (instead of item 1 of..., item 2 of ....)
		
		-- Build the dialog display for current conditions (ASCII character 188 serves as the degree symbol)
		-- NOTE: if the display is too tall for your screen, remove some cities and their codes
		-- from the Cities list and Codes list definitions at the beginning of the script.
		set Weather to Weather & "---" & return & City & ", " & Conds & " and " & tTemp & (ASCII character 188) & "F" & ", " & "Feels: " & feel & (ASCII character 188) & "F" & return & Dir & " Wind " & " @ " & tWind & " mph " & "with " & tHum & "% Rel. Humidity" & return & "Barometric Pressure " & Bar & " in." & return
		if k = Ccount then set Weather to Weather & "---" -- to bound the last entry only.
		
		-- 5-DAY FORECAST EXTRACTION FROM EACH PAGE
		-- Now build the forecast summary for the cities; refer to the forecast figure.
		-- We want the words under the images, not the images themselves, and
		-- because we know what day it is, we don't need the text for the weekdays.
		-- We will calculate the days of the week required using a handler.
		set AppleScript's text item delimiters to Fore -- a unique indicator in the site code.
		set partTwo to text item 2 of lastPart -- forecast data is after our first cut above.
		-- We could have started from the html itself (the variable "T"), 
		-- but moving down reduces the searching (not that TID searches are slow).
		set AppleScript's text item delimiters to "class=\"forecast\">"
		set Z to text items of partTwo
		set AppleScript's text item delimiters to "</td>" -- the end of each forecast item
		repeat with m from 2 to 6
			set end of WF to first text item of item m of Z
		end repeat -- end of extracting forecast data
		set AppleScript's text item delimiters to "class=\"hightemp\">"
		set Q to text items 2 thru 6 of partTwo
		set AppleScript's text item delimiters to "class=\"lowtemp\">"
		set R to text items 2 thru 6 of partTwo
		set AppleScript's text item delimiters to "&deg;" -- follows the temperatures.
		-- See the note above: the delimiter must be the HTML entity, not the degree symbol itself.
		repeat with n from 1 to 5 -- 5-day forecasts
			set Hi to text item 1 of item n of Q
			set Lo to text item 1 of item n of R
			set HiLo's end to "" & Hi & "/" & Lo & space & (ASCII character 188) & "F"
		end repeat -- end of repeat through forecast days
		set AppleScript's text item delimiters to astid
		-- build the Forecast text using items from WkDays, WF (the conditions), & HiLo
		set Forecast to Forecast & "For: " & City & return -- heading for each city
		-- add the five day conditions & hi/lo temperatures
		repeat with m from 1 to 5 -- Pick out the data from lists
			set WD to item m of WkDays
			set dayFCst to item m of WF
			set dayHiLo to item m of HiLo
			set Forecast to Forecast & WD & ": " & dayFCst & " Hi/Lo =  " & dayHiLo & return
		end repeat -- end of building forecasts list
		set Forecast to Forecast & "---" & return
		if k = 5 and Ccount > 5 then -- 10 cities won't fit on one dialog box, so we'll do 5 and 5.
			copy Forecast to FCPage1 -- first "page"
			copy FCPage2 to Forecast -- second "page"
			set pages to 2
		end if
	end repeat -- end of repeat through the chosen cities
	
	-- display Weather with button choice for displaying forecast as well.
	set B to button returned of (display dialog Weather buttons {"Done", "Forecast"} default button "Forecast")
	if B is "Forecast" then -- show the Forecast data only if it was asked for.
		if pages = 2 then
			if button returned of (display dialog FCPage1 buttons {"Cancel", "More"} default button "More") is "More" then display dialog Forecast -- now the second page from above.
		else
			display dialog Forecast buttons {"Done"} default button 1
		end if
	end if
end if -- end of check for something chosen.

-- Handler to list some weekdays from a date including the day of that date if startToday is true.
-- For this example, the forecast lists do start with today, so we set it to true.
to getWeekdays(aDate, howMany, startToday) -- startToday is true/false, false is tomorrow.
	set startnum to 1 -- start list from tomorrow
	if startToday then set startnum to 2 -- starts the list with today's weekday
	-- Note that weekday names are AppleScript constants.
	set WkDays to {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday}
	set WkDy to (weekday of aDate) as number
	set WD to {}
	-- note fiddle because there is no weekday numbered 0, they start with Sunday = 1
	repeat with k from 1 to howMany
		set WD's end to item (((WkDy + k - startnum) mod 7) + 1) of WkDays as text
	end repeat
	return WD
end getWeekdays

-- Handler to return the position of an item in a list of choices so an 
-- item in a companion list can be substituted for the choice. In this example,  
-- a City is chosen from a list of cities, but a City's code is also required.
-- The handler returns a list of numbers which is used to get both the
-- cities and the corresponding city codes for the weather.
to getChosen(aList)
	set numList to {}
	set chosen to {}
	-- Make a numbered list in "numList".
	repeat with k from 1 to count aList
		set end of numList to (k as text) & ". " & item k of aList -- number the original list.
	end repeat
	-- Offer a choice from the numbered list.
	set choices to choose from list numList with prompt "Hold the Command key down to make multiple choices." with multiple selections allowed -- get the picks now including numbers.
	if choices is false then return {} -- "choose from list" returns false if the user cancels.
	-- Extract the numbers only from the choices found in list "choices".
	repeat with j from 1 to count choices
		tell choices to set end of chosen to (text 1 thru ((my (offset of "." in item j)) - 1) of item j) as integer -- look for the period in the numbered list, and take the characters before it.
	end repeat
	-- Return a list of numbers for chosen positions in the given list.
	return chosen
end getChosen
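The index arithmetic in getWeekdays is easier to trust after stepping through it with a fixed starting day. The sketch below is a stand-alone illustration that uses plain text names and a pretend weekday number instead of a real date; the variable names dayNames, todayNumber and nextFive are mine.

```applescript
-- Weekday numbers run from 1 (Sunday) to 7 (Saturday).
set dayNames to {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"}
set todayNumber to 6 -- pretend today is Friday
set startnum to 2 -- 2 starts the list with today, as in getWeekdays
set nextFive to {}
repeat with k from 1 to 5
	-- Shift to a 0-based offset, wrap around with mod 7, then shift back to 1-based.
	set end of nextFive to item (((todayNumber + k - startnum) mod 7) + 1) of dayNames
end repeat
nextFive -- {"Friday", "Saturday", "Sunday", "Monday", "Tuesday"}
```

The wrap from Saturday to Sunday happens at k = 3, where 7 mod 7 gives 0 and adding 1 lands on item 1 of the list.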

I’ve used the weather in 10 American cities as the example here, but if you’ve followed the process, you’ll be able to extract any data from a web page if the data you want appears in the source HTML for the page. Build your own notification and enjoy.
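For data that repeats on a page, such as the five forecast cells, the same idea can be wrapped in a small reusable handler. This is only a sketch under assumptions: the handler name allBetween and the sample table row are made up, and a real page will need its own markers.

```applescript
-- Collect every piece of text that sits between startMarker and endMarker.
on allBetween(sourceText, startMarker, endMarker)
	set savedDelims to AppleScript's text item delimiters
	set foundItems to {}
	set AppleScript's text item delimiters to startMarker
	set pieces to text items of sourceText
	set AppleScript's text item delimiters to endMarker
	repeat with i from 2 to count of pieces -- piece 1 precedes the first marker
		set end of foundItems to text item 1 of (item i of pieces)
	end repeat
	set AppleScript's text item delimiters to savedDelims
	return foundItems
end allBetween

-- Hypothetical row standing in for the forecast section of a page.
set sampleRow to "<tr><td class=\"forecast\">Sunny</td><td class=\"forecast\">Rain</td><td class=\"forecast\">Mostly Cloudy</td></tr>"
allBetween(sampleRow, "class=\"forecast\">", "</td>") -- {"Sunny", "Rain", "Mostly Cloudy"}
```

If the start marker never occurs, the repeat loop does not run and the handler returns an empty list, which is easy to test for before building a dialog.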

Thank you for this tutorial on how to build my own application, and also for the free weather lesson :stuck_out_tongue:

Nice tutorial, Adam.

If you are a little more adventuresome, the same results could be accomplished with a lot less effort if you are a bit familiar with ruby and ruby gems.

See tutorial here:
http://asciicasts.com/episodes/190-screen-scraping-with-nokogiri

This method could be used with an AppleScript ‘do shell script’ to run the Ruby part, which would eliminate all of the text item gyrations you need for AS to do it. Using the “SelectorGadget” bookmarklet means that you don’t even have to look at the HTML code…you can just click on the section of the page you are interested in to get the CSS selector that will deliver the portion of the page you want. Easy…I like easy.

Vince

Adam - any chance that you might be able to help me with this little problem? Has me completely stumped and I’m sure that you will be able to solve it for me…!

http://macscripter.net/viewtopic.php?id=36185

Thanks very much.

Ed

Hi, I’m having trouble parsing the source of a Web page and need some help.

In essence, I want the blurb after “Tonight” to return as text. Here is my code, which only seems to return the whole source.

set astid to AppleScript's text item delimiters

set startHere to "Tonight</div>"
set stopHere to "</div><div>"

set blurb0 to (do shell script "curl http://www.wund.com/global/stations/71508.html")
set AppleScript's text item delimiters to startHere
set blurb1 to text items of blurb0
set AppleScript's text item delimiters to stopHere
set blurb2 to text items of blurb1

set AppleScript's text item delimiters to astid

blurb2

What am I doing wrong?

Model: MacBook Pro 13" Lion
AppleScript: 2.2.1
Browser: Safari 534.49
Operating System: Mac OS X (10.6)

Hi,

To extract a portion of text between a start and an end point, specify the text item you want to keep:


set astid to AppleScript's text item delimiters

set startHere to "Tonight</div>"
set stopHere to "</div><div>"

set blurb0 to (do shell script "curl http://www.wund.com/global/stations/71508.html")
set AppleScript's text item delimiters to startHere
set blurb1 to text item 2 of blurb0
set AppleScript's text item delimiters to stopHere
set blurb2 to text item 1 of blurb1

set AppleScript's text item delimiters to astid

blurb2
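The difference is easiest to see on a short literal string: “text items” returns a list of all the pieces, while “text item 2” returns one piece as text that can be cut again. The fragment below is made up for illustration and is not the real structure of the page.

```applescript
-- Hypothetical fragment, not copied from the real page.
set sample to "<div><div>Tonight</div>Clear with a low of 3.</div><div><div>Tomorrow</div>Sunny.</div>"
set savedDelims to AppleScript's text item delimiters

set AppleScript's text item delimiters to "Tonight</div>"
set allPieces to text items of sample -- a list holding every piece
set afterStart to text item 2 of sample -- one piece of text: everything after the start marker

set AppleScript's text item delimiters to "</div><div>"
set blurb to text item 1 of afterStart -- one piece of text: everything before the stop marker

set AppleScript's text item delimiters to savedDelims
{class of allPieces, class of afterStart, blurb} -- {list, text, "Clear with a low of 3."}
```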


Hi Stefan, thanks for the advice. Each day the blurb I want to keep will be different. So it makes more sense to me to find the div tags that hold it.

Any other suggestions?

Use textutil to strip all HTML tags:


set startHere to "Tonight" & return

set blurb0 to (do shell script "curl http://www.wund.com/global/stations/71508.html  | textutil -stdin -stdout -format html -convert txt -encoding UTF-8 ")
set TID to text item delimiters
set text item delimiters to startHere
set blurb1 to text item 2 of blurb0
set text item delimiters to TID
set blurb2 to paragraph 1 of blurb1
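To try the textutil step without downloading anything, the same conversion can be fed a literal string. This is a sketch with a made-up HTML fragment; the textutil options are the ones used above, and printf is only there to hand the string to it.

```applescript
-- Hypothetical fragment standing in for the downloaded page.
set sampleHTML to "<html><body><div>Tonight</div><div>Clear with a low of 3.</div></body></html>"

-- Pipe the HTML through textutil, which converts it to plain text.
set plainText to do shell script "printf '%s' " & quoted form of sampleHTML & " | textutil -stdin -stdout -format html -convert txt -encoding UTF-8"

paragraphs of plainText -- the tags are gone; each block of text is its own paragraph
```

By default “do shell script” changes the line endings of its result to returns, which is why a delimiter built with “return” finds the end of a line in the converted text.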


Awesome! Thanks Stefan.

For a more globally oriented script, instead of US only, you can use the Google Weather API. Because this is an API, the returned data is XML and can be parsed easily with System Events. A small example of how you can use the current state…


set theCity to "Amsterdam"
set rawXMLData to do shell script "curl " & quoted form of ("http://www.google.com/ig/api?weather=" & theCity)

tell application "System Events"
	set xmldata to make new XML data with data rawXMLData
	set xml_api_reply to XML element "xml_api_reply" of xmldata
	set theWeather to XML element "weather" of xml_api_reply
	--{"forecast_information", "current_conditions", "forecast_conditions", "forecast_conditions", "forecast_conditions", "forecast_conditions"}
	
	set a to value of XML attribute "data" of XML element "condition" of XML element "current_conditions" of theWeather
	set b to value of XML attribute "data" of XML element "temp_c" of XML element "current_conditions" of theWeather
	set c to value of XML attribute "data" of XML element "humidity" of XML element "current_conditions" of theWeather
	set d to value of XML attribute "data" of XML element "wind_condition" of XML element "current_conditions" of theWeather
	
	set currentConditions to {condition:a, temp:b, humidity:c, wind:d}
end tell
return currentConditions 
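The same System Events calls can be tried on a literal string, which makes it easier to see how the element names map onto the XML. The sample below is a made-up fragment that only borrows the element names from the script above; it is not a captured reply from the service.

```applescript
-- Hypothetical XML, shaped like the element names used above.
set sampleXML to "<xml_api_reply><weather><current_conditions><condition data=\"Cloudy\"/><temp_c data=\"12\"/></current_conditions></weather></xml_api_reply>"

tell application "System Events"
	set xmlDoc to make new XML data with data sampleXML
	-- Walk down from the root element to the node that holds the readings.
	set nowNode to XML element "current_conditions" of XML element "weather" of XML element "xml_api_reply" of xmlDoc
	set conditionText to value of XML attribute "data" of XML element "condition" of nowNode
	set tempC to value of XML attribute "data" of XML element "temp_c" of nowNode
end tell

{conditionText, tempC} -- {"Cloudy", "12"}
```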

Awesomer! :smiley:

Indeed it is great stuff – really fast for one thing. :slight_smile:

I have made an adjustment (because I’m a geek):

Save this script to a convenient place


to respondYes()
	set responseList to {"Certainly sir", "Indeed sir.", "On the double sir.", "Right away sir."} as list --your initial list values
	set the listCount to the count of responseList
	set pick to random number from 1 to listCount
	set responseText to item pick of responseList as string
	say responseText
end respondYes

Point (the first line of) this script to that convenient place.


set responder to (load script alias ":path:to:file.scpt")
tell responder to respondYes()

set theCity to "Toronto"
set rawXMLData to do shell script "curl " & quoted form of ("http://www.google.com/ig/api?weather=" & theCity)

tell application "System Events"
	set xmldata to make new XML data with data rawXMLData
	set xml_api_reply to XML element "xml_api_reply" of xmldata
	set theWeather to XML element "weather" of xml_api_reply
	--{"forecast_information", "current_conditions", "forecast_conditions", "forecast_conditions", "forecast_conditions", "forecast_conditions"}
	
	set a to value of XML attribute "data" of XML element "condition" of XML element "current_conditions" of theWeather
	set b to value of XML attribute "data" of XML element "temp_c" of XML element "current_conditions" of theWeather
	set c to value of XML attribute "data" of XML element "humidity" of XML element "current_conditions" of theWeather
	set c2 to text 11 thru -1 of c -- everything after "Humidity: ", whatever the number of digits
	set d to value of XML attribute "data" of XML element "wind_condition" of XML element "current_conditions" of theWeather
	
	set currentConditions to {condition:a, temp:b, humidity:c2, wind:d}
	
	set theCondition to "The current weather for " & theCity & " is " & a & ", at " & b & " degrees. Humidity at " & c2
	say theCondition
	
end tell

return currentConditions