HTML to CSV

pendolo · September 18, 2012, 7:59pm

Hi, I was wondering if it’s possible to have a script do the following:

Scan a folder for HTML or PHP files
Look inside the <body> of the files
Extract all text in alt, p, span, h tags
Put all text items into an Excel file in single cells of a column
Add a few more standard template columns to the Excel file

Feasible?

DJ_Bazzie_Wazzie · September 18, 2012, 9:56pm

That would be quite easy with some regex.

Scan a folder for HTML or PHP files
Shell command find or mdfind can help you with this
Look inside the <body> of the files
Extract all text in alt, p, span, h tags
Could be done with an offline JavaScript runner or with regex

Put all text items into an Excel file in single cells of a column
Can be saved into CSV directly (without Excel) or copied into the cells of an open Excel document

Add a few more standard template columns to the Excel file
Is as easy as adding normal data, because a header in CSV format is interpreted as a normal cell.
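
As a rough illustration of the extraction step outlined above, here is a minimal sketch in Python (not AppleScript, and not code from anyone in this thread). It assumes the standard library's html.parser is acceptable instead of regex; the names TEXT_TAGS, TextCollector and extract_text are made up for the example.

```python
from html.parser import HTMLParser

# Tags whose text we want, per the original request (p, span, h1..h6).
TEXT_TAGS = {"p", "span", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextCollector(HTMLParser):
    """Collects alt attribute values and the text inside TEXT_TAGS."""

    def __init__(self):
        super().__init__()
        self.items = []   # extracted strings, in document order
        self.depth = 0    # how many TEXT_TAGS are currently open
        self.buffer = []  # text pieces of the outermost open TEXT_TAG

    def handle_starttag(self, tag, attrs):
        alt = dict(attrs).get("alt")
        if alt:
            self.items.append(alt.strip())
        if tag in TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.items.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_text(html):
    """Return alt values and p/span/h text found in an HTML string."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.items
```

Calling extract_text on the contents of each file found by find or mdfind would give the list of strings to put into the column. Nested text tags are merged into one item here, which may or may not be what you want.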

pendolo · September 18, 2012, 10:13pm

Great, hmm could anyone help with the code? I’m not too familiar with it
Thanks

adayzdone · September 19, 2012, 12:33am

Hi pendolo,

There are lots of folks here who will be happy to help. Rather than asking for an entire script, post your code so far. Even if there is not much to show, it will show some effort!

pendolo · September 19, 2012, 11:09am

Eheh I know, I’m trying to find similar scripts that could be adapted

Basically, the desired output is an Excel file looking like this:
http://i45.tinypic.com/14e8diq.png

DJ_Bazzie_Wazzie · September 19, 2012, 1:58pm

Agreed. I’m on holiday, so I’ll pass this week. Getting the text between two words (tags) with sed or awk is quite easy.

do shell script "cat $HOME/Sites/index.html | sed -n '/<body>/,/<\\/body>/p'"
--or awk 
do shell script "cat $HOME/Sites/index.html | awk '/<body>/,/<\\/body>/'"
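
For comparison, the same "text between two tags" idea can be sketched in Python. This is only an illustration with assumed names (BODY_RE, body_of). Unlike the sed and awk ranges above, which print whole lines and need a bare <body> tag, it also accepts attributes on the opening tag, though it would still be fooled by a ">" inside an attribute value.

```python
import re

# <body ...> up to the first </body>, case-insensitive, across lines.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def body_of(html):
    """Return the text between <body ...> and </body>, or None if absent."""
    match = BODY_RE.search(html)
    return match.group(1) if match else None
```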

McUsr · September 19, 2012, 2:14pm

But what does your uniform input look like? Do you think you could post the structure of an HTML file?
No data necessary, just the elements, and a note about which elements repeat, if any.

pendolo · September 19, 2012, 2:29pm

Example could be any page, let’s say: http://www.apple.com/iphone/

Looking at the source in Chrome, basically all the black text inside the tags. Maybe … ? And alt=“xx”?

pendolo · September 19, 2012, 4:56pm

Maybe some HTML like this:

Apple - iPhone 5 - The thinnest, lightest, fastest iPhone ever.

<div id="main">
	<div class="content" id="content" data-hires="true">
		<div class="autogallery autogallery-slide-directional autogallery-slideshow-hero gallery flushleft flushright" data-hires="false" id="gallery-hero-videos">
			<div id="gallery-hero-videos-default" class="gallery-hero-videos-default">
				<div class="gallery-view" id="hero-gallery">
					
					<figure class="gallery-content gallery-hero" id="gallery-hero">
						<span class="block"/>
							<div class="image"></div>
							<p class="title">iPhone 5. The biggest thing to happen to iPhone since iPhone.</p>
						</span>
					</figure>
					
					<figure class="gallery-content gallery-design" id="gallery-design">
						<span class="block"/>
							<div class="image"></div>
							<span class="title">Thinner, lighter design. So much more than before. And so much less, too.</span>
						</span>
					</figure>
					<figure class="gallery-content gallery-display" id="gallery-display">
						<span class="block"/>
							<div class="image"></div>
							<h2 class="title">4-inch Retina display. It's not just bigger. It's just right.</h2>
						</span>
					</figure>
					<figure class="gallery-content gallery-wireless" id="gallery-wireless">
						<span class="block"/>
							<div class="image"></div>
							<h1 class="title">Ultrafast wireless. Browse, download, and stream content at blazing-fast speeds.</h1>
						</span>
					</figure>
					<figure class="gallery-content gallery-a6" id="gallery-a6">
						<span class="block"/>
							<div class="image"></div>
							<h1 class="title">A6 chip. Performance and graphics up to twice as fast.</h1>
						</span>
					</figure>
					<figure class="gallery-content gallery-camera" id="gallery-camera">
						<span class="block"/>
							<div class="image"></div>
							<h1 class="title">iSight camera. The camera you love now shoots in panorama.</h1>
						</span>
					</figure>
					<figure class="gallery-content gallery-ios" id="gallery-ios">
						<span class="block"/>
							<div class="image"></div>
							<h1 class="title">iOS 6. The world's most advanced mobile operating system.</h1>
						</span>
					</figure>
				</div>

			</div>
		</div>

	</div><!--/content-->
</div><!--/main-->

To output a spreadsheet like this:

type | segmentCode | geoCode | formatCode | languageCode | keyPath | description | isPublished | isOneOff | value
Text | all | ww | common | en | .1 | description 1 | TRUE | FALSE | Apple - iPhone 5 - The thinnest, lightest, fastest iPhone ever.
Text | all | ww | common | en | .2 | description 2 | TRUE | FALSE | iPhone 5. The biggest thing to happen to iPhone since iPhone.
Text | all | ww | common | en | .3 | description 3 | TRUE | FALSE | Thinner, lighter design. So much more than before. And so much less, too.
Text | all | ww | common | en | .4 | description 4 | TRUE | FALSE | 4-inch Retina display. It’s not just bigger. It’s just right.
Text | all | ww | common | en | .5 | description 5 | TRUE | FALSE | Ultrafast wireless. Browse, download, and stream content at blazing-fast speeds.
Text | all | ww | common | en | .6 | description 6 | TRUE | FALSE | A6 chip. Performance and graphics up to twice as fast.
Text | all | ww | common | en | .7 | description 7 | TRUE | FALSE | iSight camera. The camera you love now shoots in panorama.
Text | all | ww | common | en | .8 | description 8 | TRUE | FALSE | iOS 6. The world’s most advanced mobile operating system.
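
To make the target layout concrete, here is a small Python sketch that fills the ten columns listed above from a list of extracted strings. The fixed values (Text, all, ww, common, TRUE, FALSE) are copied from the example rows; the function names build_rows and to_csv are assumptions for illustration only.

```python
import csv
import io

HEADER = ["type", "segmentCode", "geoCode", "formatCode", "languageCode",
          "keyPath", "description", "isPublished", "isOneOff", "value"]


def build_rows(values, language="en"):
    """One row per extracted string, numbered from 1 as in the example."""
    rows = []
    for number, value in enumerate(values, start=1):
        rows.append(["Text", "all", "ww", "common", language,
                     f".{number}", f"description {number}",
                     "TRUE", "FALSE", value])
    return rows


def to_csv(values, language="en"):
    """Return the whole table as CSV text, header row first."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(build_rows(values, language))
    return out.getvalue()
```

The csv module quotes any value that contains a comma, so text such as "Thinner, lighter design." stays in a single cell when Excel opens the file.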

McUsr · September 20, 2012, 12:12pm

Hello

I’ll see if I get the time to start this tonight. What I have to offer you is a handler that transforms the structure of the HTML into lines of hyphen-separated columns.

If there are class elements in id tags, those will also be put into a column. I’m not sure at the moment what I can manage to do with id tags without class elements.

I will not assume that the structure is as even and nice as on that page. If the structure is skewed, in the sense that some id tag has a deeper structure within it than others, then it will be up to you to fix that in the spreadsheet.

I will also only provide you with the handler to read an HTML file as UTF-8 from your desktop or similar.

Yvan_Koenig · September 20, 2012, 3:48pm

Maybe this code will help:


set rapport to "wasHtml_lmtHsaw."
set p2d to path to desktop as text
set cheminTexte to p2d & rapport & "txt"
set cheminCsv to p2d & rapport & "csv"

character 2 of (0.5 as text)
if result = "." then
	set delim to ","
else
	set delim to ";"
end if
set oTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to ""
--set fichierHtml to ((path to desktop as text) & "bts-emea-geo-mac-landing-page.html") as alias
set fichierHtml to choose file of type {"public.html"}
set htmlBrut to read fichierHtml
# Init a list with the header row
set enListe to {my recolle({"type", "segmentCode", "geoCode", "formatCode", "languageCode", "keyPath", "description", "isPublished", "isOneOff", "value"}, delim)}
# extract the language ID and insert it in the row template
set rowTemplate to {"", "", "", "", "", "", "", "", "", ""}
item 2 of my decoupe(htmlBrut, " lang=" & quote)

set item 5 of rowTemplate to item 1 of my decoupe(result, quote)

# Use the shell command textutil to convert the HTML contents into a UTF-8 text file
quoted form of POSIX path of fichierHtml
do shell script "textutil -convert txt " & result

tell application "System Events" to tell disk item (fichierHtml as text)
	set {nomHtml, dossierHtml, extHtml} to {name, path of container, name extension}
	(text 1 thru -(1 + (count extHtml)) of nomHtml) & "txt"
end tell
set fichierTexte to dossierHtml & result

set enMorceaux to paragraphs of (read file fichierTexte as «class utf8»)
repeat with i from 1 to count enMorceaux
	copy rowTemplate to uneFiche # copy, so the template itself is never modified
	set uneValeur to (item i of enMorceaux) as text
	# do a bit of cleaning
	if uneValeur contains tab then set uneValeur to my remplace(uneValeur, tab, space)
	# Replace character $FFFC (OBJECT REPLACEMENT CHARACTER) by a space
	if uneValeur contains (character id 65532) then
		set uneValeur to my remplace(uneValeur, (character id 65532), space)
	end if
	
	(*
	# Replacing every character $2028 (LINE SEPARATOR) by a space
	# is easier than removing leading ones but it may be annoying for you
	if uneValeur contains (character id 8232) then
		set uneValeur to my remplace(uneValeur, (character id 8232), space)
	end if
	*)
	# Remove every leading character $2028 (LINE SEPARATOR) if there is no space before
	repeat while uneValeur starts with (character id 8232)
		set uneValeur to (characters 2 thru -1 of uneValeur) as text
	end repeat
	
	repeat while uneValeur starts with space
		if uneValeur = space then
			set uneValeur to ""
			exit repeat
		end if
		set uneValeur to text 2 thru -1 of uneValeur
	end repeat
	
	repeat while uneValeur ends with space
		set uneValeur to text 1 thru -2 of uneValeur
	end repeat
	# Remove every leading character $2028 (LINE SEPARATOR) if there was space(s) before
	repeat while uneValeur starts with (character id 8232)
		set uneValeur to (characters 2 thru -1 of uneValeur) as text
	end repeat
	
	# I didn't get original values resembling:
	# "     "&(LINE SEPARATOR)&"   wxyz" or (LINE SEPARATOR)&"     "&(LINE SEPARATOR)&"   wxyz"
	# but I decided to play safe.
	# New attempt to remove space characters in case the original value was something like:
	# "     "&(LINE SEPARATOR)&"   wxyz" or (LINE SEPARATOR)&"     "&(LINE SEPARATOR)&"   wxyz"
	repeat while uneValeur starts with space
		if uneValeur = space then
			set uneValeur to ""
			exit repeat
		end if
		set uneValeur to text 2 thru -1 of uneValeur
	end repeat
	
	repeat while uneValeur ends with space
		set uneValeur to text 1 thru -2 of uneValeur
	end repeat
	
	# Skip empty values 
	if (count of uneValeur) > 0 then
		if uneValeur contains delim then
			set item 7 of uneFiche to quote & uneValeur & quote
		else
			set item 7 of uneFiche to uneValeur
		end if
		set end of enListe to my recolle(uneFiche, delim)
	end if # not uneValeur	
end repeat

my recolle(enListe, return)
my writeTo(cheminTexte, result, «class utf8», false)

tell application "System Events"
	if exists file cheminCsv then delete file cheminCsv
	set name of disk item cheminTexte to (rapport & "csv")
end tell

set AppleScript's text item delimiters to oTIDs

--=====

on decoupe(t, d)
	local oTIDs, l
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to d
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

--=====

on recolle(l, d)
	local oTIDs, t
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to d
	set t to "" & l
	set AppleScript's text item delimiters to oTIDs
	return t
end recolle

--=====
(*
replaces every occurrence of d1 with d2 in the text t
*)
on remplace(t, d1, d2)
	local oTIDs, l
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to d1
	set l to text items of t
	set AppleScript's text item delimiters to d2
	set t to "" & l
	set AppleScript's text item delimiters to oTIDs
	return t
end remplace

--=====
(*
Handler borrowed from Regulus6633 - http://macscripter.net/viewtopic.php?id=36861
*)
on writeTo(targetFile, theData, dataType, apendData)
	-- targetFile is the path to the file you want to write
	-- theData is the data you want in the file.
	-- dataType is the data type of theData and it can be text, list, record etc.
	-- apendData is true to append theData to the end of the current contents of the file or false to overwrite it
	try
		set targetFile to targetFile as text
		set openFile to open for access file targetFile with write permission
		if not apendData then set eof of openFile to 0
		write theData to openFile starting at eof as dataType
		close access openFile
		return true
	on error
		try
			close access file targetFile
		end try
		return false
	end try
end writeTo

--=====

on activateGUIscripting()
	(* to be sure that GUI scripting will be active *)
	tell application "System Events"
		if not (UI elements enabled) then set (UI elements enabled) to true
	end tell
end activateGUIscripting

--=====
(*
==== Uses GUIscripting ==== 
*)
(*
This handler may be used to 'type' text or invisible characters if the third parameter is an empty string.
It may be used to 'type' keyboard shortcuts if the third parameter describes the required modifier keys.

I changed its name from « shortcut » to « raccourci » to get rid of a name conflict in Smile.
*)
on raccourci(a, t, d)
	local k
	activate application a
	tell application "System Events" to tell application process a
		set frontmost to true
		try
			t * 1
			if d is "" then
				key code t
			else if d is "c" then
				key code t using {command down}
			else if d is "a" then
				key code t using {option down}
			else if d is "k" then
				key code t using {control down}
			else if d is "s" then
				key code t using {shift down}
			else if d is in {"ac", "ca"} then
				key code t using {command down, option down}
			else if d is in {"as", "sa"} then
				key code t using {shift down, option down}
			else if d is in {"sc", "cs"} then
				key code t using {command down, shift down}
			else if d is in {"kc", "ck"} then
				key code t using {command down, control down}
			else if d is in {"ks", "sk"} then
				key code t using {shift down, control down}
			else if (d contains "c") and (d contains "s") and d contains "k" then
				key code t using {command down, shift down, control down}
			else if (d contains "c") and (d contains "s") and d contains "a" then
				key code t using {command down, shift down, option down}
			end if
		on error
			repeat with k in t
				if d is "" then
					keystroke (k as text)
				else if d is "c" then
					keystroke (k as text) using {command down}
				else if d is "a" then
					keystroke k using {option down}
				else if d is "k" then
					keystroke (k as text) using {control down}
				else if d is "s" then
					keystroke k using {shift down}
				else if d is in {"ac", "ca"} then
					keystroke (k as text) using {command down, option down}
				else if d is in {"as", "sa"} then
					keystroke (k as text) using {shift down, option down}
				else if d is in {"sc", "cs"} then
					keystroke (k as text) using {command down, shift down}
				else if d is in {"kc", "ck"} then
					keystroke (k as text) using {command down, control down}
				else if d is in {"ks", "sk"} then
					keystroke (k as text) using {shift down, control down}
				else if (d contains "c") and (d contains "s") and d contains "k" then
					keystroke (k as text) using {command down, shift down, control down}
				else if (d contains "c") and (d contains "s") and d contains "a" then
					keystroke (k as text) using {command down, shift down, option down}
				end if
			end repeat
		end try
	end tell
end raccourci

--=====

on safeCopy(theApp)
	local tt
	(*
Fill the clipboard with a fake string *)
	set tt to "All The Things You Could Be By Now If Sigmund Freud's Wife Was Your Mother, © Charles Mingus"
	tell current application to set the clipboard to tt
	(*
Copy the selected item *)
	my raccourci(theApp, "c", "c")
	(*
Loop waiting for the Copy task to complete. *)
	repeat 50 times
		try
			tell current application
				the clipboard as text
			end tell
			if result is not tt then exit repeat
		on error
			delay 0.1
		end try
	end repeat
end safeCopy

--=====

Yvan KOENIG (VALLAURIS, France) Thursday, September 20, 2012, 17:47:52
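
For readers who want to see the cleaning rules of the script above in isolation, here is a hedged Python sketch of the same steps: tabs and U+FFFC become spaces, leading spaces and U+2028 characters are dropped, trailing spaces are dropped, empty values are skipped, and values containing the delimiter are quoted. The function names are invented for the example, and doubling embedded quotes is an addition that the AppleScript above does not do.

```python
OBJECT_REPLACEMENT = "\ufffc"  # OBJECT REPLACEMENT CHARACTER
LINE_SEPARATOR = "\u2028"      # LINE SEPARATOR


def clean_value(raw):
    """Apply the whitespace cleanup to one paragraph of converted text."""
    value = raw.replace("\t", " ").replace(OBJECT_REPLACEMENT, " ")
    # Leading spaces and line separators may be interleaved; strip both.
    value = value.lstrip(" " + LINE_SEPARATOR)
    return value.rstrip(" ")


def csv_field(value, delim=","):
    """Quote the field if it contains the delimiter or a double quote."""
    if delim in value or '"' in value:
        return '"' + value.replace('"', '""') + '"'
    return value


def clean_paragraphs(paragraphs, delim=","):
    """Clean every paragraph, skip empty ones, and quote where needed."""
    cleaned = (clean_value(p) for p in paragraphs)
    return [csv_field(v, delim) for v in cleaned if v]
```

Stripping both characters in a single pass replaces the repeated "remove spaces, remove separators, remove spaces again" loops, since the order in which they are interleaved no longer matters.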

McUsr · September 22, 2012, 8:06am

Hello!

I am away this weekend, so I have to put things on ice for a bit. This post is just to show some progress, since it may be of interest for understanding how to write a parser, even a small one. I still intend to write a parser for HTML, so a parser it is.

Before you say you’ll solve a problem, you usually have a hunch that you can make it, and so did I. It should be straightforward, really, to break down an HTML structure and translate it into lines of CSV values, finally creating a file, with the constraints stated above. All it takes is to follow the nesting of the tags and extract data for the columns as we go along.

I have broken the problem down into three problem areas, the first one being how to transform the nested structure of the HTML into CSV.

Recursive routines tend to go well with parsers, so I figured that would be the best place to start.

Now, AppleScript doesn’t really go that well with recursive routines: the stack depth is very limited. The other thing is that we’ll have to parse stuff into both lines and columns, so I had to experiment a bit to come up with a recursive concept that works for creating lines and columns.

There isn’t much to say about this, other than that it keeps the stack as shallow as possible and caters for lines and columns. This is what I’ll use for inserting the HTML into a CSV table; not exactly like this, but this is the concept:

property MAXCOLS : 5 -- 300
property MAXROWS : 10
on run
	
	
	global theList, theTable, rows, cols
	set {theList, theTable, rows, cols} to {{"row :1"}, {}, 1, 0}
	recurseRows(theList, theTable, rows, cols)
	log (count theTable)
end run

to recurseRows(theList, theTable, rows, cols)
	local rowtext
	
	repeat while rows ≤ MAXROWS
		
		set cols to cols + 1
		set end of theList to cols
		
		if cols = MAXCOLS then
			set end of theTable to theList
		else
			recurseRows(theList, theTable, MAXROWS, cols)
			set cols to 0
		end if
		
		set rows to rows + 1
		set rowtext to "row :" & rows
		set theList to {rowtext}
	end repeat
	
end recurseRows
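
The idea of following the nesting and filling columns on the way down can also be shown with a much simplified Python sketch. This is not the algorithm above, only an illustration under an assumed input format: each element is a hypothetical (label, children) tuple and each text leaf is a plain string.

```python
def rows_from_tree(node, path=(), table=None):
    """Depth-first walk: each text leaf becomes one row whose leading
    columns are the labels of its ancestors."""
    if table is None:
        table = []
    if isinstance(node, str):      # a text leaf closes a row
        table.append(list(path) + [node])
        return table
    label, children = node         # an element: (label, [children])
    for child in children:
        rows_from_tree(child, path + (label,), table)
    return table
```

Rows come out with different lengths when the nesting is uneven, which is the "skewed structure" case mentioned earlier; padding them to a common width is left out of the sketch.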

McUsr · September 24, 2012, 12:10pm

Hello!

The problem is now hopefully analysed; the next step in writing the parser is to have a test bed, so I can test the progress as I go along. The recursion, with something more, is under construction. I guess I also have to write somewhere about what it will and won’t do. I’ll come back to that later.


property tlvl : me
# The main handler for parsing html into a csv table
script html2csv
	property parent : AppleScript
	property scriptTitle : "Html2Csv"
	on run
		local theCsvTable, theFileContents, fna, e, n
		try
			set {fna, theCsvTable} to {choose file, {}}
			
			set theFileContents to tlvl's readFileAsUtf8(fna)
			
			# prepare the data by removing the surrounding body block
			local startpos, endpos
			set startpos to closureposOfATag for theFileContents by "<body"
			set endpos to offset of "</body>" in theFileContents
			set theFileContents to text startpos thru (endpos - 1) of theFileContents
			
			# The actual parsing takes place here, will use prev defined recursion.
			set theCsvTable to parse(theFileContents, theCsvTable) # in progress 
			
			# write the table to disk
			tlvl's writeToFileAsUtf8((fna as text) & ".csv", theCsvTable)
			
		on error e number n
			local cr, sep, errmsg
			-- Chris Stone
			set {cr, sep} to {return, "------------------------------------------"}
			set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error Number: " & n & cr & sep
			try
				tell application "SystemUIServer"
					activate
					display dialog errmsg with title scriptTitle buttons {"Ok"} default button 1
				end tell
			end try
		end try
	end run
	
	to closureposOfATag for aText by startOfAtag
		# text is presumed to start at pos 1
		local startTagPos, endTagPos, cl, hyphCount
		
		set startTagPos to offset of startOfAtag in aText
		
		if startTagPos = 0 then error "closureposOfATag : Missing Tag" number 3099
		
		set {cl, hyphCount} to {(every character of aText), 0}
		
		repeat with i from (startTagPos + (length of startOfAtag)) to (length of aText)
			if item i of cl = "\"" then
				set hyphCount to hyphCount + 1
			else if item i of cl = ">" and hyphCount mod 2 = 0 then
				return (i + 1)
			end if
		end repeat
	end closureposOfATag
	
	to parse(utf8HtmlText, csvTable) # work in progress 
		local washedHtmlText, curLine
		return utf8HtmlText
		set {washedHtmlText, curLine} to {rinseHtml(utf8HtmlText), {}}
		
		parse2csv for washedHtmlText against csvTable by curLine
		
		# convert csvTable to text - for this to be easy, it would be smart to use text fields for starters.
		# just adding return as line-endings!
		
		return csvTable
		
	end parse
	
end script

############
tell html2csv to run
############

on writeToFileAsUtf8(fname, stuff)
	local fref, e, n, fsz
	set fref to open for access fname with write permission
	try
		write stuff to fref as «class utf8» starting at 0
		set fsz to get eof fref
		close access fref
		return fsz
	on error e number n
		try
			close access fref
		end try
		error "writeToFileAsUtf8 " & e number n
	end try
end writeToFileAsUtf8


to readFileAsUtf8(alisForTheFile)
	local fcontents, ftr, e, n
	try
		set ftr to (open for access (alisForTheFile))
		# here we steal what we can from Yvan Koenig, when it comes to UTIs and CSV.
	on error e number n
		error "readFileAsUtf8 " & e number n
	end try
	
	try
		set fcontents to (read ftr as «class utf8»)
	on error e number n
		close access ftr
		error "readFileAsUtf8 " & e number n
	end try
	
	--close the file
	try
		close access ftr
	on error e number n
		close access ftr
		error "readFileAsUtf8 " & e number n
	end try
	return fcontents
end readFileAsUtf8
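
The closureposOfATag handler in the test bed looks for the ">" that ends an opening tag while skipping any ">" found inside a quoted attribute value. A Python sketch of the same idea follows; the names are assumptions, positions are 0-based rather than 1-based, and an unterminated tag raises an error here, whereas the handler above simply returns nothing in that case.

```python
def closure_pos_of_tag(text, tag_start):
    """Return the 0-based index just past the '>' that closes the first
    tag beginning with tag_start, ignoring any '>' inside double quotes."""
    start = text.find(tag_start)
    if start == -1:
        raise ValueError("missing tag: " + tag_start)
    in_quotes = False
    for i in range(start + len(tag_start), len(text)):
        ch = text[i]
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == ">" and not in_quotes:
            return i + 1
    raise ValueError("unterminated tag: " + tag_start)


def body_contents(html):
    """Return everything between the end of <body ...> and </body>."""
    start = closure_pos_of_tag(html, "<body")
    end = html.find("</body>", start)
    if end == -1:
        raise ValueError("missing </body>")
    return html[start:end]
```

Tracking an in_quotes flag is equivalent to the even/odd quote count used above, and it avoids building a list of every character first.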

McUsr · September 24, 2012, 7:22pm

The testbed worked well!

I am now about to start on the actual parsing of the HTML, transposing it into a CSV table, under pretty much the conditions I stated at the start: the OP will have to clean up the CSV later on by deleting rows and columns, which isn’t that hard when you do it in Excel anyway.

It has struck me in the meantime that this might work well as an abstraction tool for people who want to get information off web pages on a regular basis, without having to resort to getting the data out of the DOM tree.

The rinsing of the HTML for unnecessary tags is now implemented, to make the actual parsing and transposing, which is the next step, easier.

I got some handlers out of this preliminary step that I can reuse later. As for the “tag data structures”, no final decision has been taken about those yet; I have just implemented something that works for the moment, which I think I can optimize as much as possible with specialized handlers later on.


# 1: We're implementing the rinse html handler
# We do this up front, or we'll have to drag it with us during the parsing, which
# complicates matters and makes things go slower.

property tlvl : me
# The main handler for parsing html into a csv table

script html2csv
	property parent : AppleScript
	property scriptTitle : "Html2Csv"
	
	property tagCategories : {"singleLine", "discardables", "edibles", "digestives"}
	property _singleLineTags : {{"<br />"}, {"<hr />"}, {"<embed/>"}}
	property singleLineTags : missing value # single list counterpart, for tests.
	
	property _discardables : {{"<!--;", "-->"}, {"<pre;", "</pre>"}, {"<code;", "</code>"}, {"<object;", "</object>"}, {"<form;", "</form>"}, {"<script;", "</script>"}, {"<embed;", "</embed>"}, {"<b;", "</b>"}, {"<i;", "</i>"}, {"<u;", "</u>"}, {"<small;", "</small>"}, {"<strong;", "</strong>"}, {"<strike;", "</strike>"}, {"<em;", "</em>"}, {"<span;", "</span>"}, {"<big;", "</big>"}, {"<aside;", "</aside>"}, {"<footer;", "</footer>"}}
	
	property discardables : missing value
	
	# those tags are really ignored, so we can use them in their full form.
	# This may lead to bugs, if there are classes or id or other attributes
	# set on those, but this is nothing I see as a problem, since it will really
	# just be ignored, and we have no semantic use of classes or id's inside those
	# tags anyway.
	
	property _edibles : {{"<div;", "</div>"}, {"<nav;", "</nav>"}, {"<section;", "</section>"}, {"<ul;", "</ul>"}, {"<ol;", "</ol>"}}
	property edibles : missing value
	
	property _digestives : {{"<p;", "</p>"}, {"<li;", "</li>"}, {"<a;", "</a>"}, {"<image;", "</image>"}}
	property digestives : missing value
	
	property __tagDictsInited : missing value
	
	on run
		local theCsvTable, theFileContents, fna, e, n
		try
			set {fna, theCsvTable} to {choose file, {}}
			
			set theFileContents to tlvl's readFileAsUtf8(fna)
			
			# prepare the data by removing the surrounding body block
			local startpos, endpos
			set startpos to closureposOfATag for theFileContents by "<body"
			set endpos to offset of "</body>" in theFileContents
			set theFileContents to text startpos thru (endpos - 1) of theFileContents
			
			# The actual parsing takes place here, will use prev defined recursion.
			set theCsvTable to parse(theFileContents, theCsvTable) # in progress 
			
			# write the table to disk
			tlvl's writeToFileAsUtf8((fna as text) & ".csv", theCsvTable)
			
		on error e number n
			local cr, sep, errmsg
			-- Chris Stone
			set {cr, sep} to {return, "------------------------------------------"}
			set errmsg to sep & cr & "Error: " & e & cr & sep & cr & "Error Number: " & n & cr & sep
			try
				tell application "SystemUIServer"
					activate
					display dialog errmsg with title scriptTitle buttons {"Ok"} default button 1
				end tell
			end try
		end try
	end run
	
	to closureposOfATag for aText by startOfAtag
		# text is presumed to start at pos 1
		local startTagPos, endTagPos, cl, hyphCount
		
		set startTagPos to offset of startOfAtag in aText
		
		if startTagPos = 0 then error "closureposOfATag : Missing Tag" number 3099
		
		set {cl, hyphCount} to {(every character of aText), 0}
		
		repeat with i from (startTagPos + (length of startOfAtag)) to (length of aText)
			if item i of cl = "\"" then
				set hyphCount to hyphCount + 1
			else if item i of cl = ">" and hyphCount mod 2 = 0 then
				return (i + 1)
			end if
		end repeat
	end closureposOfATag
	
	to nextTagHead(startpos, htmlText) # returns head of tag, and startpos of it.
		# we don't consider hyphens, as we should really not meet a hyphen before we reach
		# an end of a tag.
		local startTagPos, endTagPos, cl, hyphCount
		
		set startTagPos to offset of "<" in (text startpos thru -1 of htmlText)
		if startTagPos = 0 then return null # We're done.
		# special considerations for comment tags
		if text (startpos + startTagPos - 1) thru (startpos + startTagPos + 2) of htmlText = "<!--" then
			set startOffset to (startpos + startTagPos - 1)
			return {(text startOffset thru (startOffset + 3) of htmlText) & ";", startOffset}
		else
			set startOffset to (startpos + startTagPos - 1)
			set endTagPos to offset of ">" in (text startOffset thru -1 of htmlText)
			set spacePos to offset of space in (text startOffset thru -1 of htmlText)
			
			if spacePos = 0 or endTagPos < spacePos then # no space left in the text, or ">" comes first
				return {(text startOffset thru (startOffset + endTagPos - 2) of htmlText) & ";", startOffset}
			else
				return {(text startOffset thru (startOffset + spacePos - 2) of htmlText) & ";", startOffset}
			end if
		end if
	end nextTagHead
	
	
	to rinseHtml(htmlText)
		# removes as much as possible of stuff we really don't need to deal with.
		# this routine lends itself to the usage of a stack and a record,
		# to make it easy for us, to truncate away the unnecessary tags.
		
		# find next tag in the text, decide if the head of it is among the discardables
		# if it is, keep the items, and find the end of it
		# push the start and end position down, onto a stack.
		local spos, discardsStack
		
		set discardsStack to tlvl's Stack's makeStack()
		set spos to 1
		repeat
			set theTagHeadData to nextTagHead(spos, htmlText)
			if theTagHeadData is not null then
				set spos to item 2 of theTagHeadData
				set thetaghead to item 1 of theTagHeadData
			else
				exit repeat
			end if
			
			
			if ismember(thetaghead, discardables) then
				set tagtail to tagCounterPart(thetaghead, discardables)
				set endpos to findTagEndPos((spos + (length of thetaghead)), tagtail, htmlText)
				discardsStack's Push({spos, endpos})
				set spos to endpos
			else if ismember(thetaghead, singleLineTags) then
				set endpos to spos + 2
				discardsStack's Push({spos, endpos})
				set spos to endpos
			else
				set spos to spos + 2
			end if
		end repeat
		# We'll contract the text from behind - forwards
		
		local pos
		set pos to discardsStack's Pop()
		
		repeat while pos is not {}
			set htmlText to text 1 thru ((item 1 of pos) - 1) of htmlText & text ((item 2 of pos) + 1) thru -1 of htmlText
			set pos to discardsStack's Pop()
		end repeat
		
		return htmlText
	end rinseHtml
	
	to findTagEndPos(startpos, tagtail, htmlText)
		local startTagPos, endTagPos, cl, hyphCount
		
		set startTagPos to offset of tagtail in (text startpos thru -1 of htmlText)
		if startTagPos = 0 then error "findTagEndPos : Missing Tag" number 3099
		set startOffset to (startpos + startTagPos - 1)
		set endTagPos to (length of tagtail) + startOffset - 1
		return endTagPos
	end findTagEndPos
	
	
	to tagCounterPart(thetaghead, theTagSet)
		local linenumber
		set linenumber to tlvl's indexOfItem(thetaghead, theTagSet)
		return item (linenumber + 1) of theTagSet
	end tagCounterPart
	
	to parse(utf8HtmlText, csvTable) # work in progress 
		local washedHtmlText, curLine
		if __tagDictsInited is missing value then initSetsOfTags()
		
		-- return utf8HtmlText
		set {washedHtmlText, curLine} to {rinseHtml(utf8HtmlText), {}}
		return 0 # Not further 
		parse2csv for washedHtmlText against csvTable by curLine
		
		# convert csvTable to text - for this to be easy, it would be smart to use text fields for starters.
		# just adding return as line-endings!
		
		return csvTable
		
	end parse
	
	
	to ismember(theTag, theSet)
		# flattens the list into one long thing, good to search within
		# I believe this will do the trick whatever the nesting.
		if theSet contains theTag then
			return true
		else
			return false
		end if
	end ismember
	
	to itsSet(theTag)
		if ismember(theTag, singleLineTags) then
			return "singleLine"
		else if ismember(theTag, discardables) then
			return "discardables"
		else if ismember(theTag, edibles) then
			return "edibles"
		else if ismember(theTag, digestives) then
			return "digestives"
		else
			error number 3399
		end if
	end itsSet
	
	to initSetsOfTags()
		
		if singleLineTags is missing value then
			set singleLineTags to aFlatList of tlvl by _singleLineTags
		end if
		
		if discardables is missing value then
			set discardables to aFlatList of tlvl by _discardables
		end if
		
		if edibles is missing value then
			set edibles to aFlatList of tlvl by _edibles
		end if
		
		if digestives is missing value then
			set digestives to aFlatList of tlvl by _digestives
		end if
		
		set __tagDictsInited to true
		
	end initSetsOfTags
	
end script

############
tell html2csv to run
############

on writeToFileAsUtf8(fname, stuff)
	local fref, e, n, fsz
	set fref to open for access fname with write permission
	try
		write stuff to fref as «class utf8» starting at 0
		set fsz to get eof fref
		close access fref
		return fsz
	on error e number n
		try
			close access fref
		end try
		error "writeToFileAsUtf8 " & e number n
	end try
end writeToFileAsUtf8


to readFileAsUtf8(alisForTheFile)
	local fcontents, ftr, e, n
	try
		set ftr to (open for access (alisForTheFile))
		# here we steal what we can from Yvan Koenig, when it comes to UTIs and CSV.
	on error e number n
		error "readFileAsUtf8 " & e number n
	end try
	
	try
		set fcontents to (read ftr as «class utf8»)
	on error e number n
		close access ftr
		error "readFileAsUtf8 " & e number n
	end try
	
	--close the file
	try
		close access ftr
	on error e number n
		close access ftr
		error "readFileAsUtf8 " & e number n
	end try
	return fcontents
end readFileAsUtf8

to aFlatList by nestedList
	local tids, flatlist
	set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, return}
	set flatlist to text items of (nestedList as text)
	set AppleScript's text item delimiters to tids
	return flatlist
end aFlatList

on indexOfItem(theItem, itemsList) -- credit to Emmanuel Levy 
	local rs
	set text item delimiters to return
	set itemsList to return & itemsList & return
	set text item delimiters to {""}
	try
		set rs to -1 + (count (paragraphs of (text 1 thru (offset of (return & theItem & return) in itemsList) of itemsList)))
	on error
		return 0
	end try
	rs
end indexOfItem

on getSingelton(the_list, item_a)
	set astid to AppleScript's text item delimiters
	-- Nigel Garvey's with a name change
	set AppleScript's text item delimiters to return
	set the_list_as_string to return & the_list & return
	set AppleScript's text item delimiters to return & item_a & return
	if (the_list_as_string contains result) then
		set p to (count paragraphs of text item 1 of the_list_as_string)
		if (p is 0) then set p to 1 -- Catch modern paragraph count for empty text.
		set p to p mod 2
		try
			set otherItem to paragraph (p * 2 - 1) of text item (p + 1) of the_list_as_string
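		on error
			set AppleScript's text item delimiters to astid # restore before bailing out
			return null
		end try
		set AppleScript's text item delimiters to astid
		
		return otherItem
	else
		set AppleScript's text item delimiters to astid # restore here as well
		return null
	end if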
end getSingelton

script Stack
	property parent : AppleScript
	property __stack : missing value
	on init()
		set my __stack to {}
		return me
	end init
	on Push(athing)
		set my __stack to {athing} & my __stack
	end Push
	
	on Pop()
		local athing
		try
			set athing to first item of my __stack
		on error
			return {}
		end try
		set my __stack to rest of my __stack
		return athing
	end Pop
	
	on Peek() # For debugging
		local athing
		try
			set athing to item 1 of my __stack
		on error
			return {}
		end try
		return athing
	end Peek
	
	to makeStack()
		script Stack
			property parent : tlvl's Stack # For getting at the parent
			property __stack : missing value # For a unique stack!
		end script
		return Stack's init() # My instance.
	end makeStack
end script
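
A minimal sketch of how a stack like the one above might be used, for instance to track open tags while parsing. This is a stand-alone variant, not the posted script: it assumes a plain makeStack() handler that builds a fresh script object per call, so it does not depend on tlvl, and the tag strings are made-up sample data.

```applescript
-- Hypothetical stand-alone variant of the Stack idea (an assumption, not McUsr's code).
-- Every call to makeStack() instantiates a new script object with its own list.
on makeStack()
	script aStack
		property __stack : {}
		on Push(athing)
			set my __stack to {athing} & my __stack
		end Push
		on Pop()
			-- Same convention as the original: an empty stack yields {}
			if my __stack is {} then return {}
			set athing to first item of my __stack
			set my __stack to rest of my __stack
			return athing
		end Pop
	end script
	return aStack
end makeStack

set s1 to makeStack()
set s2 to makeStack()
s1's Push("<div>")
s1's Push("<p>")
s2's Push("<span>")
-- Last in, first out; s2 is unaffected by what happens to s1.
{s1's Pop(), s1's Pop(), s1's Pop(), s2's Pop()}
-- result: {"<p>", "<div>", {}, "<span>"}
```

Returning {} from Pop on an empty stack mirrors the original handlers, so a caller has to compare against {} rather than rely on an error.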

Yvan_Koenig · September 24, 2012, 7:56pm

Hello McUsr

I ran the script to treat the three sample files which pendolo sent to me.

The result is a single cell containing a zero.

Could you send me an e-mail at

koenig yvan sfr fr ?

Yvan KOENIG (VALLAURIS, France) Monday, 24 September 2012 21:56:22

McUsr · September 24, 2012, 11:19pm

Hello.

It isn’t supposed to deliver any output at this point, as the post explains. It only strips the HTML of the tags that aren’t in use. The actual parsing of the HTML and its transformation into CSV comes in the next post.

McUsr · September 26, 2012, 2:42pm

Hello!

I have something that actually parses the page in question flawlessly, but it spits out too much gibberish at the moment.

The problem is that I made a couple of “smart moves” in order to keep the levels of recursion down.

I am fed up with the whole thing and need a couple of days’ break from it. Here is the handler that just parses the HTML into columns for the moment. You can’t just drop it into the code above, as some other things have been changed as well.


to parse2csv for cleanedHtml against aCsvTable by aCurline
	local rowtext, currentTagData, currentTag, thetaghead
	
	repeat
		set currentTagData to dataForNextTag by cleanedHtml
		set currentTag to theTag of currentTagData
		if currentTag is null then return "" # We're done!
		if (offset of "/" in currentTag) is 2 then # it is a tagTail.
			# nothing to add : we just remove the text and return
			
			if length of cleanedHtml > (theTagEnd of currentTagData) then # at least one character follows the tag
				return (text ((theTagEnd of currentTagData) + 1) thru -1 of cleanedHtml)
			else
				return ""
			end if
		else
			set thetaghead to item 1 of nextTagHead(1, currentTag) # This is the CURRENT TAG HEAD
			
			# get the record with the attributes from the tag
			global attrRec
			set attrRec to extractInfo for currentTag
			
			# needs to find the tag head, and classify it
			
			if ismember(thetaghead, digestives) then
				local curlineCopy
				copy aCurline to curlineCopy
				# adds columns to the current line		
				set curlineCopy to addAtrribs for curlineCopy by attrRec
				
				if thetaghead is "<img;" then
					# get the alt attribute
					if altT of attrRec is not "\"\"" then set curlineCopy to curlineCopy & ";" & altT of attrRec
				else if thetaghead is "<a;" then
					# get the href attribute
					if hrefT of attrRec is not "\"\"" then set curlineCopy to curlineCopy & ";" & hrefT of attrRec
				end if
				# find the next tag -Lookahead
				local nextTagData, nextTag
				set nextTagData to dataForNextTag by (text ((theTagEnd of currentTagData) + 1) thru -1 of cleanedHtml)
				set nextTag to theTag of nextTagData
				
				# check for a missing next tag before its offsets are used below
				if nextTag = null then error "Malformed unbalanced html" number 3398
				
				local sofs, eofs
				# positions relative to the current chunk of text, used to return what is between two tags (digestive)
				set sofs to ((theTagEnd of currentTagData) + 1) # first character after the current tag
				set eofs to sofs + (theTagStart of nextTagData) - 2 # last character before the next tag
				
				if thetaghead is "<img;" then
					# It is a singular tag, we can process it right away!
					
					
					if ((offset of "/" in nextTag) is 2) then
						copy curlineCopy to end of aCsvTable
						set cleanedHtml to text sofs thru -1 of cleanedHtml
					else
						# because we are about to return 
						return text sofs thru -1 of cleanedHtml
					end if
				else if ((offset of "/" in nextTag) is 2) then # it is a tagTail.
					
					# we'll extract stuff, between tag start and tag end. and add it to a line
					local newColumn
					if sofs > eofs then
						set newColumn to "" # empty: the surrounding quotes are added below
					else
						set newColumn to text sofs thru eofs of cleanedHtml
					end if
					# add the column to the line (the local copy); the text up to and
					# including the tail tag is consumed further down
					
					set curlineCopy to curlineCopy & ";\"" & newColumn & "\""
					# ** append the line to the table
					copy curlineCopy to end of aCsvTable
					
					
					set cleanedHtml to text (eofs + (length of nextTag) + 1) thru -1 of cleanedHtml
				else
					# call self with the text that is left, not touching the next tag
					local newColumn
					if sofs <= eofs then
						set newColumn to text sofs thru eofs of cleanedHtml
						set curlineCopy to curlineCopy & ";\"" & newColumn & "\""
					end if
					# the recursive call consumes the nested tag and returns what is left
					set cleanedHtml to parse2csv for (text sofs thru -1 of cleanedHtml) against aCsvTable by curlineCopy
				end if
				
			else # tag is member of edibles
				# stuff  the attributes into columns
				# adds columns to the current line
				set aCurline to addAtrribs for aCurline by attrRec
				
				# call self with whats left 
				set cleanedHtml to parse2csv for (text ((theTagEnd of currentTagData) + 1) thru -1 of cleanedHtml) against aCsvTable by aCurline
				
			end if
		end if
	end repeat
	
end parse2csv
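
The handler above builds each row by appending `;"value"` to the current line. A minimal sketch of that single step as a separate helper, assuming semicolon-delimited output as in the handler. The name appendColumn and the doubling of embedded quotes (the usual CSV convention) are assumptions added for illustration, not part of the posted script.

```applescript
-- Hypothetical helper (not in the original script): quotes one field and
-- appends it to a semicolon-delimited line.
on appendColumn(aLine, aColumn)
	local tids, parts, escaped
	-- Split on double quotes, then rejoin with doubled quotes to escape them.
	set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "\""}
	set parts to text items of aColumn
	set AppleScript's text item delimiters to "\"\""
	set escaped to parts as text
	-- Always restore the delimiters the caller had.
	set AppleScript's text item delimiters to tids
	return aLine & ";\"" & escaped & "\""
end appendColumn

appendColumn("\"index.html\"", "He said \"hi\"")
-- result as plain text: "index.html";"He said ""hi"""
```

Without the escaping, a double quote inside a paragraph's text would end the field early when the file is opened in a spreadsheet.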