CSV > sqlite3; clean up values before entering

sam452 · March 11, 2010, 11:43pm

I’m excited about using sqlite3 to solve some problems. I want to make a droplet that will take a CSV file and create a database.

I’ve used these handlers that were kindly uploaded and where I need some advice.

http://macscripter.net/viewtopic.php?id=27928

Here’s a representative string of a row I want to insert into a table:

{“caiv2,2010BBDGRC,"President, Board of Directors"”}

I’d like to end up with (‘caiv2’,‘2010BBDGRC’,"President, Board of Directors") to insert.

You can see that I need help with how to handle the escaped field value.

But before that, I need to get the values ready to work. The handlers seem to work great except for the values. I can say that I’m able to create the db, tables and columns, but I’m breaking on the values. Reading the comments, this ought to work but it’s breaking as below:


set a to {"caiv2,2010BBDGRC,\"President, Board of Directors\""}
set b to my tidStuff(quote, a) --> {"caiv2,2010BBDGRC,\"President, Board of Directors\""} Doesn't help me.
set c to my returnCommaSepQuotedString(a, quote) --> "\"caiv2,2010BBDGRC,\"\"President, Board of Directors\"\"\"" --as you can see it doesn't break up into single quotes
set d to my returnCommaSepQuotedStringEntities(a) --> "\"\"\"caiv2,2010BBDGRC,\"\"\"\"President, Board of Directors\"\"\"\"\"\"\"" --This was just to try it.

return c

-----

-- TURN STRING INTO LIST
on tidStuff(paramHere, textHere)
	set OLDtid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to paramHere
	set theItems to text items of textHere
	set AppleScript's text item delimiters to OLDtid
	return theItems
end tidStuff

--for colums
on returnCommaSepQuotedStringEntities(the_array) -- use for column names (eg in create_db), table names, not values
	set quoteChar to quote -- AppleScript's reserved word for a double quote: "
	return returnCommaSepQuotedString(the_array, quoteChar)
end returnCommaSepQuotedStringEntities

on returnCommaSepQuotedStringValues(the_array) -- use for values in rows (eg in sql_insert), not for column names
	set quoteChar to "'" -- single quote
	returnCommaSepQuotedString(the_array, quoteChar)
end returnCommaSepQuotedStringValues

on returnCommaSepQuotedString(the_array, quoteChar)
	set oldDelimiters to AppleScript's text item delimiters
	-- Find any text values, quote them and escape any quote characters with quotes:
	repeat with item_ref in the_array
		set item_value to contents of item_ref
		if class of item_value is text then
			if item_value contains quoteChar then
				set AppleScript's text item delimiters to quoteChar
				set parsedList to text items in item_value
				set AppleScript's text item delimiters to (quoteChar & quoteChar)
				set item_value to parsedList as text
			end if
			set item_value to (quoteChar & item_value & quoteChar) as text
			set contents of item_ref to item_value
		end if
	end repeat
	-- Join all the values together as a string, separated by commas:
	set AppleScript's text item delimiters to ", "
	set return_string to the_array as text
	set AppleScript's text item delimiters to oldDelimiters
	return return_string
end returnCommaSepQuotedString

The handler’s see the one string as one field so it’s breaking down.

How do I use it, or something like to break down the one text string into values, while keeping the escaped value because I really need that comma as part of the customer’s job title.

What am I overlooking? I’ve beaten myself up trying to see past this, thanx, sam

Craig_Williams · March 12, 2010, 2:11am

You may find something useful here

sam452 · March 12, 2010, 7:31pm

Thanks, Craig, but this was helpful, but not for this problem.

If I have a list of CSV as in the above, how would I extract out the actual values without using comma as a text item delimiter?

I cannot use a comma because it would rip this value:

"President, Board of Directors"

How should I parse out:

{“caiv2,2010BBDGRC,"President, Board of Directors"”}

to get:

{‘caiv2’,‘2010BBDGRC’,"President, Board of Directors"}

thanx, sam

Craig_Williams · March 13, 2010, 7:34pm

Get all of the quoted elements first and then get the comma-sep elements then put them back together.

Adam_Bell · March 13, 2010, 8:50pm

If you’re not to fussy about retaining commas in the third element:

set tStart to {"caiv2,2010BBDGRC,\"President, Board of Directors\""}
-- note that this is a list of one item; a string.
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to ","
set tParts to text items of (tStart as text)
set AppleScript's text item delimiters to tid
-- assuming that items 1 and 2 never have commas
-- and that you don't mind losing the comma in the
-- title:
tell tParts
	set CSV_1 to item 1 --> "caiv2"
	set CSV_2 to item 2 --> "2010BBDGRC"
	set CSV_3 to items 3 thru -1 as text --> "\"President Board of Directors\""
end tell

To get fancier than that, you’d have to know more about that third part.

Nigel_Garvey · March 13, 2010, 9:58pm

I wrote a CSV-to-list converter a few years ago. I don’t know if it’s what you want.

(* Assumes that the CSV text follows the RFC 4180 convention:
	Records are delimited by CRLF line breaks (but LFs or CRs are OK too with this script).
	The last record in the text may or may not be followed by a line break.
	Fields in the same record are separated by commas (but a different separator can be specified here with an optional parameter).
	The last field in a record is NOT followed by a field separator. Each record has (number of separators + 1) fields, even when these are empty.
	All the records should have the same number of fields (but this script just renders what it finds).
	Any field value may be enquoted with double-quotes and should be if the value contains line breaks, separator characters, or double-quotes.
	Double-quote pairs within quoted fields represent escaped double-quotes.
	Trailing or leading spaces in unquoted fields are part of the field values (but trimming can specified here with an optional parameter).
	By implication, spaces (or anything else!) outside the quotes of quoted fields are not allowed.
		
	No other variations are currently supported. *)

on csvToList(csvText, implementation)
	-- The 'implementation' parameter is a record with optional properties specifying the field separator character and/or trimming state. The defaults are: {separator:",", trimming:false}.
	set {separator:separator, trimming:trimming} to (implementation & {separator:",", trimming:false})
	
	script o -- For fast list access.
		property textBlocks : missing value -- For the double-quote-delimited text item(s) of the CSV text.
		property possibleFields : missing value -- For the separator-delimited text items of a non-quoted block.
		property subpossibilities : missing value -- For the paragraphs of any non-quoted field candidate actually covering multiple records. (Single-column CSVs only.)
		property fieldsOfCurrentRecord : {} -- For the fields of the CSV record currently being processed.
		property finalResult : {} -- For the final list-of-lists result.
	end script
	
	set astid to AppleScript's text item delimiters
	
	considering case
		set AppleScript's text item delimiters to quote
		set o's textBlocks to csvText's text items
		-- o's textBlocks is a list of the CSV text's text items after delimitation with the double-quote character.
		-- Assuming the convention described at top of this script, the number of blocks is always odd.
		-- Even-numbered blocks, if any, are the unquoted contents of quoted fields (or parts thereof) and don't need parsing.
		-- Odd-numbered blocks are everything else. Empty strings in odd-numbered slots (except at the beginning and end) are due to escaped double-quotes in quoted fields.
		
		set blockCount to (count o's textBlocks)
		set escapedQuoteFound to false
		-- Parse the odd-numbered blocks only.
		repeat with i from 1 to blockCount by 2
			set thisBlock to item i of o's textBlocks
			if (((count thisBlock) > 0) or (i is blockCount)) then
				-- Either this block is not "" or it's the last item in the list, so it's not due to an escaped double-quote. Add the quoted field just skipped (if any) to the field list for the current record.
				if (escapedQuoteFound) then
					-- The quoted field contained escaped double-quote(s) (now unescaped) and is spread over three or more blocks. Join the blocks, add the result to the current field list, and cancel the escapedQuoteFound flag. 
					set AppleScript's text item delimiters to ""
					set end of o's fieldsOfCurrentRecord to (items quotedFieldStart thru (i - 1) of o's textBlocks) as text
					set escapedQuoteFound to false
				else if (i > 1) then -- (if this isn't the first block)
					-- The preceding even-numbered block is an entire quoted field. Add it to the current field list as is.
					set end of o's fieldsOfCurrentRecord to item (i - 1) of o's textBlocks
				end if
				
				-- Now parse the current block's separator-delimited text items, which are either complete non-quoted fields, stubs from the removal of quoted fields, or still-joined fields from adjacent records.
				set AppleScript's text item delimiters to separator
				set o's possibleFields to thisBlock's text items
				set possibleFieldCount to (count o's possibleFields)
				repeat with j from 1 to possibleFieldCount
					set thisPossibleField to item j of o's possibleFields
					set c to (count thisPossibleField each paragraph)
					if (c < 2) then
						-- This possible field doesn't contain a line break. If it's not the stub of a preceding or following quoted field, add it (trimmed if trimming) to the current field list.
						-- It's not a stub if it's an inner candidate from the block, the last candidate from the last block, the first candidate from the first block, or it contains non-white characters.
						if (((j > 1) and ((j < possibleFieldCount) or (i is blockCount))) or ((j is 1) and (i is 1)) or (notBlank(thisPossibleField))) then set end of o's fieldsOfCurrentRecord to trim(thisPossibleField, trimming)
					else if (c is 2) then -- Special-cased for efficiency.
						-- This possible field contains a line break, so it's really two possible fields from consecutive records. Split it.
						set subpossibility1 to paragraph 1 of thisPossibleField
						set subpossibility2 to paragraph 2 of thisPossibleField
						-- If the first subpossibility's not just the stub of a preceding quoted field, add it to the field list for the current record.
						if ((j > 1) or (i is 1) or (notBlank(subpossibility1))) then set end of o's fieldsOfCurrentRecord to trim(subpossibility1, trimming)
						-- Add the now-complete field list to the final result list and start one for a new record.
						set end of o's finalResult to o's fieldsOfCurrentRecord
						set o's fieldsOfCurrentRecord to {}
						-- If the second subpossibility's not the stub of a following quoted field, add it to the new list.
						if ((j < possibleFieldCount) or (notBlank(subpossibility2))) then set end of o's fieldsOfCurrentRecord to trim(subpossibility2, trimming)
					else
						-- This possible field contains more than one line break, so it's three or more possible fields from consecutive single-field records. Split it.
						set o's subpossibilities to thisPossibleField's paragraphs
						-- With each subpossibility except the last, complete the field list for the current record and initialise another. Omit the first subpossibility if it's just the stub of a preceding quoted field.
						repeat with k from 1 to c - 1
							set thisSubpossibility to item k of o's subpossibilities
							if ((k > 1) or (j > 1) or (i is 1) or (notBlank(thisSubpossibility))) then set end of o's fieldsOfCurrentRecord to trim(thisSubpossibility, trimming)
							set end of o's finalResult to o's fieldsOfCurrentRecord
							set o's fieldsOfCurrentRecord to {}
						end repeat
						-- With the last subpossibility, just add it to the new field list (if it's not the stub of a following quoted field).
						set thisSubpossibility to end of o's subpossibilities
						if ((j < possibleFieldCount) or (notBlank(thisSubpossibility))) then set end of o's fieldsOfCurrentRecord to trim(thisSubpossibility, trimming)
					end if
				end repeat
				
				-- Otherwise, the current block's an empty text item due to either an escaped double-quote in a quoted field or the opening quote of a quoted field at the very beginning of the CSV text.
			else if (escapedQuoteFound) then
				-- It's another escaped double-quote in a quoted field already flagged as containing one. Just replace the empty text with a literal double-quote.
				set item i of o's textBlocks to quote
			else if (i > 1) then -- (if this isn't the first block)
				-- It's the first escaped double-quote in a quoted field. Replace the empty text with a literal double-quote, note the index of the preceding even-numbered block (the first part of the field), and flag the find.
				set item i of o's textBlocks to quote
				set quotedFieldStart to i - 1
				set escapedQuoteFound to true
			end if
		end repeat
	end considering
	
	set AppleScript's text item delimiters to astid
	
	-- Add the remaining field list to the output if it's not empty or if the output list itself has remained empty.
	if ((o's fieldsOfCurrentRecord is not {}) or (o's finalResult is {})) then set end of o's finalResult to o's fieldsOfCurrentRecord
	
	return o's finalResult
end csvToList

-- Test whether or not a string contains any non-white characters.
on notBlank(txt)
	ignoring white space
		return (txt > "")
	end ignoring
end notBlank

-- Trim any leading or trailing spaces from a string.
on trim(txt, trimming)
	if (trimming) then
		set c to (count txt)
		repeat while ((txt begins with space) and (c > 1))
			set txt to text 2 thru -1 of txt
			set c to c - 1
		end repeat
		repeat while ((txt ends with space) and (c > 1))
			set txt to text 1 thru -2 of txt
			set c to c - 1
		end repeat
		if (txt is space) then set txt to ""
	end if
	
	return txt
end trim

-- Demos:
set csvText to "caiv2 , 2010BBDGRC,\"\"\"President\"\", \"\"Board of Directors\"\"\"" & linefeed & ",\"\"," & linefeed & " , , " & linefeed & "Another line, for demo purposes , " & linefeed & ",," & linefeed
csvToList(csvText, {})
--> {{"caiv2 ", " 2010BBDGRC", "\"President\", \"Board of Directors\""}, {"", "", ""}, {" ", " ", " "}, {"Another line", " for demo purposes ", " "}, {"", "", ""}}
csvToList(csvText, {trimming:true})
--> {{"caiv2", "2010BBDGRC", "\"President\", \"Board of Directors\""}, {"", "", ""}, {"", "", ""}, {"Another line", "for demo purposes", ""}, {"", "", ""}}

set csvText to "caiv2 ; 2010BBDGRC;\"\"\"President\"\"; \"\"Board of Directors\"\"\"" & linefeed & ";\"\";" & linefeed & " ; ; " & linefeed & "Another line; for demo purposes ; " & linefeed & ";;" & linefeed
csvToList(csvText, {separator:";"})
--> {{"caiv2 ", " 2010BBDGRC", "\"President\"; \"Board of Directors\""}, {"", "", ""}, {" ", " ", " "}, {"Another line", " for demo purposes ", " "}, {"", "", ""}}
csvToList(csvText, {separator:";", trimming:true})
--> {{"caiv2", "2010BBDGRC", "\"President\"; \"Board of Directors\""}, {"", "", ""}, {"", "", ""}, {"Another line", "for demo purposes", ""}, {"", "", ""}}

Edit: Script cosmetically overhauled and made very slightly faster. Tested and found to work in Tiger, Leopard, El Capitan, and High Sierra. (For some reason, it’s faster in Leopard on my 2 MHz G5 than in El Capitan on my 2.53 MHz MacBook Pro!)

sam452 · March 15, 2010, 3:45pm

Yes, yes, thank you for seeing what I was trying to communicate. This is an excellent gift to the community. I’ve learned a big deal going through it.

wdockery · December 3, 2016, 8:33pm

Hello, I have gotten good use out of your handler.

I am now pasting it into an applescript that uses a file lister and picker handler that requires:

use framework "Foundation"
use scripting additions

However, if I include these 2 lines in my script, I get an error when compiling:

And the error highlights the word “string” in the last line of the code pasted below:

considering case
		repeat with i from 1 to qdtiCount by 2 -- Parse odd-numbered items only.
			set thisBit to item i of o's qdti
			if ((count thisBit) > 0) or (i is qdtiCount) then
				-- This is either a non-empty string or the last item in the list, so it doesn't
				-- represent a quoted quote. Check if we've just been dealing with any.
				if (quoteInProgress) then
					-- All the parts of a quoted field containing quoted quotes have now been
					-- passed over. Coerce them together using a quote delimiter.
					set AppleScript's text item delimiters to "\""
					set thisField to (items a thru (i - 1) of o's qdti) as string

Do you know what the problem might be?

Thanks–

Nigel_Garvey · December 3, 2016, 10:47pm

No, in a word. I can’t reproduce the error.

Presumably it’s a clash with something else that’s been added to the script.

Shane_Stanley · December 4, 2016, 12:36am

What happens if you change “string” to “text”?

Yvan_Koenig · December 4, 2016, 9:54am

May be a good alternative to use Shane STANLEY’s BridgePlus.

use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use script "BridgePlus" # free from : https://www.macosxautomation.com/applescript/apps/BridgePlus.html
load framework

# use the comma as separator
set aString to "1,2,3" & linefeed & "4,\"5,00\",6" & linefeed & "7,8,9" & linefeed & "caiv2,2010BBDGRC,\"President, Board of Directors\""
set theResult to current application's SMSForder's arrayFromCSV:aString commaIs:","
set oops to ASify from theResult # if OS X 10.10
-->	{{"1", "2", "3"}, {"4", "5,00", "6"}, {"7", "8", "9"}, {"caiv2", "2010BBDGRC", "President, Board of Directors"}}
theResult as list # if OS X 10.11 or later
-->	{{"1", "2", "3"}, {"4", "5,00", "6"}, {"7", "8", "9"}, {"caiv2", "2010BBDGRC", "President, Board of Directors"}}

# use semi colon as separator
set aString to "1;2;3" & linefeed & "4;\"5;00\";6" & linefeed & "7;8;9" & linefeed & "caiv2;2010BBDGRC;\"President; Board of Directors\""
set theResult to current application's SMSForder's arrayFromCSV:aString commaIs:";"
ASify from theResult # if OS X 10.10
-->	{{"1", "2", "3"}, {"4", "5;00", "6"}, {"7", "8", "9"}, {"caiv2", "2010BBDGRC", "President; Board of Directors"}}
theResult as list # if OS X 10.11 or later
-->	{{"1", "2", "3"}, {"4", "5;00", "6"}, {"7", "8", "9"}, {"caiv2", "2010BBDGRC", "President; Board of Directors"}}