Using grep & regex for pattern matching in CSV lines

Shane_Stanley · September 20, 2016, 12:04am

But I fear do shell script lets it down. For kicks I made a 5000 line sample containing 20 matches. I compared your script with this:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set searchString to "Pka 155"
set theText to current application's NSString's stringWithContentsOfFile:(POSIX path of (path to desktop folder) & "Sample.txt") encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:("^(?:[^\\t]+)\\t(?:[^\\t]+)\\t(" & searchString & ")\\t.*$") options:(current application's NSRegularExpressionAnchorsMatchLines) |error|:(missing value)
set theFinds to theRegex's matchesInString:theText options:0 range:{0, theText's |length|()}
set theResults to current application's NSMutableArray's array()
repeat with aFind in theFinds
	(theResults's addObject:(theText's substringWithRange:(aFind's range())))
end repeat
(theResults's componentsJoinedByString:linefeed) as text

Script Geek suggests the shell version takes about 6 times longer.

Of course I’m cherry-picking; as the number of matches goes up, the advantage tips the other way – dramatically. But for a modest number of finds the ASObjC method is considerably quicker than the overhead of any call to do shell script, regardless of how fast the actual shell code is.

(and yes, this too is failing to get the line numbers.)

Nigel_Garvey · September 20, 2016, 7:19am

on getMatchingLines(sourceFile, searchString, delimiter, searchColumn)
	script o
		property lineList : paragraphs of (read sourceFile as «class utf8»)
	end script
	
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delimiter
	repeat with i from 2 to (count o's lineList)
		set thisLine to item i of o's lineList
		try
			if (text item searchColumn of thisLine is searchString) then set item i of o's lineList to {i - 1, thisLine}
		end try
	end repeat
	set AppleScript's text item delimiters to astid
	
	return o's lineList's lists
end getMatchingLines

set sourceFile to ((path to desktop as text) & "Test.csv") as alias
set searchString to "Pk 155"
set delimiter to tab
set searchColumn to 3

getMatchingLines(sourceFile, searchString, delimiter, searchColumn)

Shane_Stanley · September 20, 2016, 7:29am

Nigel,

How quaint And fast…

Nigel_Garvey · September 20, 2016, 8:02am

The delimiter parameter’s just a bit of dressing. Like all the solutions, the script’s not suitable for a general CSV use as is doesn’t take quoted fields into account.

DJ_Bazzie_Wazzie · September 20, 2016, 8:55am

Yes, that the whole problem of shell script (firing an AppleEvent and forking a new process and starting up bash in sh mode). The overhead of a do shell script is huge I agree but I was thinking more like tens of thousands of lines and thousands of matches. At least it’s the kind of of CSV data we handle on a daily basis (300k is not uncommon). Still “more than 5000 lines” can be 5001 till infinite.

I’m sorry, I missed that part where the OP wanted that. The option -n prints the line line number as well with grep it’s no biggie ;). I updated the code above.

I can help you if you want because for search and replace regex I like awk and use it a lot.

Actually to be correct this is TSV data. Unlike CSV data in TSV data it is not allowed to have a field delimiter in the values, so there is no quoting or escaping involved in TSV data. Otherwise AppleScript Toolbox is a great tool to read CSV data quoted and unquoted fields right into an 2d list

ccstone · September 20, 2016, 3:08pm

Hey DJ,

So it does. I’d forgotten that option.

I’d be delighted if you would. I need to expand my vocabulary.

-Chris

DJ_Bazzie_Wazzie · September 20, 2016, 11:10pm

Hello Chris,

Here is a nice AWK-ward article by Apple explaining a bit more than the man page in the terminal. While most Linux and Unix users uses gawk which has much more function and environment variables, Apple’s Kernighan implementation has less.

The beauty of AWK is that there is a constructor and destructor function, user defined function support and multiple expressions for each line which will be all executed, match or no match. I saw that you already used the right next instruction which will be powerful in our situation.

The most verbose form of the script:
[format]# constructor
BEGIN {
# set field separator to tab
FS=" "
}

expression: If line number is 1.

NR == 1 {
# The ‘next’ instruction will be invoked immediately.
# The remaining expression(s) won’t be executed after next.
next
}

expression: If 3rd field matches regex.

$3 ~ /^Pk 155$/ {
# print the line number and line in the same format as (e)grep -n
print NR":"$0
}[/format]

compressed:
[format]BEGIN {FS=" "} NR == 1 {next} $3 ~ /^Pk 155$/ {print NR":"$0}[/format]

For an exact field match (hence the regex start with caret, end with dollar and literal string in between) we can perform an string comparison. So an exact string match we can use the == operator:
[format]BEGIN {FS=" "} NR == 1 {next} $3 == "Pk 155" {print NR":"$0}[/format]

There is no support for case insensitive regex or string match in awk but there is the function tolower() to make a case insensitive match. Just make sure that the string you match against is lower case as well.
[format]BEGIN {FS=" "} NR == 1 {next} tolower($3) == "pk 155" {print NR":"$0}[/format]

Last but not least, in AppleScript with using a regex case insensitive and assuming the test.csv file is located on the desktop as in my own previous example the script would look like:

set awkScript to "BEGIN {FS=\"	\"} NR == 1 {next} tolower($3) ~ /^pk 155$/ {print NR\":\"$0}"
set myFile to quoted form of POSIX path of (path to desktop folder) & "test.csv"
do shell script "awk " & quoted form of awkScript & space & myFile

ccstone · September 22, 2016, 2:57pm

Hey DJ,

Thanks for the lesson.

I get all of it but need to work through it until it becomes part of my functional vocabulary.

I have gawk on the machine but like to use the stock awk for portability.

I haven’t looked at the Shell-Scripting-Primer in a good while, so thanks for reminding me of “How AWK-ward”.

-Chris

RAE3000 · September 22, 2016, 6:21pm

Wow this thread really grew. Guess I have to do a little read up session.
This forum is great place! I am earning a lot from you all.