Using grep & regex for pattern matching in CSV lines

For the record: HTML doesn’t preserve tabs, so they appear as spaces in the presentation. Just click the “Open this Scriptlet in your Editor” link so the tabs are preserved.

Hey Folks,

The following awk script will return each found line preceded by its line number and a tab:

[format]5 2016 5671 Pk 180 pk 155 This is 1 Example 1,3 16,5:18
[/format]

Do NOT use copy to capture the scripts, because the literal tabs will NOT be copied – use the “Open this Scriptlet in your Editor” link.

I’m using DJ’s 5-line test text, saved to a file, for testing.

(Change the file path to work properly with your test file.)

Basic shell script:


FILE=~/Downloads/test.txt
awk '/^[^	]+	[^	]+	+Pk 180/{ print NR "	" $0}' "$FILE"

NOTE: The findPattern variable is really the literal text to be found in column 3.
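The literal tabs in the pattern are exactly what gets lost in copying. awk also accepts \t escapes inside a regex, so a copy-safe equivalent can be sketched like this (the /tmp path and the two-line sample are made up for illustration):

```shell
# Build a small tab-separated sample (made-up data, not DJ's actual file).
printf '2015\tAAA\tPk 100\tfoo\n2016\tBBB\tPk 180\tbar\n' > /tmp/test.txt

# Same idea as the script above, but with \t escapes instead of literal tabs.
awk '/^[^\t]+\t[^\t]+\t+Pk 180/ { print NR "\t" $0 }' /tmp/test.txt
```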

Shell script with find-pattern and file separated out:


findPattern='Pk 180';
FILE=~/Downloads/test.txt;
shCMD="awk '/^[^	]+	[^	]+	+""$findPattern""/{ print NR \"	\" "'$0'"}' \"$FILE\""
eval "$shCMD"
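As an alternative sketch to the eval construction, awk’s -v option can pass the pattern in as a variable, which avoids the nested quoting altogether (the sample data and /tmp path are made up):

```shell
findPattern='Pk 180'
printf '2015\tAAA\tPk 100\tfoo\n2016\tBBB\tPk 180\tbar\n' > /tmp/test.txt

# pat arrives as an awk variable; the "^[^\t]+\t[^\t]+\t+" prefix anchors it
# to column 3, and $0 ~ matches the dynamically built regex.
awk -v pat="$findPattern" '$0 ~ "^[^\t]+\t[^\t]+\t+" pat { print NR "\t" $0 }' /tmp/test.txt
```

The same trick drops straight into a do shell script string, since only the variable’s value ever needs quoting.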

Shell script with find-pattern and file separated out converted to AppleScript:


set shCMD to "
findPattern='Pk 180';
FILE=~/Downloads/test.txt;
shCMD=\"awk '/^[^	]+	[^	]+	+\"\"$findPattern\"\"/{ print NR \\\"	\\\" \"'$0'\"}' \\\"$FILE\\\"\"
eval \"$shCMD\"
"
set theResult to do shell script shCMD


Chris


{ MacBookPro6,1 · 2.66 GHz Intel Core i7 · 8GB RAM · OSX 10.11.6 }
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

I had some problems with opening the Scriptlets from Firefox. Switching to Safari fixed this issue.

Now I could check all the scripts. Thanks a lot; they all work great.

But it seems that when I read in my file the way I do, the tabs go missing (I think) and nothing is found any more.

My code for reading in the file:


set theCSV to "/Users/RAE/Desktop/TEST.csv"

set f to read theCSV 
--skip first line with headers
set all_records to text (paragraph 2) thru -1 of f

In the script from ccstone it works fine with the file, but my way of reading the file’s contents for use with DJ’s and NG’s code doesn’t seem to work with my CSV file.

It’s hard to tell without seeing your actual file. With the ‘read’ command, you have to know the format of the text in the file – e.g. ASCII, UTF-8, UTF-16 – in order for it to be interpreted correctly. Some shell programs only recognise linefeeds as line separators.
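For what it’s worth, two quick Terminal checks usually settle the encoding and line-ending questions before any script runs (the path is the one from the snippet above; substitute your own):

```shell
f="$HOME/Desktop/TEST.csv"   # the OP's file; adjust to your own path

# 'file' guesses the encoding and the line terminators from the content.
file "$f"

# 'od -c' shows the raw bytes: \t for tabs, \n for linefeeds, and \r for
# the carriage returns that some exports use as line separators.
head -c 200 "$f" | od -c
```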

Given what you seem to want to do, it may be better to have the file read directly by the shell script. For instance, modifying my script:

set sourceFile to POSIX path of (choose file)
set searchString to "Pk 155"

-- The first 'sed' reads the file and cuts the first line (provided that the line delimiters are linefeeds). The 'grep' locates the required line(s) and prepends the line numbers. The second 'sed' splits each numbered line into number and line.
set searchResult to (do shell script "sed -n '2,$ p' " & quoted form of sourceFile & " | grep -n " & quoted form of searchString & " | sed 's/:/\\'$'\\n''/'")
if ((count searchResult) > 0) then
	set {lineNumber, wholeLine} to paragraphs of searchResult
else
	set {lineNumber, wholeLine} to {"", ""}
end if
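The embedded pipeline can be tried directly in a terminal. A sketch against a made-up three-line file (a header plus two records; the path and data are illustrative):

```shell
printf 'year\tcode\tname\n2015\tAAA\tPk 100\n2016\tBBB\tPk 155\n' > /tmp/test.csv

# sed drops the header, grep -n prepends "N:", and the final sed turns the
# first ":" into a linefeed so the number and the line end up on separate lines.
sed -n '2,$ p' /tmp/test.csv | grep -n 'Pk 155' | sed 's/:/\'$'\n''/'
```

Note that the reported number counts from the first data row, since the header was removed before grep saw the text.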

The file was exported from an OpenOffice Calc document.
Encoded as UTF-8.

@NG
Just tried your script with the file. Works like a charm.

DJ, Nigel, ccstone, thank you all so very much. This solved a big problem I had with the script.
And I am sure I will use your snippets in more scripts to come.

Hey RAE3000.

For some reason, I’d adapted the general-search script rather than the column-specific one. Here’s the third-column-only version:

set sourceFile to POSIX path of (choose file)
set searchString to "Pk 155"

set searchResult to (do shell script "sed -n '2,$ p' " & quoted form of sourceFile & " | egrep -n " & quoted form of ("^([^	]*	){2}" & searchString & tab) & " | sed 's/:/\\'$'\\n''/'")
if ((count searchResult) > 0) then
	set {lineNumber, wholeLine} to paragraphs of searchResult
else
	set {lineNumber, wholeLine} to {"", ""}
end if

Many apologies.

@NG
Ah, that’s why I had false positives. Thanks for fixing that.

My script just got way faster and also more flexible. :smiley:

Done. :slight_smile:

A simplified version of the AppleScript:


set findStr to "Pk 180"

set filePath to "~/Downloads/test.csv"
tell application "System Events" to set filePath to POSIX path of disk item filePath

# Read the data.
set fileContent to quoted form of (read filePath as «class utf8»)

# Remove the header line, then run the search.
set theResult to do shell script "<<<" & fileContent & " awk 'NR == 1 {next} {print}' | awk '/^[^	]+	[^	]+	+" & findStr & "/{ print NR \"\\n\" $0}'"

if theResult ≠ "" then
	set {lineNum, lineContent} to paragraphs of theResult
else
	set {lineNum, lineContent} to {"", ""}
end if

Thank you ccstone for the update.
Also learned a new way of reading in a file.

Here’s a more efficient way :slight_smile:


set searchString to "pk 155"
set myFile to quoted form of (POSIX path of (path to desktop folder) & "test.csv")
set pattern to quoted form of ("^[^	]+	[^	]+	(" & searchString & ")	.*$")
set matches to do shell script "tail -n +2 " & myFile & " | egrep -in " & pattern & " || true"
  • The script assumes that the file is located on your desktop with the name “test.csv”.
  • ‘tail -n +2’ prints every line of the file, starting from the second row.
  • egrep’s -i option makes the match case-insensitive.
  • At the end we append an OR operator (||) so that when egrep exits with an error (as it does when nothing matches), the true command runs instead; a non-match then returns an empty string rather than raising an error.
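A quick way to see those pieces working together, using a made-up header-plus-two-rows sample (illustrative data only; the full pattern above additionally anchors the match to column 3):

```shell
printf 'year\tcode\tname\n2015\tAAA\tpk 100\n2016\tBBB\tPK 155\n' > /tmp/test.csv

# tail skips the header; egrep -i matches regardless of case, -n numbers
# the lines, and '|| true' swallows the exit status 1 of a non-match.
tail -n +2 /tmp/test.csv | egrep -in 'pk 155' || true
```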

The advantage over my previous post is that the data is read and sent to egrep right away. This way we don’t need a large buffer to pass everything to bash; the tail command itself also uses no buffers and is very fast.

I prefer grep and egrep over the other regex commands because grep (search only) needs no buffer and is up to 10 times faster than the other tools (search and replace). The performance difference depends heavily on the expression itself, of course, but grep and egrep can process characters around 5 times faster. Between sed and awk (and Python) there is no really big difference in performance; it comes down to personal preference, because the awk language is based on C syntax while sed has its own syntax, based on ed. I don’t know how important performance is here, or how many more lines than 5000 the file actually contains, but the solution above is built for speed.

Hey DJ,

Yes, but you’re failing to get the line number as the OP wanted.

You’re also forgetting the OP complained that my original script didn’t let him read his data the way he wanted to. (I read the file directly into awk.)

My script should be one awk script instead of two, but my awk-fu isn’t quite there yet.

-Chris

But I fear do shell script lets it down. For kicks I made a 5000-line sample containing 20 matches. I compared your script with this:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set searchString to "Pka 155"
set theText to current application's NSString's stringWithContentsOfFile:(POSIX path of (path to desktop folder) & "Sample.txt") encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:("^(?:[^\\t]+)\\t(?:[^\\t]+)\\t(" & searchString & ")\\t.*$") options:(current application's NSRegularExpressionAnchorsMatchLines) |error|:(missing value)
set theFinds to theRegex's matchesInString:theText options:0 range:{0, theText's |length|()}
set theResults to current application's NSMutableArray's array()
repeat with aFind in theFinds
	(theResults's addObject:(theText's substringWithRange:(aFind's range())))
end repeat
(theResults's componentsJoinedByString:linefeed) as text

Script Geek suggests the shell version takes about 6 times longer.

Of course I’m cherry-picking; as the number of matches goes up, the advantage tips the other way – dramatically. But for a modest number of finds the ASObjC method is considerably quicker than the overhead of any call to do shell script, regardless of how fast the actual shell code is.

(and yes, this too is failing to get the line numbers.)

on getMatchingLines(sourceFile, searchString, delimiter, searchColumn)
	script o
		property lineList : paragraphs of (read sourceFile as «class utf8»)
	end script
	
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delimiter
	repeat with i from 2 to (count o's lineList)
		set thisLine to item i of o's lineList
		try
			if (text item searchColumn of thisLine is searchString) then set item i of o's lineList to {i - 1, thisLine}
		end try
	end repeat
	set AppleScript's text item delimiters to astid
	
	return o's lineList's lists
end getMatchingLines

set sourceFile to ((path to desktop as text) & "Test.csv") as alias
set searchString to "Pk 155"
set delimiter to tab
set searchColumn to 3

getMatchingLines(sourceFile, searchString, delimiter, searchColumn)

:slight_smile:

Nigel,

How quaint :wink: And fast…

:slight_smile:

The delimiter parameter’s just a bit of dressing. Like all the solutions here, the script’s not suitable for general CSV use, as it doesn’t take quoted fields into account.

Yes, that’s the whole problem with do shell script (firing an Apple event, forking a new process, and starting up bash in sh mode). The overhead of a do shell script is huge, I agree, but I was thinking more like tens of thousands of lines and thousands of matches. At least that’s the kind of CSV data we handle on a daily basis (300k lines is not uncommon). Still, “more than 5000 lines” can be anything from 5001 to infinity.

I’m sorry, I missed the part where the OP wanted that. The -n option makes grep print the line number as well, so it’s no biggie ;). I’ve updated the code above.

I can help you with that if you want, because I like awk for search-and-replace regex work and use it a lot.

Actually, to be correct, this is TSV data. Unlike CSV, TSV doesn’t allow the field delimiter to appear inside a value, so there is no quoting or escaping involved. Otherwise, AppleScript Toolbox is a great tool for reading CSV data – quoted and unquoted fields – right into a 2D list :wink:
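A tiny illustration of why that distinction matters for the delimiter-splitting used throughout this thread (made-up names):

```shell
# TSV: the delimiter can never occur inside a value, so splitting is safe.
printf 'Smith, John\t42\n' | awk -F '\t' '{ print $1 }'
# -> Smith, John

# CSV: a quoted field may contain the comma, and a naive split cuts it in half.
printf '"Smith, John",42\n' | awk -F ',' '{ print $1 }'
# -> "Smith
```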

Hey DJ,

So it does. I’d forgotten that option.

I’d be delighted if you would. I need to expand my vocabulary.

-Chris

Hello Chris,

Here is a nice “How AWK-ward” article by Apple that explains a bit more than the man page in Terminal. While most Linux and Unix users use gawk, which has many more functions and environment variables, Apple’s Kernighan implementation has fewer.

The beauty of AWK is that it has constructor and destructor blocks (BEGIN and END), user-defined function support, and multiple pattern-action rules, each of which is tested against every line, match or no match. I saw that you already used the right instruction, next, which will be powerful in our situation.

The most verbose form of the script:
[format]# constructor
BEGIN {
# set the field separator to a tab
FS="\t"
}

# expression: if the line number is 1
NR == 1 {
# The 'next' instruction is invoked immediately.
# The remaining expression(s) won't be executed after next.
next
}

# expression: if the 3rd field matches the regex
$3 ~ /^Pk 155$/ {
# print the line number and line in the same format as (e)grep -n
print NR":"$0
}[/format]

compressed:
[format]BEGIN {FS="\t"} NR == 1 {next} $3 ~ /^Pk 155$/ {print NR":"$0}[/format]

For an exact field match (hence the regex starts with a caret and ends with a dollar, with the literal string in between) we can instead perform a string comparison. So for an exact string match we can use the == operator:
[format]BEGIN {FS="\t"} NR == 1 {next} $3 == "Pk 155" {print NR":"$0}[/format]

There is no support for case-insensitive regex or string matching in awk, but the tolower() function can be used to build a case-insensitive match. Just make sure that the string you match against is lower-case as well.
[format]BEGIN {FS="\t"} NR == 1 {next} tolower($3) == "pk 155" {print NR":"$0}[/format]
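For anyone who wants to try it, here’s that compressed case-insensitive version run against a small made-up sample (written with \t escapes so nothing is lost in copying):

```shell
printf 'year\tcode\tname\n2015\tAAA\tPK 155\n2016\tBBB\tpk 200\n' > /tmp/test.csv

# tolower($3) folds the field before the comparison, so 'PK 155' matches.
# NR counts the header too, so the printed number is the original line number.
awk 'BEGIN {FS="\t"} NR == 1 {next} tolower($3) == "pk 155" {print NR":"$0}' /tmp/test.csv
```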

Last but not least, in AppleScript, using a case-insensitive regex and assuming the test.csv file is located on the desktop as in my previous example, the script would look like:

set awkScript to "BEGIN {FS=\"\\t\"} NR == 1 {next} tolower($3) ~ /^pk 155$/ {print NR\":\"$0}"
set myFile to quoted form of (POSIX path of (path to desktop folder) & "test.csv")
do shell script "awk " & quoted form of awkScript & space & myFile

Hey DJ,

Thanks for the lesson.

I get all of it but need to work through it until it becomes part of my functional vocabulary.

I have gawk on the machine but like to use the stock awk for portability.

I haven’t looked at the Shell-Scripting-Primer in a good while, so thanks for reminding me of “How AWK-ward”.

-Chris