But I fear do shell script lets it down. For kicks I made a 5000 line sample containing 20 matches. I compared your script with this:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
set searchString to "Pka 155"
set theText to current application's NSString's stringWithContentsOfFile:(POSIX path of (path to desktop folder) & "Sample.txt") encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:("^(?:[^\\t]+)\\t(?:[^\\t]+)\\t(" & searchString & ")\\t.*$") options:(current application's NSRegularExpressionAnchorsMatchLines) |error|:(missing value)
set theFinds to theRegex's matchesInString:theText options:0 range:{0, theText's |length|()}
set theResults to current application's NSMutableArray's array()
repeat with aFind in theFinds
(theResults's addObject:(theText's substringWithRange:(aFind's range())))
end repeat
(theResults's componentsJoinedByString:linefeed) as text
Script Geek suggests the shell version takes about 6 times longer.
Of course I’m cherry-picking; as the number of matches goes up, the advantage tips the other way – dramatically. But for a modest number of finds the ASObjC method is considerably quicker than the overhead of any call to do shell script, regardless of how fast the actual shell code is.
(and yes, this too is failing to get the line numbers.)
on getMatchingLines(sourceFile, searchString, delimiter, searchColumn)
script o
property lineList : paragraphs of (read sourceFile as «class utf8»)
end script
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to delimiter
repeat with i from 2 to (count o's lineList)
set thisLine to item i of o's lineList
try
if (text item searchColumn of thisLine is searchString) then set item i of o's lineList to {i - 1, thisLine}
end try
end repeat
set AppleScript's text item delimiters to astid
return o's lineList's lists
end getMatchingLines
set sourceFile to ((path to desktop as text) & "Test.csv") as alias
set searchString to "Pk 155"
set delimiter to tab
set searchColumn to 3
getMatchingLines(sourceFile, searchString, delimiter, searchColumn)
The delimiter parameter’s just a bit of dressing. Like all the solutions, the script’s not suitable for a general CSV use as is doesn’t take quoted fields into account.
Yes, that the whole problem of shell script (firing an AppleEvent and forking a new process and starting up bash in sh mode). The overhead of a do shell script is huge I agree but I was thinking more like tens of thousands of lines and thousands of matches. At least it’s the kind of of CSV data we handle on a daily basis (300k is not uncommon). Still “more than 5000 lines” can be 5001 till infinite.
I’m sorry, I missed that part where the OP wanted that. The option -n prints the line line number as well with grep it’s no biggie ;). I updated the code above.
I can help you if you want because for search and replace regex I like awk and use it a lot.
Actually to be correct this is TSV data. Unlike CSV data in TSV data it is not allowed to have a field delimiter in the values, so there is no quoting or escaping involved in TSV data. Otherwise AppleScript Toolbox is a great tool to read CSV data quoted and unquoted fields right into an 2d list
Here is a nice AWK-ward article by Apple explaining a bit more than the man page in the terminal. While most Linux and Unix users uses gawk which has much more function and environment variables, Apple’s Kernighan implementation has less.
The beauty of AWK is that there is a constructor and destructor function, user defined function support and multiple expressions for each line which will be all executed, match or no match. I saw that you already used the right next instruction which will be powerful in our situation.
The most verbose form of the script:
[format]# constructor
BEGIN {
# set field separator to tab
FS=" "
}
expression: If line number is 1.
NR == 1 {
# The ‘next’ instruction will be invoked immediately.
# The remaining expression(s) won’t be executed after next.
next
}
expression: If 3rd field matches regex.
$3 ~ /^Pk 155$/ {
# print the line number and line in the same format as (e)grep -n
print NR":"$0
}[/format]
For an exact field match (hence the regex start with caret, end with dollar and literal string in between) we can perform an string comparison. So an exact string match we can use the == operator:
[format]BEGIN {FS=" "} NR == 1 {next} $3 == "Pk 155" {print NR":"$0}[/format]
There is no support for case insensitive regex or string match in awk but there is the function tolower() to make a case insensitive match. Just make sure that the string you match against is lower case as well.
[format]BEGIN {FS=" "} NR == 1 {next} tolower($3) == "pk 155" {print NR":"$0}[/format]
Last but not least, in AppleScript with using a regex case insensitive and assuming the test.csv file is located on the desktop as in my own previous example the script would look like:
set awkScript to "BEGIN {FS=\" \"} NR == 1 {next} tolower($3) ~ /^pk 155$/ {print NR\":\"$0}"
set myFile to quoted form of POSIX path of (path to desktop folder) & "test.csv"
do shell script "awk " & quoted form of awkScript & space & myFile