Alternative to Grep?

I was able to run a lengthy grep command in Terminal with a very quick and successful response.

The shell equivalent in AppleScript failed and crashed Script Debugger:


	set grepSh to " grep -nRHIi   'claim number\\|control\\|page\\|patient name'" & quoted form of TargetDirectory & "| grep -iv 'packet\\|explanation of\\|refer\\|detach\\|external'| sort -n -t  - -k 2"
	set GrepPageResults to do shell script grepSh

A shell equivalent of a smaller grep command succeeded without crashing Script Debugger.


		set grepSh to "grep  -nRHIi  " & quoted form of "page" & space & quoted form of TargetDirectory & "| grep -iv refer\\|detach " & " | sort -n -t  - -k 2"
	set GrepPageResults to do shell script grepSh

It appears to me that my grep shell command has overloaded AppleScript. I looked at Shane’s RegexAndStuffLib, but could not find a method to find multiple items, exclude others and then sort by tabs.

I would appreciate any ideas on methods to find or grep for multiple words, while excluding others, and then to sort the results.

Hi.

It’s not clear what’s in the text you’re parsing. One thing you could try is to increase the level of backslash escaping in the shell script text. It can get rather complex with the need to escape the backslash in the string passed to the shell and to escape both of those backslashes in the AppleScript text representing the process! You may need to experiment:

	set grepSh to " grep -nRHIi   'claim number\\\\|control\\\\|page\\\\|patient name'" & quoted form of TargetDirectory & "| grep -iv 'packet\\\\|explanation of\\\\|refer\\\\|detach\\\\|external'| sort -n -t  - -k 2"
	set GrepPageResults to do shell script grepSh

Thanks, Nigel, for your help. My goal is to find files in a directory, in this case the Desktop directory, that contain a list of words and phrases, such as “claim number”, “control”, “page”, and “patient name”, while excluding other words and phrases such as “packet”, “explanation of”, “refer”, “detach”, and “external”. After finding the words case-insensitively, along with the lines in which they occur, I piped the results to the Unix sort command so that I could sort the found set by the files in which they were located.

Regarding the multiple backslashes, I initially loaded my AppleScript with only two backslashes “\\”, which unfortunately expanded to “\\\\” when I uploaded it to MacScripter. I have re-uploaded the original script with only the two backslashes required by AppleScript to produce the single backslash required in the grep command when run from Terminal.

set grepSh to " grep -nRHIi   'claim number\\|control\\|page\\|patient name'" & quoted form of TargetDirectory & "| grep -iv 'packet\\|explanation of\\|refer\\|detach\\|external'| sort -n -t  - -k 2"

I would like to find another method of finding multiple words in the files of a folder, while excluding other words, and then sorting the found data by the names of those files in that folder.
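
One possible alternative, sketched in plain shell under a few assumptions: grep's -E option (extended regular expressions) lets the alternatives be separated by a bare |, so no backslashes are needed at any layer. The function name search_dir is hypothetical, and sorting by file name and then by line number is an assumption about the intended order; it replaces the original sort -n -t - -k 2.

```shell
# search_dir is a hypothetical helper; $1 is the directory to search.
# First grep:  -n line numbers, -R recurse, -H show file names,
#              -I skip binary files, -i ignore case, -E extended regex.
# Second grep: drops every output line that mentions an excluded word
#              (note that it also sees the path prefix of each line).
# sort:        by file name (field 1), then numerically by line number
#              (field 2); assumes file paths contain no ":" characters.
search_dir() {
    grep -nRHIi -E 'claim number|control|page|patient name' -- "$1" |
        grep -iv -E 'packet|explanation of|refer|detach|external' |
        sort -t : -k 1,1 -k 2,2n
}
```

From AppleScript, the same pipeline could be handed to do shell script with the directory appended via quoted form of, keeping a space between the pattern and the path.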

Hi akim.

It looks as if you’d need six backslashes before each vertical bar. What I was trying to get across above was that with ‘do shell script’, the text sent to grep is a string within a string within a string:

The string parameter sent to grep, here containing “\|”
The text of the shell script command which includes the grep string parameter. It seems to require both the backslash and the bar in the string to be escaped: “\\\|”
The AppleScript source code which produces the shell script text. In this, all three backslashes need to be escaped: “\\\\\\|”
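
The number of backslashes that survive the shell layer can be checked directly rather than guessed. The following is a minimal sketch; show_arg is a hypothetical helper that prints whatever argument a command such as grep would receive after the shell has parsed the line.

```shell
# show_arg is a hypothetical helper: it prints the one argument it receives,
# which is what grep would see after the shell has done its own parsing.
show_arg() { printf '%s\n' "$1"; }

# Single quotes: the shell leaves backslashes alone.
show_arg 'page\|refer'

# Double quotes: the shell collapses \\ to a single backslash.
show_arg "page\\|refer"

# No quotes: the shell removes the backslash and passes a bare bar.
show_arg page\|refer
```

The first two calls print page\|refer and the third prints page|refer. For a pattern inside single quotes, as in the longer script, this suggests that only the AppleScript layer needs the extra backslash. The unquoted form is the one produced by refer\\|detach in the shorter script, where grep would most likely treat the bare bar as an ordinary character rather than as alternation.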

Separately, your first applescript (crashy) lacks a space before the directory, whereas your second applescript (non-crashy) has a space. What happens if you add a space there in the problematic script?

patient name'" & quoted form
→ patient name''/Users

page" & space & quoted form
→ page' '/Users

or:
patient name' " & quoted form
→ patient name' '/Users
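
The effect of the missing space can be seen in the shell itself, which joins adjacent quoted strings into a single word. A small sketch, where count_args is a hypothetical helper and /Users/me/Desktop is a placeholder path:

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# /Users/me/Desktop is a placeholder path, not taken from the original post.

# No space: the pattern and the path fuse into one argument.
count_args 'patient name''/Users/me/Desktop'

# With a space: the pattern and the path are two separate arguments.
count_args 'patient name' '/Users/me/Desktop'
```

The first call prints 1 and the second prints 2. Without the space, grep receives a longer pattern and no directory at all; with -R it may then fall back to searching the current working directory, which could account for the long wait.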

Nigel, Thanks for the clarification of the backslash additions in AppleScript. This AppleScript modification is good to know.

Mockman, Thanks for finding the extra space that I erroneously added. I deleted that space, but unfortunately the result did not change: the grep shell script still caused Script Debugger to spin for a long time.

Peavine, Thanks for your Foundation framework script. It worked well to find those files that contained specific words and then sorted the files.

My goal is to analyze the matched lines of text for other words that might precede or follow the specified words.

Using the Foundation framework, how might I capture the entire line that contains the matched target words, similar to grep’s -n option?

The space needs to be there.

I’ve included a suggestion below. It prompts the user to select a text file and returns a string of the lines that contain the include words and do not contain the exclude words. The timing result with a file that contained 1,651 lines was 11 milliseconds.

A few comments:

  • The script returns a string but is easily modified to return a list or array.

  • Line numbers can be added to the returned lines by setting the handler’s second parameter to true.

  • The script ignores word boundaries when searching, but this can be changed by editing the include and exclude patterns to:

set includePattern to "(?im)^.*\\b(" & includeWords & ")\\b.*$"
set excludePattern to "(?im)^.*\\b(" & excludeWords & ")\\b.*$"

  • If you do not want the script to exclude lines containing specific words or phrases, simply enter characters that will not be found. For example:

set includeWords to "word one|word two"
set excludeWords to "xxxx"

  • If you want the script to return every line except those containing specific words or phrases, set includeWords to an empty string. For example:

set includeWords to ""
set excludeWords to "word three|word four"

  • If you want the search to be case sensitive, replace (?im) with (?m) in two places.

-- revised 2022.12.01

use framework "Foundation"
use scripting additions

set theFile to POSIX path of (choose file of type "txt")
set matchingLines to getMatchingLines(theFile, false) -- true will number lines

on getMatchingLines(theFile, lineNumbers)
	set includeWords to "word one|word two"
	set excludeWords to "word three|word four"
	set includePattern to "(?im)^.*(" & includeWords & ").*$"
	set excludePattern to "(?im)^.*(" & excludeWords & ").*$"
	
	if lineNumbers is true then
		set theString to do shell script "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & quoted form of theFile & " | sed = | sed " & quoted form of "N;s/\\n/\\. /"
		set theString to current application's NSString's stringWithString:theString
	else
		set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
		set theString to theString's stringByReplacingOccurrencesOfString:(return & linefeed) withString:linefeed
	end if
	
	set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
	set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
	
	set includePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", includePattern)
	set includeArray to (theArray's filteredArrayUsingPredicate:includePredicate)'s mutableCopy()
	set excludePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", excludePattern)
	set excludeArray to theArray's filteredArrayUsingPredicate:excludePredicate
	includeArray's removeObjectsInArray:excludeArray
	return ((includeArray's componentsJoinedByString:linefeed) as text)
end getMatchingLines
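
For comparison, the same include/exclude filtering with line numbers can be sketched with two grep calls. This is a different technique from the Foundation script above, not a translation of it; matching_lines and its parameter order are hypothetical.

```shell
# matching_lines is a hypothetical helper.
#   $1 = text file, $2 = include pattern, $3 = exclude pattern
# Both patterns are extended regular expressions such as 'word one|word two'.
# -n prefixes each kept line with its line number, -i ignores case.
# The exclude step sees the "N:" prefix as well, so an exclude pattern made
# of digits could also match the line number.
matching_lines() {
    grep -n -i -E -- "$2" "$1" | grep -v -i -E -- "$3"
}
```

A call might look like matching_lines notes.txt 'word one|word two' 'word three|word four', where notes.txt is a placeholder file name. Adding -w to both grep calls would restrict matches to whole words.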

Peavine, Thanks for the new AppleScript example. The Objective C lines of code have been very helpful.

The following script is similar to that in post 8, differing primarily in that it works on all files in a folder selected by the user. The script returns a string containing each file’s POSIX path (in name order), immediately followed by the matching lines. The timing result with a folder that contained 10 text files, each of which contained 1,651 lines, was 110 milliseconds.

-- revised 2022.12.01

use framework "Foundation"
use scripting additions

set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder)
set matchingData to getMatchingData(theFiles, false) -- true to number lines

on getFiles(theFolder)
	set fileManager to current application's NSFileManager's defaultManager()
	set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
	set folderContents to fileManager's contentsOfDirectoryAtURL:(theFolder) includingPropertiesForKeys:{} options:4 |error|:(missing value)
	set thePredicate to current application's NSPredicate's predicateWithFormat:"pathExtension ==[c] 'txt'"
	set theFiles to (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
	return (theFiles's sortedArrayUsingSelector:"localizedStandardCompare:")
end getFiles

on getMatchingData(theFiles, lineNumbers)
	set includeWords to "word one|word two"
	set excludeWords to "word three|word four"
	set includePattern to "(?im)^.*(" & includeWords & ").*$"
	set excludePattern to "(?im)^.*(" & excludeWords & ").*$"
	
	set matchingData to current application's NSMutableArray's new()
	repeat with aFile in theFiles
		(matchingData's addObject:aFile)
		set matchingLines to getMatchingLines(aFile, includePattern, excludePattern, lineNumbers)
		if (matchingLines's isEqualToString:"") then set matchingLines to (matchingLines's stringByAppendingString:"** No matching lines were found **")
		set matchingLines to (matchingLines's stringByAppendingString:linefeed)
		(matchingData's addObject:matchingLines)
	end repeat
	return ((matchingData's componentsJoinedByString:linefeed) as text)
end getMatchingData

on getMatchingLines(theFile, includePattern, excludePattern, lineNumbers)
	if lineNumbers is true then
		set theString to do shell script "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & quoted form of (theFile as text) & " | sed = | sed " & quoted form of "N;s/\\n/\\. /"
		set theString to current application's NSString's stringWithString:theString
	else
		set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
		set theString to theString's stringByReplacingOccurrencesOfString:(return & linefeed) withString:linefeed
	end if
	
	set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
	set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
	
	set includePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", includePattern)
	set includeArray to (theArray's filteredArrayUsingPredicate:includePredicate)'s mutableCopy()
	set excludePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", excludePattern)
	set excludeArray to theArray's filteredArrayUsingPredicate:excludePredicate
	includeArray's removeObjectsInArray:excludeArray
	return (includeArray's componentsJoinedByString:linefeed)
end getMatchingLines
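
A rough shell counterpart of the folder version, offered only as a sketch: it prints each text file's path followed by its matching lines. report_folder and its parameters are hypothetical, and unlike the script above the *.txt glob is case sensitive and its sort order depends on the shell's locale.

```shell
# report_folder is a hypothetical helper.
#   $1 = folder, $2 = include pattern, $3 = exclude pattern (extended regex)
report_folder() {
    for f in "$1"/*.txt; do
        [ -f "$f" ] || continue      # skip when the glob matches nothing
        printf '%s\n' "$f"
        # Numbered matching lines; "|| true" keeps an empty result from
        # being treated as a failure.
        lines=$(grep -n -i -E -- "$2" "$f" | grep -v -i -E -- "$3" || true)
        # Fall back to the same notice the AppleScript version uses.
        printf '%s\n\n' "${lines:-** No matching lines were found **}"
    done
}
```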

Thank you peavine for your wonderful example of how to write Foundation framework’s AppleScript Objective C methods to find files containing specific words. I appreciate your sorting of the files, by file name, and have learned a lot from your example.

However, now that your script returns the found lines containing specific words, how might I parse through each file’s found lines, with a loop or otherwise, to manipulate the found data file by file?

I have attempted to return every paragraph of the returned result, but that attempt parsed the found items into separate paragraphs and removed the identifying path of the file in which the words were found.

Is it possible to return the found items as an array or list, which AppleScript could then parse file by file, so that each file’s path and its found lines containing the included words could be further analyzed?

Expressed in a different manner, is it possible to return an array of arrays, rather than an array of lines separated by linefeeds? If so, is it possible to sort that array of arrays by file path name?

I am at a loss, and would appreciate more insight or direction. Thank you in advance.

akim. I’ve revised my script to return an array of arrays. The subarrays will be sorted by file name, and each subarray will contain two items–the file path and an array of matching lines.

A slight alternative to the above is to coerce the array of arrays to a list of lists, which can then be analyzed with basic AppleScript.

-- revised 2022.12.01

use framework "Foundation"
use scripting additions

set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder)
set matchingData to getMatchingData(theFiles, false) -- true to number lines

on getFiles(theFolder)
	set fileManager to current application's NSFileManager's defaultManager()
	set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
	set folderContents to fileManager's contentsOfDirectoryAtURL:(theFolder) includingPropertiesForKeys:{} options:4 |error|:(missing value)
	set thePredicate to current application's NSPredicate's predicateWithFormat:"pathExtension ==[c] 'txt'"
	set theFiles to (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
	return (theFiles's sortedArrayUsingSelector:"localizedStandardCompare:")
end getFiles

on getMatchingData(theFiles, lineNumbers)
	set includeWords to "word one|word two"
	set excludeWords to "word three|word four"
	set includePattern to "(?im)^.*(" & includeWords & ").*$"
	set excludePattern to "(?im)^.*(" & excludeWords & ").*$"
	
	set matchingData to current application's NSMutableArray's new()
	repeat with aFile in theFiles
		set anArray to current application's NSMutableArray's new()
		(anArray's addObject:aFile)
		set matchingLines to getMatchingLines(aFile, includePattern, excludePattern, lineNumbers)
		(anArray's addObject:matchingLines)
		(matchingData's addObject:anArray)
	end repeat
	return matchingData -- coerce to list if desired
end getMatchingData

on getMatchingLines(theFile, includePattern, excludePattern, lineNumbers)
	if lineNumbers is true then
		set theString to do shell script "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & quoted form of (theFile as text) & " | sed = | sed " & quoted form of "N;s/\\n/\\. /"
		set theString to current application's NSString's stringWithString:theString
	else
		set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
		set theString to theString's stringByReplacingOccurrencesOfString:(return & linefeed) withString:linefeed
	end if
	
	set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
	set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
	
	set includePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", includePattern)
	set includeArray to (theArray's filteredArrayUsingPredicate:includePredicate)'s mutableCopy()
	set excludePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", excludePattern)
	set excludeArray to theArray's filteredArrayUsingPredicate:excludePredicate
	(includeArray's removeObjectsInArray:excludeArray)
	return includeArray
end getMatchingLines

If you coerce the array of arrays to a list of lists, you could do the analysis as follows:

repeat with aList in matchingData
	set aList to contents of aList
	set theFile to item 1 of aList
	set matchingLines to item 2 of aList
	repeat with aLine in matchingLines
		set aLine to contents of aLine
		-- analyze aLine
	end repeat
end repeat
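
The same file-by-file walk can be sketched over grep -Hn style output, where each record has the form path:line:text. The helper name summarize is hypothetical, and the sketch assumes the records arrive grouped by file and that paths contain no colon.

```shell
# summarize is a hypothetical helper. It reads "path:line:text" records on
# standard input and prints a header whenever the file name changes.
summarize() {
    current=""
    # IFS=: splits on colons; the last variable keeps the rest of the line,
    # including any further colons in the text itself.
    while IFS=: read -r file num text; do
        if [ "$file" != "$current" ]; then
            printf 'File: %s\n' "$file"
            current=$file
        fi
        # analyze the line here; this sketch just prints it
        printf '  line %s: %s\n' "$num" "$text"
    done
}
```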

Peavine, your script worked well and accomplished the task! Once again, I thank you for your great ideas and help.

Peavine, I am attempting to match lines that have a certain word, such as “page”, and then include the rest of that line, continuing to include any subsequent lines that have no text or digits until the command matches a line starting with digits. For example, I am attempting to capture the lines

I added [\\s\\r\\n]* to locate patterns of zero or more spaces, returns or newlines, followed by .* for any character except a new line, followed by [\\d]* for any run of digits.

set includeWords to "page"
set includePattern to "(?im)^(" & includeWords & ").*[\\s\\r\\n]*.*[\\d]*"

This however failed and yielded an empty match.
Is there a way to utilize NSPredicate’s predicateWithFormat with a regular expression to match the includeWords followed by one or more empty lines, and then ending with a line starting with digits?

akim. I don’t know of a way to do what you want with predicateWithFormat.
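
A likely reason for the empty result is that the scripts above split the text at newlines and apply the predicate to one line at a time, so a pattern that spans several lines has nothing to match against. One alternative, swapping NSPredicate for awk, is sketched below. capture_blocks is a hypothetical helper, and the sketch assumes a block starts at any line containing the word, continues through blank lines, and ends at the first line that starts with a digit.

```shell
# capture_blocks is a hypothetical helper.
#   $1 = word to look for (matched case-insensitively), $2 = text file
# Each captured block is printed followed by a "--" separator line.
capture_blocks() {
    awk -v word="$1" '
        BEGIN { word = tolower(word) }
        # A line starting with a digit completes an open block.
        inblock && /^[0-9]/ { printf "%s%s\n--\n", block, $0; inblock = 0; block = ""; next }
        # Blank or whitespace-only lines are kept while a block is open.
        inblock && /^[ \t\r]*$/ { block = block $0 "\n"; next }
        # A line containing the word starts (or restarts) a block.
        index(tolower($0), word) { inblock = 1; block = $0 "\n"; next }
        # Any other text abandons the block being built.
        { inblock = 0; block = "" }
    ' "$2"
}
```

The output could be returned to AppleScript through do shell script and split on the "--" separator lines.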

Why so many backslashes? Have you observed that ‘do shell script’ in AppleScript uses the sh shell, while Terminal uses the zsh shell?