I was able to run a lengthy grep command in Terminal with a very quick and successful response.
The shell equivalent in AppleScript failed and crashed Script Debugger:
set grepSh to " grep -nRHIi 'claim number\\|control\\|page\\|patient name'" & quoted form of TargetDirectory & "| grep -iv 'packet\\|explanation of\\|refer\\|detach\\|external'| sort -n -t - -k 2"
set GrepPageResults to do shell script grepSh
A shell equivalent of a smaller grep command succeeded without crashing Script Debugger.
set grepSh to "grep -nRHIi " & quoted form of "page" & space & quoted form of TargetDirectory & "| grep -iv refer\\|detach " & " | sort -n -t - -k 2"
set GrepPageResults to do shell script grepSh
It appears to me that my grep shell command has overloaded AppleScript. I looked at Shane’s RegexAndStuffLib, but could not find a method to find multiple items, exclude others and then sort by tabs.
I would appreciate any idea on methods to find or grep for multiple words, while excluding others and then to sort the results.
It’s not clear what’s in the text you’re parsing. One thing you could try is to increase the level of backslash escaping in the shell script text. It can get rather complex with the need to escape the backslash in the string passed to the shell and to escape both of those backslashes in the AppleScript text representing the process! You may need to experiment:
set grepSh to " grep -nRHIi 'claim number\\\\|control\\\\|page\\\\|patient name'" & quoted form of TargetDirectory & "| grep -iv 'packet\\\\|explanation of\\\\|refer\\\\|detach\\\\|external'| sort -n -t - -k 2"
set GrepPageResults to do shell script grepSh
Thanks Nigel for your help. My goal is to find files in a directory, in this case the Desktop directory, that contain any of a list of words and phrases, such as “claim number, control, page, and patient name”, while excluding lines with other words and phrases such as “packet, explanation of, refer, detach and external”. After finding the words case-insensitively, along with the lines in which they occur, I piped the results to the Unix sort command so that I could sort the found set by the files in which they were located.
Regarding the multiple backslashes, I initially loaded my AppleScript with only two backslashes “\\”, which unfortunately expanded to “\\\\” when I uploaded it to Macscripter. I have re-uploaded the original script with only the two backslashes required by AppleScript to produce the single backslash that the grep command needs when run from Terminal.
set grepSh to " grep -nRHIi 'claim number\\|control\\|page\\|patient name'" & quoted form of TargetDirectory & "| grep -iv 'packet\\|explanation of\\|refer\\|detach\\|external'| sort -n -t - -k 2"
I would like to find another method of finding multiple words in files of a folder, while excluding other words, and then sort the found data by the names of those files in that folder
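For comparison, here is a minimal shell sketch of the same find, exclude and sort pipeline, not a tested replacement for the script above. It assumes grep's -E option (so the alternation needs no backslashes at all), uses hypothetical sample files in a temporary folder, and sorts on the file path and then the line number, which is a different sort key from the original "-t - -k 2". Note that the second grep also sees the file path at the start of each line, so an excluded word in a folder name would drop every match from that folder.

```shell
#!/bin/sh
# Hypothetical sample data so the sketch runs on its own.
target_dir=$(mktemp -d "${TMPDIR:-/tmp}/grepdemo.XXXXXX")
printf 'Claim Number: 123\nRefer to page 2\nPage 1 of 3\n' > "$target_dir/a.txt"
printf 'Patient Name: Doe\nDetach this control slip\nCONTROL 77\n' > "$target_dir/b.txt"

# -E allows plain | alternation, so no backslash escaping is needed.
# The directory is passed as its own argument, separated from the pattern by a space.
matches=$(grep -nRHIiE 'claim number|control|page|patient name' "$target_dir" \
  | grep -ivE 'packet|explanation of|refer|detach|external' \
  | sort -t : -k 1,1 -k 2,2n)

# Each line has the form path:line number:text, grouped by file.
printf '%s\n' "$matches"

# Remove the sample data.
rm -rf "$target_dir"
```

From AppleScript, the same command text could be built with quoted form of TargetDirectory and passed to do shell script, provided a space separates the pattern from the directory.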
It looks as if you’d need six backslashes before each vertical bar. What I was trying to get across above was that with ‘do shell script’, the text sent to grep is a string within a string within a string:
The string parameter sent to grep, here containing “\|”
The text of the shell script command which includes the grep string parameter. It seems to require both the backslash and the bar in the string to be escaped: “\\\|”
The AppleScript source code which produces the shell script text. In this, all three backslashes need to be escaped: “\\\\\\|”
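One way to experiment with the escaping is to print the string the shell actually hands to a command, rather than guessing. The sketch below is only an illustration with a made-up pattern. It suggests that the number of backslashes needed depends on which kind of shell quotes surround the pattern: inside single quotes the shell leaves backslashes alone, while inside double quotes a doubled backslash collapses to one.

```shell
#!/bin/sh
# Inside single quotes the shell keeps every character, backslashes included.
single=$(printf '%s' 'page\|refer')

# Inside double quotes, \\ collapses to a single backslash before printf sees it.
double=$(printf '%s' "page\\|refer")

printf 'single-quoted: %s\n' "$single"
printf 'double-quoted: %s\n' "$double"
```

On the AppleScript side, returning or logging grepSh before calling do shell script shows the exact text that the shell will receive.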
Separately, your first AppleScript (crashy) lacks a space before the directory, whereas your second AppleScript (non-crashy) has a space. What happens if you add a space there in the problematic script?
patient name'" & quoted form
→ patient name''/Users
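To illustrate why that space matters, here is a small sketch. The count_args helper and the /Users/me/Desktop path are hypothetical, used only for the demonstration. When two quoted strings touch, the shell joins them into one word, so grep would receive the pattern and the path glued together as its pattern and no directory argument at all.

```shell
#!/bin/sh
# count_args is a hypothetical helper that reports how many arguments it receives.
count_args() { echo "$#"; }

# Adjacent quoted strings with no space between them form a single word.
joined=$(count_args 'patient name''/Users/me/Desktop')

# With a space, the pattern and the directory are separate arguments.
separate=$(count_args 'patient name' '/Users/me/Desktop')

echo "without space: $joined argument(s)"
echo "with space: $separate argument(s)"
```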
Nigel, Thanks for the clarification of the backslash additions in AppleScript. This AppleScript modification is good to know.
Mockman, Thanks for finding the extra space that I erroneously added. I deleted that space, but unfortunately, the result did not change, with the grep shell script still causing Script Debugger to spin for a long time.
Peavine, Thanks for your Foundation framework script. It worked well to find those files that contained specific words and then sorted the files.
My goal is to analyze the matched lines of text for other words that might precede or follow the specified words.
Using the Foundation framework, how might I capture the entire line that contains the matched target words, similar to grep’s -n option?
I’ve included a suggestion below. It prompts the user to select a text file and returns a string of lines that both contain and do not contain specified words. The timing result with a file that contained 1,651 lines was 11 milliseconds.
A few comments:
The script returns a string but is easily modified to return a list or array.
Line numbers can be added to the returned lines by setting a positional parameter to true.
The script ignores word boundaries when searching, but this can be changed by editing the include and exclude patterns to:
set includePattern to "(?im)^.*\\b(" & includeWords & ")\\b.*$"
set excludePattern to "(?im)^.*\\b(" & excludeWords & ")\\b.*$"
If you do not want the script to exclude lines containing specific words or phrases, simply enter characters that will not be found. For example:
set includeWords to "word one|word two"
set excludeWords to "xxxx"
If you want the script to return every line except those containing specific words or phrases, set includeWords to an empty string. For example:
set includeWords to ""
set excludeWords to "word three|word four"
If you want the search to be case sensitive, replace (?im) with (?m) in two places.
-- revised 2022.12.01
use framework "Foundation"
use scripting additions
set theFile to POSIX path of (choose file of type "txt")
set matchingLines to getMatchingLines(theFile, false) -- true will number lines
on getMatchingLines(theFile, lineNumbers)
    set includeWords to "word one|word two"
    set excludeWords to "word three|word four"
    set includePattern to "(?im)^.*(" & includeWords & ").*$"
    set excludePattern to "(?im)^.*(" & excludeWords & ").*$"
    if lineNumbers is true then
        -- normalize CR and CRLF line endings, then prefix each line with its number
        set theString to do shell script "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & quoted form of theFile & " | sed = | sed " & quoted form of "N;s/\\n/\\. /"
        set theString to current application's NSString's stringWithString:theString
    else
        set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
        set theString to theString's stringByReplacingOccurrencesOfString:(return & linefeed) withString:linefeed
    end if
    -- split the text into lines
    set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
    set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
    -- keep lines that match the include pattern, then remove those that match the exclude pattern
    set includePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", includePattern)
    set includeArray to (theArray's filteredArrayUsingPredicate:includePredicate)'s mutableCopy()
    set excludePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", excludePattern)
    set excludeArray to theArray's filteredArrayUsingPredicate:excludePredicate
    includeArray's removeObjectsInArray:excludeArray
    return ((includeArray's componentsJoinedByString:linefeed) as text)
end getMatchingLines
The following script is similar to that in post 8, differing primarily in that it works on all files in a folder selected by the user. The script returns a string containing each file’s POSIX path (in name order), immediately followed by the matching lines. The timing result with a folder that contained 10 text files, each of which contained 1,651 lines was 110 milliseconds.
-- revised 2022.12.01
use framework "Foundation"
use scripting additions
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder)
set matchingData to getMatchingData(theFiles, false) -- true to number lines
on getFiles(theFolder)
    set fileManager to current application's NSFileManager's defaultManager()
    set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
    -- option 4 skips hidden files
    set folderContents to fileManager's contentsOfDirectoryAtURL:(theFolder) includingPropertiesForKeys:{} options:4 |error|:(missing value)
    set thePredicate to current application's NSPredicate's predicateWithFormat:"pathExtension ==[c] 'txt'"
    set theFiles to (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
    return (theFiles's sortedArrayUsingSelector:"localizedStandardCompare:")
end getFiles

on getMatchingData(theFiles, lineNumbers)
    set includeWords to "word one|word two"
    set excludeWords to "word three|word four"
    set includePattern to "(?im)^.*(" & includeWords & ").*$"
    set excludePattern to "(?im)^.*(" & excludeWords & ").*$"
    set matchingData to current application's NSMutableArray's new()
    repeat with aFile in theFiles
        -- add the file path, followed by that file's matching lines
        (matchingData's addObject:aFile)
        set matchingLines to getMatchingLines(aFile, includePattern, excludePattern, lineNumbers)
        if (matchingLines's isEqualToString:"") then set matchingLines to (matchingLines's stringByAppendingString:"** No matching lines were found **")
        set matchingLines to (matchingLines's stringByAppendingString:linefeed)
        (matchingData's addObject:matchingLines)
    end repeat
    return ((matchingData's componentsJoinedByString:linefeed) as text)
end getMatchingData

on getMatchingLines(theFile, includePattern, excludePattern, lineNumbers)
    if lineNumbers is true then
        -- normalize CR and CRLF line endings, then prefix each line with its number
        set theString to do shell script "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & quoted form of (theFile as text) & " | sed = | sed " & quoted form of "N;s/\\n/\\. /"
        set theString to current application's NSString's stringWithString:theString
    else
        set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
        set theString to theString's stringByReplacingOccurrencesOfString:(return & linefeed) withString:linefeed
    end if
    set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
    set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
    set includePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", includePattern)
    set includeArray to (theArray's filteredArrayUsingPredicate:includePredicate)'s mutableCopy()
    set excludePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", excludePattern)
    set excludeArray to theArray's filteredArrayUsingPredicate:excludePredicate
    includeArray's removeObjectsInArray:excludeArray
    return (includeArray's componentsJoinedByString:linefeed)
end getMatchingLines
Thank you peavine for your wonderful example of how to write Foundation framework’s AppleScript Objective C methods to find files containing specific words. I appreciate your sorting of the files, by file name, and have learned a lot from your example.
However, now that your script returns the found lines containing specific words, how might I parse through each file’s found lines, with a loop or otherwise, to manipulate the found data file by file?
I have attempted to return every paragraph of the returned result, but that split the found items into separate paragraphs and lost the path of the file in which the words were found.
Is it possible to return the found items as an array or list, which AppleScript could then parse through, so that each file’s path and its found lines containing the included words could be further analyzed?
Expressed differently, is it possible to return an array of arrays, rather than an array of lines separated by linefeeds? If so, is it possible to sort that array of arrays by file path name?
I am at a loss, and would appreciate more insight or direction. Thank you in advance.
akim. I’ve revised my script to return an array of arrays. The subarrays will be sorted by file name, and each subarray will contain two items–the file path and an array of matching lines.
A slight alternative to the above is to coerce the array of arrays to a list of lists, which can then be analyzed with basic AppleScript.
-- revised 2022.12.01
use framework "Foundation"
use scripting additions
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder)
set matchingData to getMatchingData(theFiles, false) -- true to number lines
on getFiles(theFolder)
    set fileManager to current application's NSFileManager's defaultManager()
    set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
    -- option 4 skips hidden files
    set folderContents to fileManager's contentsOfDirectoryAtURL:(theFolder) includingPropertiesForKeys:{} options:4 |error|:(missing value)
    set thePredicate to current application's NSPredicate's predicateWithFormat:"pathExtension ==[c] 'txt'"
    set theFiles to (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
    return (theFiles's sortedArrayUsingSelector:"localizedStandardCompare:")
end getFiles

on getMatchingData(theFiles, lineNumbers)
    set includeWords to "word one|word two"
    set excludeWords to "word three|word four"
    set includePattern to "(?im)^.*(" & includeWords & ").*$"
    set excludePattern to "(?im)^.*(" & excludeWords & ").*$"
    set matchingData to current application's NSMutableArray's new()
    repeat with aFile in theFiles
        -- each subarray holds the file path and an array of that file's matching lines
        set anArray to current application's NSMutableArray's new()
        (anArray's addObject:aFile)
        set matchingLines to getMatchingLines(aFile, includePattern, excludePattern, lineNumbers)
        (anArray's addObject:matchingLines)
        (matchingData's addObject:anArray)
    end repeat
    return matchingData -- coerce to list if desired
end getMatchingData

on getMatchingLines(theFile, includePattern, excludePattern, lineNumbers)
    if lineNumbers is true then
        -- normalize CR and CRLF line endings, then prefix each line with its number
        set theString to do shell script "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & quoted form of (theFile as text) & " | sed = | sed " & quoted form of "N;s/\\n/\\. /"
        set theString to current application's NSString's stringWithString:theString
    else
        set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
        set theString to theString's stringByReplacingOccurrencesOfString:(return & linefeed) withString:linefeed
    end if
    set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
    set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
    set includePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", includePattern)
    set includeArray to (theArray's filteredArrayUsingPredicate:includePredicate)'s mutableCopy()
    set excludePredicate to current application's NSPredicate's predicateWithFormat_("(self MATCHES %@)", excludePattern)
    set excludeArray to theArray's filteredArrayUsingPredicate:excludePredicate
    (includeArray's removeObjectsInArray:excludeArray)
    return includeArray
end getMatchingLines
If you coerce the array of arrays to a list of lists, you could do the analysis as follows:
repeat with aList in matchingData
    set aList to contents of aList
    set theFile to item 1 of aList
    set matchingLines to item 2 of aList
    repeat with aLine in matchingLines
        set aLine to contents of aLine
        -- analyze aLine
    end repeat
end repeat
Peavine, I am attempting to match lines that contain a certain word, such as “page”, then include the rest of that line and any subsequent lines that have no text or digits, until the match reaches a line starting with digits. For example, I am attempting to capture the lines
I added [\s\r\n]* to locate patterns of zero or more spaces, returns or newlines, followed by .* for any character except a new line, followed by [\d]* for any one or more digits.
set includeWords to "page"
set includePattern to "(?im)^(" & includeWords & ").*[\\s\\r\\n]*.*[\\d]*"
This however failed and yielded an empty match.
Is there a way to utilize NSPredicate’s predicateWithFormat with a regular expression to match the includeWords followed by one or more empty lines, and then ending with a line starting with digits?
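I can't say for certain why the predicate returns nothing, but two things seem likely to contribute. The scripts above split the text into lines before filtering, so each string the predicate tests is a single line and a pattern spanning several lines has nothing to match against. MATCHES also has to match the whole string, not just part of it. As a hedged alternative outside Foundation, the awk sketch below captures a "page" line, any blank lines that follow, and the first line starting with digits. The sample text is hypothetical, and the rule that any other line abandons the block is my assumption about the desired behaviour.

```shell
#!/bin/sh
# Hypothetical sample text: a "page" line, blank lines, then a line starting with digits.
sample=$(printf 'header\nPage 3 of 9\n\n\n12345 Smith\nfooter\n')

captured=$(printf '%s\n' "$sample" | awk '
  # A line containing "page" (any case) opens a block.
  tolower($0) ~ /page/ { block = $0; in_block = 1; next }
  # Blank or whitespace-only lines extend an open block.
  in_block && /^[[:space:]]*$/ { block = block "\n" $0; next }
  # A line starting with digits closes the block and prints it.
  in_block && /^[0-9]/ { print block "\n" $0; in_block = 0; next }
  # Any other line abandons the block.
  { in_block = 0 }
')

printf '%s\n' "$captured"
```

The same command could be run from AppleScript with do shell script, passing the file through quoted form of its POSIX path instead of the sample variable.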