ASObjC Regular Expression Handlers

The ASObjC implementation of regular expressions is quite good, and I’ve included some examples below. The methods and syntax used with ASObjC can be a bit arcane, but the following handlers don’t require a knowledge of that. Simply select a handler, set the regex pattern, and run the handler with a string as a parameter. A few preliminary notes:

  • ASObjC follows the ICU specification for regular expressions.

  • Two backslashes need to be used just about anywhere one backslash would normally be used in a regex pattern, because AppleScript treats a single backslash in a string literal as an escape character. For example, use \\d for a digit and \\. for a literal dot.

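  • The value 1024 passed as the options parameter is the numeric value of NSRegularExpressionSearch, which tells the NSString methods to treat the search string as a regex pattern.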

  • These handlers are case sensitive, but that can be changed by inserting (?i) at the beginning of the regex pattern.

  • Many of the handlers in this thread are based on scripts in Shane’s ASObjC book.
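As a quick illustration of the backslash and case-sensitivity notes above, here is a minimal sketch. The sample string and the replacement text "vX" are made up for the example; the method call is the same one used in handler 1 below.

```applescript
use framework "Foundation"
use scripting additions

-- Sample text (hypothetical) containing the same word in two different cases
set theString to "Version 1.5 and VERSION 2.0"
set theString to current application's NSString's stringWithString:theString

-- (?i) makes the match case insensitive; \\d is a digit and \\. is a literal dot
set thePattern to "(?i)version \\d\\.\\d"

set newString to (theString's stringByReplacingOccurrencesOfString:thePattern withString:"vX" options:1024 range:{0, theString's |length|()}) as text -->"vX and vX"
```

Without the (?i) prefix, only "Version 1.5" would be replaced.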

All of the handlers require that the following header be placed at the beginning of the script.

use framework "Foundation"
use scripting additions

Handler 1 does a simple search and replace.

--Search and Replace
--Replace all instances of three consecutive digits with "xxx"

set theString to "aaa 111 bbb 22 ccc 333 ddd"

set newString to getNewString(theString) -->"aaa xxx bbb 22 ccc xxx ddd"

on getNewString(theString)
	set theString to current application's NSString's stringWithString:theString
	set thePattern to "\\d{3}"
	return (theString's stringByReplacingOccurrencesOfString:thePattern withString:"xxx" options:1024 range:{0, theString's |length|()}) as text
end getNewString

Handler 2 does a search and replace with a capture group, which is the portion of the pattern within parentheses. Because the pattern matches the entire string and the replacement is $1, the handler returns only the substring matched by the capture group and discards everything matched outside it. If the pattern does not match, the original string is returned unchanged. There can be multiple capture groups, and they are referenced in the replacement string as $1, $2, and so on.

--Search and Replace with Capture Group
--Return first instance of characters preceded and followed by \"

set theString to "This is \"quoted text\" in a string"

set theSubstring to getSubstring(theString) -->"quoted text"

on getSubstring(theString)
	set theString to current application's NSString's stringWithString:theString
	set thePattern to ".*?\\\"(.+?)\\\".*"
	return (theString's stringByReplacingOccurrencesOfString:thePattern withString:"$1" options:1024 range:{0, theString's |length|()}) as text
end getSubstring
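To show more than one capture group in the replacement string, here is a small sketch in the same style as handler 2. The handler name swapNames and the sample string are made up for the example.

```applescript
use framework "Foundation"
use scripting additions

--Search and Replace with two Capture Groups
--Swap "last, first" to "first last"

set theString to "Doe, John"

set newString to swapNames(theString) -->"John Doe"

on swapNames(theString)
	set theString to current application's NSString's stringWithString:theString
	-- group 1 is the word before the comma, group 2 is the word after it
	set thePattern to "(\\w+), (\\w+)"
	-- $2 and $1 insert the captured text in reverse order
	return (theString's stringByReplacingOccurrencesOfString:thePattern withString:"$2 $1" options:1024 range:{0, theString's |length|()}) as text
end swapNames
```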

Handler 3 returns every substring that matches the regex pattern.

--Return all matches
--Return all instances of 3 consecutive digits

set theString to "aaa 111 bbb 22 ccc 333 ddd"

set matchingSubstrings to getMatchingSubstrings(theString) --> {"111", "333"}

on getMatchingSubstrings(theString)
	set theString to current application's NSString's stringWithString:theString
	set thePattern to "\\d{3}"
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theRanges to (regexResults's valueForKey:"range") --an optimization
	set theMatches to current application's NSMutableArray's new()
	repeat with aRange in theRanges
		(theMatches's addObject:(theString's substringWithRange:aRange))
	end repeat
	return theMatches as list
end getMatchingSubstrings

Handler 4 returns only the first matching substring. Unlike handler 3, it searches with the NSString method rangeOfString:options: rather than with NSRegularExpression.

--Return first match
--Return first instance of 3 consecutive digits

set theString to "aaa 111 bbb 22 ccc 333 ddd"

set matchingSubstring to getMatchingSubstring(theString) -->"111"

on getMatchingSubstring(theString)
	set theString to current application's NSString's stringWithString:theString
	set thePattern to "\\d{3}"
	set theRange to theString's rangeOfString:thePattern options:1024
	return (theString's substringWithRange:theRange) as text
end getMatchingSubstring
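Handler 4 assumes that the pattern matches. When nothing matches, rangeOfString:options: returns a range whose location is NSNotFound, and substringWithRange: then raises an error. One possible alternative, sketched below, uses NSRegularExpression's firstMatchInString:options:range: and returns missing value when there is no match. The handler name getFirstMatchOrMissing is made up for the example, and returning missing value is just one way to signal "not found".

```applescript
use framework "Foundation"
use scripting additions

--Return first match, or missing value if there is none

set theString to "aaa bbb ccc"

set matchingSubstring to getFirstMatchOrMissing(theString) -->missing value

on getFirstMatchOrMissing(theString)
	set theString to current application's NSString's stringWithString:theString
	set thePattern to "\\d{3}"
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set firstMatch to theRegex's firstMatchInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	if firstMatch is missing value then return missing value --no match found
	return (theString's substringWithRange:(firstMatch's rangeAtIndex:0)) as text --index 0 is the whole match
end getFirstMatchOrMissing
```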

Handler 5 is the same as handler 3, but it uses a capture group.

--Return all matches with capture group
--Return all matches in parentheses

set theString to "aaa (111) bbb 22 ccc (333) ddd"

set theSubstrings to getSubstrings(theString) -->{"111", "333"}

on getSubstrings(theString)
	set theString to current application's NSString's stringWithString:theString
	set thePattern to "\\((.+?)\\)"
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theMatches to current application's NSMutableArray's new()
	repeat with aMatch in regexResults
		set theRange to (aMatch's rangeAtIndex:1) --capture group 1
		(theMatches's addObject:(theString's substringWithRange:theRange))
	end repeat
	return theMatches as list
end getSubstrings

Handler 6 returns a count of the matches. This handler is extremely fast, taking only a few milliseconds to count over 1,000 matches in the text of a 159-page PDF book.

--Return the number of matches found
--Return the number of 3 consecutive digits

set theString to "aaa 111 bbb 22 ccc 333 ddd"

set matchCount to getMatchCount(theString) -->2

on getMatchCount(theString)
	set theString to current application's NSString's stringWithString:theString
	set thePattern to "\\d{3}"
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	return theRegex's numberOfMatchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
end getMatchCount

I’ve included below a few regex patterns that might be of use.

Trim horizontal whitespace from beginning and end of every line
Handler - handler 1
Pattern - "(?m)^\\h+|\\h+$"
Replacement - ""

Remove every line that is empty or contains whitespace only
Handler - handler 2
Pattern (from Nigel) - "(\\R)(?:\\h*+\\R)++"
Replacement - "$1"

Count words in a string
Handler - handler 6
Pattern - "\\w+"
Comment - edit pattern based on what constitutes a word

Return the nth line of a string
Handler - handler 2
Pattern - "(?s)(?:.*?\\n){0}(.*?)(?:\\n|$).*"
Replacement - "$1"
Comment - the line number is zero-based and is 0 in the above pattern

Return multiple lines of a string
Handler - handler 2
Pattern - "(?s)(?:.*?\\n){2}((?:.*?\\n){1}.*?)(?:\\n|$).*"
Replacement - "$1"
Comment - the 2 in the pattern is the number of lines to skip (i.e. the zero-based number of the first line returned), and the 1 is one less than the number of lines returned (i.e. the above pattern returns lines 3 and 4)

Return lines that contain any of the specified words
Handler - handler 3
Pattern - "(?m)^.*(?:word one|word two).*$"
Comment - the vertical bar is alternation and can be repeated multiple times
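
The patterns above that list a replacement can be tried without editing a handler each time. The sketch below is handler 1 with the pattern and replacement passed in as parameters. The handler name replaceMatches, its parameter names, and the sample string are made up for the example.

```applescript
use framework "Foundation"
use scripting additions

set theString to "  one  " & linefeed & tab & "two" & linefeed & "three "

--Trim horizontal whitespace from beginning and end of every line
set trimmedString to replaceMatches(theString, "(?m)^\\h+|\\h+$", "")

--Return the line with zero-based line number 1 (the second line)
set secondLine to replaceMatches(trimmedString, "(?s)(?:.*?\\n){1}(.*?)(?:\\n|$).*", "$1") -->"two"

on replaceMatches(theString, thePattern, theReplacement)
	set theString to current application's NSString's stringWithString:theString
	--option 1024 tells the method to treat thePattern as a regex
	return (theString's stringByReplacingOccurrencesOfString:thePattern withString:theReplacement options:1024 range:{0, theString's |length|()}) as text
end replaceMatches
```

After the first call, trimmedString contains the three words on separate lines with no leading or trailing whitespace.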

Peavine, I appreciate your instructional approach, breaking down regex patterns into more understandable components.
Along the same lines, I have enjoyed your approach to other AppleScript topics in other postings at MacScripter.
