Regular Expression Lookbehind Pattern Question

The following regex pattern does what I want, but I don’t completely understand its operation. Why is the positive lookbehind required? Is there a better way to do this? Thanks!

use framework "Foundation"
use scripting additions

-- Test strings: each assignment replaces the one before, so only the last is used.
set theString to "A Number 123"
set theString to "A Number .123"
set theString to "A Number 0.123"
set theString to "A Number 1.123"
set theString to current application's NSString's stringWithString:theString

set thePattern to "\\d*(\\.\\d+)?(?<=\\d)"

-- options:1024 is NSRegularExpressionSearch.
set theRange to theString's rangeOfString:thePattern options:1024
set patternMatch to (theString's substringWithRange:theRange) as text

Hi peavine.

I’m still puzzling over this, but I think it’s something to do with everything before the lookbehind being optional. The lookbehind forces it not to be. A simpler pattern would be “\\d*\\.?\\d+”.
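For anyone wanting to try the simpler pattern against all four test strings in one run, here is a minimal sketch. The handler name firstMatch is my own invention for illustration, not something from the original script, and it treats a zero-length range as "nothing useful found".

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- firstMatch is a hypothetical helper, not part of the original posts.
on firstMatch(theText, thePattern)
	set theString to current application's NSString's stringWithString:theText
	-- 1024 is NSRegularExpressionSearch, as in the original script.
	set theRange to theString's rangeOfString:thePattern options:1024
	-- A zero length means either no match or an empty match.
	if (theRange's |length|) is 0 then return ""
	return (theString's substringWithRange:theRange) as text
end firstMatch

set thePattern to "\\d*\\.?\\d+"
set theResults to {}
repeat with aSample in {"A Number 123", "A Number .123", "A Number 0.123", "A Number 1.123"}
	set end of theResults to firstMatch(aSample as text, thePattern)
end repeat
-- theResults should now be {"123", ".123", "0.123", "1.123"}
```

Because the final \\d+ is mandatory, the pattern can't match an empty string, which is why no lookbehind is needed here.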

Thanks Nigel. Your suggestion works great. I’ve gotten reasonably good at regex, but for some reason your simple solution escaped me.

Hi @peavine.

Applying my ancient brain to this some more this morning, the givens are:

  1. The regex engine tries to match the pattern within a given range of theString.
  2. It tries to match the entire pattern.
  3. If an entire match is found, it’s returned immediately. Otherwise the engine moves on to the character after the current start character and tries again.
  4. With rangeOfString:options:, the search range is {location:0, |length|:(theString’s |length|())}. There’s also rangeOfString:options:range:, which allows a more precise search range to be specified.
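As a rough illustration of point 4, this sketch uses rangeOfString:options:range: to skip the first part of a string. The sample text and the offset of 7 are made up for the example, and there's no guard for the not-found case, which a real script would need.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Sample text invented for this illustration.
set theString to current application's NSString's stringWithString:"Item 12 costs 3.50"

-- Search only from offset 7 (the space after "12") to the end of the string.
set startOffset to 7
set searchRange to {location:startOffset, |length|:((theString's |length|()) - startOffset)}

-- 1024 is NSRegularExpressionSearch.
set theRange to theString's rangeOfString:"\\d*\\.?\\d+" options:1024 range:searchRange
set patternMatch to (theString's substringWithRange:theRange) as text
-- patternMatch should be "3.50"; a search of the whole string would find "12" first.
```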

My theory is that, with the pattern you tried, the regex engine checks the first character of theString (“A”) to see if it’s a digit. It’s not, but that’s OK because digits at the beginning of the match are optional. So the engine keeps that insertion point and goes on to the period-followed-by-digits group. This also doesn’t match the “A”, but again it’s OK because the group’s optional too.

So far, the match is successful. So if the lookbehind’s omitted, rangeOfString:options: immediately returns {location:0, |length|:0} (representing the insertion point at the beginning of theString), which substringWithRange: renders as “”.

But the lookbehind requires the insertion point at the end of the match to be preceded by a digit (i.e. the match has to end with a digit), so the engine starts again at the space after the “A” and tries to match the pattern from there. It repeats the entire process at each character up to the first digit or period, at which point it begins to match what’s actually wanted. If theString were to start with a digit, or with a period followed by a digit, the lookbehind wouldn’t be needed.
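The theory can be checked by comparing the ranges returned with and without the lookbehind. This is only a sketch using one of the test strings from the first post. The variable names are mine.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set theString to current application's NSString's stringWithString:"A Number 0.123"

-- Without the lookbehind, every part of the pattern is optional, so an
-- empty match at the very start of the string counts as a success.
set rangeWithout to theString's rangeOfString:"\\d*(\\.\\d+)?" options:1024

-- With the lookbehind, the match has to end just after a digit, so the
-- engine has to keep moving along until it reaches the "0" at offset 9.
set rangeWith to theString's rangeOfString:"\\d*(\\.\\d+)?(?<=\\d)" options:1024

set withLocation to location of rangeWith -- 9
set withLength to |length| of rangeWith -- 5, i.e. "0.123"
set withoutLength to |length| of rangeWithout -- 0, the empty match described above
```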

Now for another coffee….

Nigel. Thanks for the additional information.

I tested the patterns on the regular expressions 101 site and got the results shown below. My pattern is a bit of a disaster, causing the regex to do a great deal more work than is necessary.

What surprises me is that I’m fairly certain I tried your suggestion earlier on, but it didn’t work. I must have transposed or inserted something that resulted in this failure. I’m actually using the pattern in a shortcut and that may have been a contributing factor.

Nigel’s pattern:

Peavine’s pattern:

Or, this regex:


Hi @VikingOSX.

Yes. That’s simpler still, but not as foolproof. For instance, it could return the full stop at the end of a sentence. Or a string of them: “…”.

Or this, non-exhaustive possibility… 😉

[Screenshot 2024-10-05 at 4.33.56 PM]

The plus sign in [\d+.] isn’t a regex operator but the literal character “+”, which, like the “.”, will be matched if it occurs anywhere within a sequence of characters matching your full pattern.
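To see the effect, here is a small sketch. The full pattern from the screenshot isn't reproduced here, so the pattern below, with the class simply repeated as [\d+.]+, is an assumption made purely for illustration, as are the sample strings and the firstMatch handler.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- firstMatch is a hypothetical helper, not part of the original posts.
on firstMatch(theText, thePattern)
	set theString to current application's NSString's stringWithString:theText
	-- 1024 is NSRegularExpressionSearch.
	set theRange to theString's rangeOfString:thePattern options:1024
	if (theRange's |length|) is 0 then return ""
	return (theString's substringWithRange:theRange) as text
end firstMatch

-- Assumed pattern: inside the brackets, "+" and "." are literal characters.
set classPattern to "[\\d+.]+"

set plusMatch to firstMatch("Total: 1+2.5 apples", classPattern)
-- plusMatch should be "1+2.5", with the literal "+" swept up in the match.

set dotsMatch to firstMatch("Wait...", classPattern)
-- dotsMatch should be "...", a run of full stops with no digits at all.
```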

It’s not a bad idea to cater for the possibility of a sign though. This includes the sign if one occurs in the right place:

set thePattern to "[-+±]?\\d*\\.?\\d++"

This is similar, but omits “+” if it occurs:

set thePattern to "[-±]?\\d*\\.?\\d++"
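A quick way to compare the two sign patterns side by side is sketched below. The sample strings and the firstMatch handler are invented for the example.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- firstMatch is a hypothetical helper, not part of the original posts.
on firstMatch(theText, thePattern)
	set theString to current application's NSString's stringWithString:theText
	-- 1024 is NSRegularExpressionSearch.
	set theRange to theString's rangeOfString:thePattern options:1024
	if (theRange's |length|) is 0 then return ""
	return (theString's substringWithRange:theRange) as text
end firstMatch

set withPlus to "[-+±]?\\d*\\.?\\d++"
set withoutPlus to "[-±]?\\d*\\.?\\d++"

set keptPlus to firstMatch("Up +1.5 today", withPlus)
-- keptPlus should be "+1.5"

set droppedPlus to firstMatch("Up +1.5 today", withoutPlus)
-- droppedPlus should be "1.5": the engine fails at "+" and starts again at "1".

set keptMinus to firstMatch("Down -0.25 today", withoutPlus)
-- keptMinus should be "-0.25"
```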