Well, I can explain how regular expressions work, which should make the code above easier to understand. A regular expression is nothing more than a description of a character sequence: a prediction of what a string will look like. With repetitions and boundaries you describe how the string you are after is built up. It’s an advanced search utility.
When you search for an exact match, regular expressions are simple as well. If you search for ‘Hello world’, the regular expression will only do an exact match, like any other basic search. Because the match is done on byte level and not on character level, matches are case sensitive. That means that if the given string contains ‘hello world’, there is no match. Some tools, like grep, have an option to make the expression case insensitive.
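To make the case sensitivity concrete, here is a minimal sketch. It uses Python’s re module instead of grep (my choice for illustration, not something from the code above), and the sample sentences are made up. re.IGNORECASE plays the role of a case-insensitive option.

```python
import re

pattern = "Hello world"

# A plain pattern is an exact, case-sensitive search.
exact_hit = re.search(pattern, "Say Hello world to everyone") is not None
exact_miss = re.search(pattern, "say hello world to everyone") is not None

# With the case-insensitive flag the lowercase text matches too.
insensitive_hit = re.search(
    pattern, "say hello world to everyone", re.IGNORECASE
) is not None

print(exact_hit, exact_miss, insensitive_hit)
```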
When a string starts at the beginning of a sentence, its first character is capitalized. To catch these strings as well, we can define character ranges. In the previous example we talked about an exact match, but basically every single character is a character range with only 1 character in it. Character ranges are defined within brackets ([ and ]). So ‘Hello world’ is exactly the same as ‘[H][e][l][l][o][ ][w][o][r][l][d]’, but that is almost unreadable; single characters shouldn’t be surrounded by brackets. To make a regular expression accept more than one first character we could write ‘[Hh]ello world’. Now it will match whether or not the string is at the beginning of a sentence, because we said that the first character of the match can be H or h.
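A small sketch of the bracket idea, again in Python’s re module as an assumption on my side (the sample strings are invented). It checks that the fully bracketed form behaves like the plain one and that ‘[Hh]’ accepts both spellings.

```python
import re

plain = re.compile(r"Hello world")
bracketed = re.compile(r"[H][e][l][l][o][ ][w][o][r][l][d]")
either_case = re.compile(r"[Hh]ello world")

# Every single character is a range of one, so both forms agree.
same_result = bool(plain.fullmatch("Hello world")) == bool(
    bracketed.fullmatch("Hello world")
)

# [Hh] allows H or h as first character, and nothing else.
samples = ["Hello world", "hello world", "Jello world"]
results = [bool(either_case.search(s)) for s in samples]

print(same_result, results)
```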
[0-9], which is the same as [0123456789], is a character range where the byte values (according to the ASCII table) follow each other. The hyphen means: take the byte value of the character left of the hyphen, the byte value of the character right of the hyphen, and all byte values in between. Likewise [a-z] is the same as [abcdefghijklmnopqrstuvwxyz]. There are no commas inside the brackets; I assume McUsr used pseudo code.
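If you want to see for yourself that a range and its written-out form are the same, something like this sketch works. It is Python rather than grep (an assumption for illustration) and simply tries every printable ASCII character against each pattern.

```python
import re
import string

digits_range = re.compile(r"[0-9]")
digits_listed = re.compile(r"[0123456789]")
lower_range = re.compile(r"[a-z]")

# Try every printable ASCII character against each range.
printable = string.printable
matched_by_range = [c for c in printable if digits_range.fullmatch(c)]
matched_by_list = [c for c in printable if digits_listed.fullmatch(c)]
lower_matches = [c for c in printable if lower_range.fullmatch(c)]

print("".join(matched_by_range))
print("".join(lower_matches))
```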
Then there are some macros that you can use. For instance [:digit:] stands for 0-9, so [[:digit:]] is the same as [0-9] which is the same as [0123456789]; these names only work inside a pair of brackets. Then there is also [:alnum:], and [[:alnum:]] is the same as [0-9a-zA-Z]. There are also shorter macros, \b for example. This short macro means word boundary: the position where a word ends or begins, so there must be a non-word character (or the start or end of the string) next to it. Word characters are letters, digits and the underscore, so it is roughly like requiring [^0-9a-zA-Z_] on that side, where the caret means "not", except that \b does not consume the character. Another often used macro is the period; it means any character is allowed. When using the macro \b with our example we’re saying that we only want it as a complete word. For instance, with ‘[Hh]ello world’ we also match a string like ‘hello worlds’; when we wrap our expression in word boundaries we avoid this. So our expression should look something like ‘\b[Hh]ello world\b’ (NOTE: in AppleScript we need to write \\b because \ has a special meaning in AppleScript strings, not because of the regular expression)
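Here is a rough sketch of what the word boundary changes, written in Python’s re module (my assumption for illustration; Python has no [:digit:] style names, so only \b and the period are shown). The sample strings are invented.

```python
import re

loose = re.compile(r"[Hh]ello world")
strict = re.compile(r"\b[Hh]ello world\b")

# Without boundaries the pattern also matches inside "hello worlds".
loose_hit = bool(loose.search("hello worlds"))
# With \b the trailing "s" prevents a match.
strict_hit = bool(strict.search("hello worlds"))

# \b only marks a position: the comma, space and "!" are not part of the match.
found = strict.search("Oh, hello world!").group()

# The period accepts any single character in that position.
dot_hit = bool(re.fullmatch(r"h.llo", "hxllo"))

print(loose_hit, strict_hit, found, dot_hit)
```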
Finally we have to define how many times our ranges match. By default, of course, each one matches exactly 1 time. That’s why we can do exact matches. But we can also define optional matches, or multiple matches with static or variable lengths (repetitions). There are *, ?, +, {n}, {n,} and {n,m}, which mean (copied from the grep man page):
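? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.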
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more than m times.
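The repetition operators are easiest to understand by trying them. This is a minimal sketch in Python’s re module (an assumption, not grep itself); the pattern ‘ab…c’ and the helper name matches are made up purely for the demonstration.

```python
import re

def matches(pattern, text):
    """Return True when the whole text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# ? : zero or one b
optional = [matches(r"ab?c", s) for s in ("ac", "abc", "abbc")]
# * : zero or more b
star = [matches(r"ab*c", s) for s in ("ac", "abc", "abbbc")]
# + : one or more b
plus = [matches(r"ab+c", s) for s in ("ac", "abc", "abbbc")]
# {2} : exactly two b
exactly = [matches(r"ab{2}c", s) for s in ("abc", "abbc", "abbbc")]
# {2,} : two or more b
at_least = [matches(r"ab{2,}c", s) for s in ("abc", "abbc", "abbbbc")]
# {2,3} : two or three b
between = [matches(r"ab{2,3}c", s) for s in ("abc", "abbc", "abbbc", "abbbbc")]

for name, result in [("?", optional), ("*", star), ("+", plus),
                     ("{2}", exactly), ("{2,}", at_least), ("{2,3}", between)]:
    print(name, result)
```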
It’s a bit of overkill, but our hello world expression has a repetition in it, the double l. So to make sure there are two l’s in hello we could write ‘\b[Hh]el{2}o world\b’. Because l is a single character we don’t need the brackets. Defining repetitions sometimes makes your expression more readable. When I want the numbers from a text that are 8 digits long, I could write an expression like ‘[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]’ or I could write ‘[0-9]{8}’, which is obviously easier to read.
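As a quick check that the long and the short notation really find the same thing, here is a sketch in Python’s re module (again an assumption for illustration); the sample text is invented.

```python
import re

long_form = re.compile(r"[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]")
short_form = re.compile(r"[0-9]{8}")

text = "Order 12345678 shipped, ref 987."

# Both notations find the same 8-digit number; "987" is too short.
long_found = long_form.findall(text)
short_found = short_form.findall(text)

# l{2} is just another way of writing ll.
hello_hit = bool(re.search(r"\b[Hh]el{2}o world\b", "Well, hello world."))

print(long_found, short_found, hello_hit)
```

Keep in mind that ‘[0-9]{8}’ on its own also matches the first 8 digits of a longer number, so wrap it in \b if you only want numbers of exactly that length.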
Basically there are two things to remember. 1) Every character that is not surrounded by brackets should be considered a character range of only 1 character; everything is a character range. 2) Define the number of repetitions if it’s not 1.
An expression like ‘rapdigital’ is just the same as ‘[r]{1}[a]{1}[p]{1}[d]{1}[i]{1}[g]{1}[i]{1}[t]{1}[a]{1}[l]{1}’.
Now that we understand the basics of regular expressions, back to your question:
'\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b'
Pseudo code:
\\b (\b) : word boundary
[0-9] : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{4} : Repeat the preceding match 4 times
_ : equal to [_] only underscore allowed, no repetition; 1 match
[0-9] : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{2} : Repeat the preceding match 2 times
_ : equal to [_] only underscore allowed, no repetition; 1 match
[0-9] : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{3} : Repeat the preceding match 3 times
_ : equal to [_] only underscore allowed, no repetition; 1 match
[0-9] : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{3} : Repeat the preceding match 3 times
\\b (\b) : word boundary
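To try the breakdown above outside AppleScript, a sketch in Python’s re module could look like this. Python is my assumption here, and the file names in the sample text are hypothetical. In a Python raw string a single backslash is enough, which shows again that the double backslash belongs to AppleScript and not to the expression.

```python
import re

# Same expression as above, written with single backslashes in a raw string.
pattern = re.compile(r"\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\b")

# Hypothetical sample text: only the first candidate has the exact 4_2_3_3 shape.
text = (
    "2012_03_045_678.jpg "   # matches: boundary before 2 and before the dot
    "12012_03_045_678 "      # five leading digits, no boundary inside the number
    "2012_03_045_6789 "      # four trailing digits, no boundary after 678
    "x2012_03_045_678"       # a letter in front, so no boundary before 2
)

found = pattern.findall(text)
print(found)
```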