Regex Pattern Refinement

I wrote a regex pattern that removes from a string every line that has a duplicate checksum in another line. This generally works well, but I wondered if there might be a refinement. One negative aspect of my current solution is that it leaves blank lines in the output string. Thanks!

use framework "Foundation"
use scripting additions

set theString to "6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/2024 08/Records/aa aa.jpg
6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/2024 09/Records/bb bb.jpg
6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/2024 10/Records/cc cc.jpg
6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/1Current/Records/dd dd.jpg
5dc6cf33d71af96b3a3175e9af264b5a /Volumes/Store/Save/1Current/Records/dd dd.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/2024 08/Records/ee ee.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/2024 09/Records/ff ff.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/2024 10/Records/gg gg.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/1Current/Records/hh hh.jpg
8xc6cf33d71af96b3a3175e9af264b8d /Volumes/Store/Save/1Current/Records/mm mm.jpg"
set theString to current application's NSString's stringWithString:theString

set noDuplicates to (theString's stringByReplacingOccurrencesOfString:"(?m)^(.+?) .+?(\\n\\1 .+$)+" withString:"" options:1024 range:{0, theString's |length|()}) as list

Hi @peavine.

set noDuplicates to (theString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++\\n)++" withString:"" options:1024 range:{0, theString's |length|()}) as list

Or possibly:

set noDuplicates to (theString's stringByReplacingOccurrencesOfString:"([^ ]{32} ).++\\n(?:\\1.++\\n)++" withString:"" options:1024 range:{0, theString's |length|()}) as list

Nigel. Thanks for the suggestions.

Both patterns work well except in the situation shown below. This can be fixed by adding a linefeed to the end of theString. I’ll work through the pattern to learn its operation.

use framework "Foundation"
use scripting additions

set theString to "6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/2024 08/Records/aa aa.jpg
6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/2024 09/Records/bb bb.jpg
6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/2024 10/Records/cc cc.jpg
6dc6cf42d71af96b3a3175e9af264b5a /Volumes/Store/Save/1Current/Records/dd dd.jpg
5dc6cf33d71af96b3a3175e9af264b5a /Volumes/Store/Save/1Current/Records/dd dd.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/2024 08/Records/ee ee.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/2024 09/Records/ff ff.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/2024 10/Records/gg gg.jpg
d1909c2aca9d213be0ac96f521afaf3b /Volumes/Store/Save/1Current/Records/hh hh.jpg"
set theString to current application's NSString's stringWithString:theString

set noDuplicates to (theString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++\\n)++" withString:"" options:1024 range:{0, theString's |length|()}) as list
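The workaround mentioned above, appending a linefeed when the text does not already end with one, might be sketched as follows. The sample data is made up for illustration, and the result is coerced to text here rather than to a list; treat this as a sketch, not a tested replacement for the script above.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical sample data: the last two lines share a checksum
-- and the text has no final linefeed.
set theString to "aaaa /one.jpg" & linefeed & "bbbb /two.jpg" & linefeed & "bbbb /three.jpg"

-- Make sure every line, including the last, ends with a linefeed.
if theString does not end with linefeed then set theString to theString & linefeed

set theString to current application's NSString's stringWithString:theString

-- The pattern from the earlier reply, which expects a linefeed after every line.
set thePattern to "([^ ]++ ).++\\n(?:\\1.++\\n)++"
set noDuplicates to (theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()}) as text
```

With this sample, only the first line (and its linefeed) should survive, because the two "bbbb" lines are matched as one run and deleted together.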

Hi peavine.

What seems to work is to make both linefeeds in the pattern optional:

set noDuplicates to (theString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n?(?:\\1.++\\n?)++" withString:"" options:1024 range:{0, theString's |length|()}) as list

Thanks Nigel. That seems to do the job. I’ll study the pattern tomorrow to understand its operation.

Hi @peavine.

Sorry. I was deceived by the line wrap in my results window. My last suggestion doesn’t work properly with your first test string after all. My latest thinking is to make the linefeeds mandatory again, but to allow the end of the text as an alternative to the second one:

set noDuplicates to (theString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, theString's |length|()}) as list

This leaves an empty line at the end of the returned text if the last line is one of those deleted and at least one of the other lines survives. The cause is the linefeed at the end of the last surviving line. I can’t think of a foolproof way around this apart from deleting the empty line afterwards if it’s there.
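Deleting the leftover empty line afterwards could be done with a second replacement that removes a linefeed only when it is the very last character. The sketch below uses made-up sample data, and the "\\n\\z" cleanup pattern is an illustration rather than something from the posts above.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical sample data: the last two lines share a checksum, so both are deleted.
set theString to "aaaa /one.jpg" & linefeed & "bbbb /two.jpg" & linefeed & "bbbb /three.jpg"
set theString to current application's NSString's stringWithString:theString

-- First pass: the pattern from the post above. The result is kept as an NSString.
set thePattern to "([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++"
set noDuplicates to theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()}

-- Second pass: remove a single linefeed at the absolute end of the text, if present.
-- \z matches only at the end of the input, so interior linefeeds are untouched.
set noDuplicates to (noDuplicates's stringByReplacingOccurrencesOfString:"\\n\\z" withString:"" options:1024 range:{0, noDuplicates's |length|()}) as text
```

If the text has no trailing linefeed after the first pass, the second pass simply finds nothing to replace.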

An alternative to “\\n” for linefeed would be “\\R” for any kind of line ending.
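As a rough illustration of the "\\R" alternative, the sketch below runs the same pattern structure against made-up text with CRLF line endings. The sample data and the variable name crlf are assumptions for the example only.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical sample data using CRLF line endings.
set crlf to return & linefeed
set theString to "aaaa /one.jpg" & crlf & "aaaa /two.jpg" & crlf & "bbbb /three.jpg"
set theString to current application's NSString's stringWithString:theString

-- Same structure as the pattern above, with \R in place of \n.
set thePattern to "([^ ]++ ).++\\R(?:\\1.++(?:\\R|$))++"
set noDuplicates to (theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()}) as text
```

The intent is that the two "aaaa" lines are removed together with their line endings, whichever kind of line ending the text uses.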

I notice your result is returned as a list. Should that be as text?

Nigel. Thanks for the revised suggestion.

The original reason for my request was a find-duplicates script I posted in the Late Night Software forum. I substituted your pattern in one of these scripts and tested it with a large folder. I then tested that result against a script that used a repeat loop instead of a regex. The returned results were identical, so your suggestion works great.
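For comparison, a repeat-loop approach might look something like the sketch below. This is not the script referred to above, only a minimal illustration with made-up data. One difference worth noting: the regex patterns in this thread only remove duplicate checksums that sit on consecutive lines, whereas this loop removes them wherever they occur.

```applescript
use scripting additions

-- Hypothetical sample data; note that the duplicate "aaaa" lines are not adjacent.
set theString to "aaaa /one.jpg" & linefeed & "bbbb /two.jpg" & linefeed & "aaaa /three.jpg" & linefeed & "cccc /four.jpg"

-- Collect the non-empty lines and the text before each line's first space (the checksum).
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to space
set theLines to {}
set theChecksums to {}
set allParagraphs to paragraphs of theString
repeat with aLine in allParagraphs
	set thisLine to contents of aLine
	if thisLine is not "" then
		set end of theLines to thisLine
		set end of theChecksums to text item 1 of thisLine
	end if
end repeat

-- Keep only the lines whose checksum occurs exactly once.
set uniqueLines to {}
considering case
	repeat with i from 1 to (count theLines)
		set thisChecksum to item i of theChecksums
		set matchCount to 0
		repeat with j from 1 to (count theChecksums)
			if item j of theChecksums is thisChecksum then set matchCount to matchCount + 1
		end repeat
		if matchCount is 1 then set end of uniqueLines to item i of theLines
	end repeat
end considering

-- Join the surviving lines with linefeeds and restore the delimiters.
set AppleScript's text item delimiters to linefeed
set noDuplicates to uniqueLines as text
set AppleScript's text item delimiters to savedDelimiters
return noDuplicates
```

With the sample above, the "bbbb" and "cccc" lines should be the only ones returned. The nested loop compares every checksum with every other, so it is only practical for modest numbers of lines.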

These scripts might be useful for finding duplicates when the number of files being processed is not too great. I’ll post them later in the Code Exchange forum.

As you surmised, I meant the coercion to be to text, not to a list.

Thanks again for your help.