Delete second instance of consecutive duplicate words from a string

peavine · August 21, 2022, 2:48pm

The script included below deletes the second instance of two consecutive duplicate words from a string. For speed and flexibility, the script utilizes ASObjC and Regular Expressions. In all of the following, the period at the end of a sentence is not part of the regular expression pattern.

When using this script, the user has to make two determinations. The first is whether case sensitivity should be considered when finding duplicate words, and the second is the exact character or characters that separate the duplicate words. The following script is case insensitive, and duplicate words are separated by one or more horizontal whitespace characters, which typically means spaces and tabs. To make the script case sensitive, simply delete (?i).

use framework "Foundation"
use scripting additions

set theString to "This is is a test 
test line. This is another Another test line."

set cleanedString to removeDuplicates(theString)

on removeDuplicates(theString)
	set thePattern to "(?i)\\b(\\w+)\\h+\\1\\b"
	set theString to current application's NSMutableString's stringWithString:theString
	(theString's replaceOccurrencesOfString:thePattern withString:"$1" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	return theString as text
end removeDuplicates

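Various changes can be made to the above script, and most of them alter the characters that separate duplicate words. The patterns below are shown as plain regular expressions; inside an AppleScript string, each backslash has to be doubled, as in the script above.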

Change thePattern in its entirety to “(?i)\b(\w+)(\h+\1\b)+”. This is the same as the above script but collapses any run of consecutive duplicate words to a single instance. Thus, to illustrate, “test test test” is changed to “test”.
Change \h+ to \h. This makes the separator exactly one horizontal whitespace character; there would normally be little reason to use this.
Change \h+ to \s+. The metacharacter \s is defined as a “white space character” and most commonly (but not exclusively) includes spaces, tabs, and line endings. This will match duplicate words across line endings, although the line ending is removed, which may or may not be desirable.
Change \h+ to \W+. The metacharacter \W is defined as “a non-word character” and is the most aggressive option when removing duplicate words. Thus, for example, it will remove duplicate words across line endings and will remove duplicate hyphenated words (e.g. “is-is” is changed to “is”).
Change thePattern in its entirety to “(?im)^(.*)(\n\1$)+”. This removes consecutive duplicate lines which have linefeed line endings.
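To make the backslash doubling concrete, here is a minimal sketch of two of the variants above written as AppleScript strings. The handler name replaceWithPattern and the sample strings are hypothetical and are not part of the original script; the replacement call itself is the same one used above.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical helper: the same replacement call as in removeDuplicates,
-- but with the pattern passed in so that variants are easy to try.
on replaceWithPattern(theString, thePattern)
	set theString to current application's NSMutableString's stringWithString:theString
	(theString's replaceOccurrencesOfString:thePattern withString:"$1" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	return theString as text
end replaceWithPattern

-- Runs of the same word: "test test test" should collapse to a single "test"
set collapsedRun to replaceWithPattern("test test test line", "(?i)\\b(\\w+)(\\h+\\1\\b)+")

-- Consecutive duplicate lines separated by linefeeds: intended to keep one copy of "line one"
set sampleLines to "line one" & linefeed & "line one" & linefeed & "line two"
set uniqueLines to replaceWithPattern(sampleLines, "(?im)^(.*)(\\n\\1$)+")

{collapsedRun, uniqueLines}
```

Every backslash in the regular expression appears twice in the AppleScript source because AppleScript treats the backslash as its own escape character inside a string.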

A final option allows the user greater control over the characters that separate duplicate words. To test this script, copy it to a script editor; place the cursor between the square brackets in the line that begins with “set thePattern”; and type a space, a tab, an underscore, and \\- (an escaped hyphen, with the backslash doubled because it sits inside an AppleScript string). Then run the script, which will return “This is a test line”.

use framework "Foundation"
use scripting additions

set theString to "This is-is a test_test line Line"

set cleanedString to removeDuplicateWords(theString)

on removeDuplicateWords(theString)
	set thePattern to "(?i)\\b(\\w+)[]\\1\\b"
	set theString to current application's NSMutableString's stringWithString:theString
	(theString's replaceOccurrencesOfString:thePattern withString:"$1" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	return theString as text
end removeDuplicateWords
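For reference, this is a sketch of how the script might look once the brackets have been filled in as described. It assumes the tab is written as the regular expression escape \t (typed as \\t in AppleScript) instead of a literal tab character, which should be equivalent here.

```applescript
use framework "Foundation"
use scripting additions

set theString to "This is-is a test_test line Line"

set cleanedString to removeDuplicateWords(theString)

on removeDuplicateWords(theString)
	-- Separator class: exactly one space, tab, underscore, or hyphen
	set thePattern to "(?i)\\b(\\w+)[ \\t_\\-]\\1\\b"
	set theString to current application's NSMutableString's stringWithString:theString
	(theString's replaceOccurrencesOfString:thePattern withString:"$1" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	return theString as text
end removeDuplicateWords
```

Because the bracket expression has no quantifier, it matches a single separator character. Adding + after the closing bracket would allow a run of separator characters between the duplicate words.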

As written, the above script returns the cleaned string only. However, changing the third line of the handler as shown below will set the replacementCount variable to the number of duplicate replacement actions taken. This number can then be returned along with the cleaned string.

set replacementCount to (theString's replaceOccurrencesOfString:thePattern withString:"$1" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
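One possible way to return the count along with the cleaned string is to return a record, as in the sketch below. The handler name removeDuplicatesWithCount and the record labels cleanedText and replacements are hypothetical choices for illustration; the sketch relies on replaceOccurrencesOfString:withString:options:range: returning the number of replacements made.

```applescript
use framework "Foundation"
use scripting additions

set theResult to removeDuplicatesWithCount("This is is a test test line.")
set cleanedString to cleanedText of theResult
set duplicateCount to replacements of theResult

on removeDuplicatesWithCount(theString)
	set thePattern to "(?i)\\b(\\w+)\\h+\\1\\b"
	set theString to current application's NSMutableString's stringWithString:theString
	-- The method returns the number of replacements it made
	set replacementCount to (theString's replaceOccurrencesOfString:thePattern withString:"$1" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	return {cleanedText:(theString as text), replacements:(replacementCount as integer)}
end removeDuplicatesWithCount
```

A list such as {theString as text, replacementCount} would work just as well; a record simply makes the two values easier to tell apart at the call site.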

An alternative to the above is to use Script Debugger. To try this, enable Script Debugger’s find option, put a check mark by RegEx, type (?i)\b(\w+)\h+\1\b in the find field, type $1 in the replace field, and click the right arrow. You can then decide whether to replace all duplicate words at once or to view and replace them one by one.