Removing duplicate characters from a string

First Code Exchange post in a while. I just wanted to share an efficient method of removing duplicate characters from a string, a task that arises on occasion. It uses AppleScript’s text item delimiters property to split the string into individual characters, and ASObjC’s NSOrderedSet class to remove duplicates while preserving character order. The example demonstrates that it works for Unicode characters outside the ASCII range and in a case-sensitive manner:

use framework "Cocoa"
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
set {sampleStringWithDuplicatesRemoved, AppleScript's text item delimiters} to {((current application's NSOrderedSet's orderedSetWithArray:(sampleString's text items))'s array()) as list as text, tid}
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Hi bmose.

It can also be done without AppleScript’s text item delimiters:

use framework "Foundation"
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set sampleStringWithDuplicatesRemoved to ((current application's NSOrderedSet's orderedSetWithArray:(sampleString's characters))'s array()'s componentsJoinedByString:("")) as text
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Hi, Nigel. That’s even more terse. Thanks!

Just curious about a fundamental AppleScript matter. Is there any scenario where a string split into text items with an empty string delimiter will differ from the characters returned by the string’s characters property? Or are they processed under the hood in an identical or functionally identical manner?

Here is a native AppleScript version…

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set tid to text item delimiters
set text item delimiters to ""
set i to 1
considering case
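	repeat while i < length of sampleString
		set c to text i of sampleString
		-- split on c, which removes every occurrence of it
		set text item delimiters to c
		set sampleString to (text items of sampleString)
		-- put the first occurrence back at its original position
		tell sampleString to set item 1 to item 1 & c
		-- rejoin the pieces with nothing between them
		set text item delimiters to ""
		set sampleString to sampleString as text
		set i to i + 1
	end repeat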
end considering
set text item delimiters to tid
---------------------------------------------------------------
sampleString --> "3Xw✔✓°¦WΦ"

or a version without text item delimiters…

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set sampleStringWithDuplicatesRemoved to ""
considering case
	repeat with i from 1 to length of sampleString
		set c to text i of sampleString
		if c is not in sampleStringWithDuplicatesRemoved then set sampleStringWithDuplicatesRemoved to sampleStringWithDuplicatesRemoved & c
	end repeat
end considering
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Hi bmose. I can’t think of any offhand. :thinking:

I’ve never seen a difference in the characters returned either.
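
For anyone who wants to check a particular string, here is a minimal sketch that compares the two approaches directly. The variable names (viaTextItems, viaCharacters, listsMatch) are made up for illustration, and a match on one sample string does not prove the two are always equivalent.

```applescript
set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"

-- Split via text items with an empty delimiter, restoring the old delimiters afterwards
set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
set viaTextItems to sampleString's text items
set AppleScript's text item delimiters to tid

-- Split via the characters property
set viaCharacters to sampleString's characters

-- Compare the two lists case-sensitively, as the other scripts in this thread do
considering case
	set listsMatch to (viaTextItems = viaCharacters)
end considering
listsMatch
```

Swapping in other test strings (combining accents, emoji, line breaks) would be the obvious way to look for an edge case.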

This is based on Robert’s delimiter script, but keeps the unique characters aside while winding down the sample string, which may be a tad more efficient. Not that the difference will be noticeable in practice! :wink:

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set uniques to {}
set tid to text item delimiters
considering case
	repeat while ((count sampleString) > 1)
		set text item delimiters to character 1 of sampleString
		set end of uniques to text item delimiters
		set sampleString to (text items of sampleString)
		set text item delimiters to ""
		set sampleString to sampleString as text
	end repeat
end considering
set end of uniques to sampleString
set sampleStringWithDuplicatesRemoved to uniques as text
set text item delimiters to tid
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Does anyone know of a regex pattern that will work in the following script to remove both consecutive and non-consecutive duplicate characters? As written, only consecutive duplicate characters are removed. Thanks!

use framework "Foundation"
use scripting additions

set theString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
set theString to current application's NSString's stringWithString:theString
set regexPattern to "(.)\\1+"
set cleanedString to (theString's stringByReplacingOccurrencesOfString:regexPattern withString:"$1" options:1024 range:{0, theString's |length|()}) as text -->"3Xw3✔✓°¦✓wWΦ3X"

Here is a regular expression solution, condensed into a single line:

tell (current application's NSString's stringWithString:(sampleString's characters's reverse as text)) to set sampleStringWithDuplicatesRemoved to ((its stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, its |length|()}) as text)'s characters's reverse as text

It works by first reversing the characters in the string:

sampleString's characters's reverse as text

so that a look-ahead assertion can be used instead of a look-behind assertion: a look-ahead pattern may match a string of unbounded length, whereas a look-behind pattern may not. The regex pattern steps through each character in the reversed string, capturing it as group #1:

(.)

It deletes that character via:

withString:""

if the look-ahead pattern:

(?=.*?\1)

asserts that the character is followed by zero or more characters (matched lazily, so as few as possible are consumed):

.*?

followed by the character itself:

\1

Finally, it reverses the characters in the resulting string:

…as text)'s characters's reverse as text

The result is the original string with duplicate characters removed.

I don’t know if the same result can be achieved using a look-behind assertion.

Incidentally, I tested all five solutions – my original ASObjC solution, Nigel’s more condensed version of the ASObjC solution, robertfern’s two AppleScript solutions, and the regex solution. For the simple input string “3Xwww3✔✓°¦¦✓wWWΦ3X”, they all execute very rapidly – on the order of 0.0001 seconds on my machine – and I could not detect a difference among them. I suspect that differences might emerge with very long input strings.
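
To probe that suspicion, one could time a solution against a much longer input. The following is only a rough sketch, not a proper benchmark: the doubling count of 12 is an arbitrary choice, the names longString, startDate and elapsedSeconds are invented for illustration, and it times only the condensed ASObjC version. The other scripts could be dropped into the same slot.

```applescript
use framework "Foundation"
use scripting additions

-- Build a long test string by repeated doubling (18 characters x 2^12)
set longString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
repeat 12 times
	set longString to longString & longString
end repeat

-- Record the start time, run the solution under test, then read the elapsed time
set startDate to current application's NSDate's |date|()
set deduped to ((current application's NSOrderedSet's orderedSetWithArray:(longString's characters))'s array()'s componentsJoinedByString:"") as text
-- timeIntervalSinceNow is negative for a date in the past, hence the sign flip
set elapsedSeconds to -(startDate's timeIntervalSinceNow())

{deduped, elapsedSeconds}
```

A single run is noisy, so averaging several runs per solution would give a fairer comparison.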

Thanks bmose. That works great.

Just to ensure I understood the operation of your script, I rewrote it in a slightly different format. All seems to work correctly.

use framework "Foundation"
use scripting additions

set theString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
set reversedString to (reverse of (characters of theString)) as text
set reversedString to current application's NSString's stringWithString:reversedString
set cleanedString to (reversedString's stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:1024 range:{0, reversedString's |length|()}) as text -- option 1024 is NSRegularExpressionSearch
set cleanedString to (reverse of (characters of cleanedString)) as text -->"3Xw✔✓°¦WΦ"

:sunglasses: !!!

But don’t forget to set the TIDs for the list-to-text coercions. :wink:

As Nigel points out, the regex solution, specifically the two as text coercions which transform lists of characters into strings, will work properly only if text item delimiters are set to the empty string. This is usually the case, but better not to assume that it is so:

set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
tell (current application's NSString's stringWithString:(sampleString's characters's reverse as text)) to set sampleStringWithDuplicatesRemoved to ((its stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, its |length|()}) as text)'s characters's reverse as text
set AppleScript's text item delimiters to tid
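
Another way to avoid depending on the delimiters at all is to do the joining on the Objective-C side, as in the condensed ASObjC version earlier in the thread. A minimal sketch wrapped in a handler follows. The handler name removeDuplicateCharacters is hypothetical, chosen here only for illustration.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical handler: returns theText with duplicate characters removed,
-- keeping the first occurrence of each character (case-sensitive)
on removeDuplicateCharacters(theText)
	-- NSOrderedSet drops repeats while preserving first-seen order
	set uniqueCharacters to (current application's NSOrderedSet's orderedSetWithArray:(theText's characters))'s array()
	-- componentsJoinedByString: joins without consulting text item delimiters
	return (uniqueCharacters's componentsJoinedByString:"") as text
end removeDuplicateCharacters

removeDuplicateCharacters("3Xwww3✔✓°¦¦✓wWWΦ3X") --> "3Xw✔✓°¦WΦ"
```

Because no list-to-text coercion is involved, the result should not change with whatever delimiters the calling script happens to have set.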