I have a list of up to 100 strings, and would like to determine whether every string in the list is unique.
I could use a brute force method of starting with the first item in the list and comparing it to every other in a loop, then move on to the second, etc.
AppleScript’s ‘contains’ or ‘is in’ operators can do the “comparing with every other” for you:
set listOfStrings to paragraphs 1 thru 100 of (read (choose file))
set end of listOfStrings to some item of listOfStrings -- Deliberate duplicate for testing.
set restOfStrings to listOfStrings
repeat with i from 1 to (count listOfStrings) - 1
set str to item i of listOfStrings
set restOfStrings to rest of restOfStrings
if (restOfStrings contains str) then display dialog "The string \"" & str & "\" is repeated."
end repeat
Or the other way round:
set listOfStrings to paragraphs 1 thru 100 of (read (choose file))
set end of listOfStrings to some item of listOfStrings -- Deliberate duplicate for testing.
set previousStrings to {}
set beginning of previousStrings to item 1 of listOfStrings
repeat with i from 2 to (count listOfStrings)
set str to item i of listOfStrings
if (previousStrings contains str) then display dialog "The string \"" & str & "\" is repeated."
set end of previousStrings to str
end repeat
To sort the list and remove duplicates (what sortlist does), you can use a shell command as well, so you don’t need the Satimage osax.
set theList to {"a", "hello", "goodbye", "a", "another string", "fake text", "fake", "goodbye"}
set AppleScript's text item delimiters to character id 10
set theString to theList as string
set AppleScript's text item delimiters to ""
set sortedList to every paragraph of (do shell script "/bin/echo -n " & quoted form of theString & "| sort -u")
set listIsUnique to length of theList = length of sortedList
Hey, guys. I didn’t actually test Nigel’s first method, but I’ve previously looked into the ‘rest of’ property; it can actually be slower than a standard repeat loop, although the difference might not be noticeable on such a small list.
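For comparison, here is a minimal sketch of what a plain index-based loop might look like, using neither ‘rest of’ nor ‘contains’. The handler name firstRepeatedString is made up for illustration, and no timings are claimed for it; it only shows the shape of the brute-force approach.

```applescript
-- Hypothetical helper (name invented for this sketch).
-- Returns the first string that occurs again later in the list,
-- or missing value when every string is unique.
on firstRepeatedString(theList)
	set n to count theList
	repeat with i from 1 to n - 1
		set str to item i of theList
		-- Compare item i only with the items that follow it.
		repeat with j from i + 1 to n
			if (item j of theList) is str then return str
		end repeat
	end repeat
	return missing value
end firstRepeatedString

firstRepeatedString({"a", "b", "c", "b"}) --> "b"
```

Note that AppleScript ignores case in string comparisons by default; wrap the comparison in a ‘considering case’ block if "abc" and "ABC" should count as different strings.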
Shane is right about that. The most advanced operating system has the worst locale support of all the *nix variants. Still, we can use another collation to get close to our expectations. For me, the following code works well enough with sort:
set theList to {"a", "hello", "goodbye", "a", "another string", "änother string", "ä", "fake text", "fake", "goodbye"}
set AppleScript's text item delimiters to character id 10
set theString to theList as string
set AppleScript's text item delimiters to ""
set sortedList to (do shell script "/bin/echo -n " & quoted form of theString & " | LC_ALL=nl_NL.ISO8859-1 sort -u")
I used the Dutch (nl_NL) locale here because it ends up using the standard Latin la_LN.ISO8859-1 collation rather than the default la_LN.US-ASCII collation.
Have to say that an old-fashioned AppleScript handler will do the job (without sorting, which wasn’t part of the question …) pretty fast too …
set theList to {"a", "hello", "goodbye", "a", "another string", "änother string", "ä", "fake text", "fake", "goodbye"}
set theList to my uniqueArray(theList)
on uniqueArray(ain)
set aout to {}
repeat with i from 1 to count of ain
set theItem to item i of ain
if aout does not contain theItem then set end of aout to theItem
end repeat
return aout
end uniqueArray
While the object of this thread is to “determine whether every string in the list is unique”, it’s the removal of duplicates rather than the actual sort order which is relevant to the sort-removing-duplicates methods here.
Another thing is that the Unix sort method won’t work correctly with strings containing more than one paragraph.
Indeed, Nigel, we did get a little carried away, and multi-paragraph strings are by default a problem with sort. The workaround takes so much effort that the whole process won’t be faster than a simple AppleScript loop.
Actually I merely want to alert the script user that there are duplicate strings in the list I’m checking. All the entries in the list need to remain intact.
This is part of a larger script that creates a report detailing the contents of a compact disc master prior to the master being delivered to the factory.
Every track of a CD has a unique identifier code embedded in the data stream. It’s very easy at the mastering stage to duplicate an ID code. The ID codes are a 12 character string in this format: USPR37300012.
The full script extracts track titles, composers, artists and ID codes for proof reading before the CD makes it out into the world. Also a little error checking thrown in for good measure.
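A minimal sketch of how the duplicated ID codes could be collected for an alert while the original list stays untouched. The handler name duplicatedStrings and the second sample code are invented for illustration; only USPR37300012 comes from the post above.

```applescript
-- Hypothetical helper (name invented for this sketch).
-- Returns every string that occurs more than once, each listed once.
-- theList is only read, never modified.
on duplicatedStrings(theList)
	set seenStrings to {}
	set duplicates to {}
	repeat with i from 1 to (count theList)
		set str to item i of theList
		if seenStrings contains str then
			-- Already met this string: record it, but only once.
			if duplicates does not contain str then set end of duplicates to str
		else
			set end of seenStrings to str
		end if
	end repeat
	return duplicates
end duplicatedStrings

-- Sample data; "USPR37300013" is made up.
set idCodes to {"USPR37300012", "USPR37300013", "USPR37300012"}
set duplicates to duplicatedStrings(idCodes) --> {"USPR37300012"}
```

If the result is not an empty list, its items can be joined with text item delimiters and passed to display dialog to warn the user.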
Create a list with unique items and compare its length with the original (like in my first post). If the lengths of the two lists are not equal, it means that there are duplicate items in the list.
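In plain AppleScript, without the shell, that length comparison might be sketched as follows. The handler name allStringsUnique is an assumption made for this example; the original list is left as it is.

```applescript
-- Hypothetical helper (name invented for this sketch).
-- Builds a list of the distinct strings and compares its length
-- with the length of the original list.
on allStringsUnique(theList)
	set uniqueItems to {}
	repeat with i from 1 to (count theList)
		set str to item i of theList
		if uniqueItems does not contain str then set end of uniqueItems to str
	end repeat
	return (count uniqueItems) = (count theList)
end allStringsUnique

allStringsUnique({"a", "hello", "goodbye", "a"}) --> false
```

Since ‘contains’ ignores case by default, "abc" and "ABC" would be treated as duplicates here unless the test is wrapped in a ‘considering case’ block.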