Remove Duplicates from a List?

You can use this:

set x to {"a@a.e", "b@b.c", "d@a.h", "a@a.e"}

addr(x) --> {"a@a.e", "b@b.c", "d@a.h"}

to addr(l)
	script foo
		property foo2 : l
		property okAddresses : {}
	end script
	
	considering case
		repeat with i from 1 to count (foo's foo2)
			set x to ((foo's foo2)'s item i)
			if x is in foo's okAddresses then
			else
				set end of foo's okAddresses to x
			end if
		end repeat
	end considering
	foo's okAddresses
end addr
  • Acceleration techniques learnt from Nigel Garvey; they save about 1 second here on a 1000-item list

WHOA. that’s pretty insane! works perfect too.
much thanks to you and Nigel. i learned a lot just from reading this script.

Very cool !

Now why do you use a script object:


script foo 
      property foo2 : l 
      property okAddresses : {} 
end script 

instead of ordinary lists?

How does this speed things up?

I just believe :lol:
If you are very interested, you can enable/disable the following speed-techniques:
-Accessing list items in script objects.
-Using a “considering case” statement.
-Use “if x then[nothing]else[something]” instead of “if not x then[something]” (not very relevant here, I think, but just in case).
The “regular” version is 4 times slower in my tests:

to addr2(l)
	set okAddresses to {}
	repeat with i from 1 to count l
		set x to (l's item i)
		if x is not in okAddresses then set end of okAddresses to x
	end repeat
	okAddresses
end addr2

For some reason known only to the people at Apple, access to the items in an AppleScript list is dramatically faster if the list variable is “referenced” rather than simply being — er — used. A “reference” is a descriptive phrase like ‘my longList’ or ‘foo2 of foo’.

The only kinds of variable that can be referenced are globals and properties. Local variables can’t be referenced, so if you want the speed advantage inside a handler, you either have to use a global variable (generally considered to be bad practice) or set up a script object with its own list properties. It’s not the script object itself that speeds things up. It’s the fact that it enables the use of references in the handler.
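
A minimal sketch of the ‘my longList’ form at the top level of a script, for comparison with the script-object version above. The property name longList and the 1000-item test data are assumptions made up for illustration; no timings are claimed here.

```applescript
-- Hypothetical top-level property; 'my longList' is a reference to it.
property longList : {}

-- Reset and fill the list, always addressing it through the reference.
set my longList to {}
repeat with i from 1 to 1000
	set end of my longList to i
end repeat

-- Read the items back through the same referenced form.
set total to 0
repeat with i from 1 to (count my longList)
	set total to total + (item i of my longList)
end repeat
total --> 500500
```

Inside a handler the same idea needs a script object with its own property, as in addr(), because the handler’s local variables can’t be referenced.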

By default, AppleScript ignores case when comparing strings, but considers everything else. ‘Ignoring’ involves more work than ‘considering’, because AppleScript has to recognise some characters as being the same when in fact they’re not. If you know beforehand that the strings will have the same case — or that one or both of them will have no case at all, as with numerals and punctuation — it makes sense to use ‘considering case’ to reduce the amount of background work. I personally wouldn’t use it when comparing e-mail addresses, but I don’t know the circumstances of the dupes in Muad’Dib’s list.
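
A small sketch of what ‘considering case’ changes, using two made-up addresses that differ only in case. It suggests why the speed-up can be a poor fit for e-mail addresses: the two strings stop counting as duplicates.

```applescript
-- Default behaviour: case is ignored, so these compare as equal.
set defaultResult to ("A@a.e" is "a@a.e")

-- With 'considering case' the same comparison fails.
considering case
	set strictResult to ("A@a.e" is "a@a.e")
end considering

{defaultResult, strictResult} --> {true, false}
```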

This is one of my sillier ideas. I don’t know if I like it or not. :wink: The time it saves is only significant in intensely repetitive situations.

Very helpful.

In case anyone comes across the same problem I was having…

I have lists, where each entry of the list is a record:

set tabGoods to {{|style|:"g185", colorCode:"51", colorName:"black"}, {|style|:"g185", colorCode:"51", colorName:"black"}}

And these functions to remove duplicates did NOT work on it. I don’t know why, but they spit the list back out again with the duplicates still in place.

I’m not sure why these don’t work and what I ended up writing does work, but just in case this is useful to anyone, this removes duplicates even when the list items are records:

on removeDuplicateRecords(inputList)
	set itemCount to count of items in inputList
	set outputList to {}
	repeat with anItem from 1 to itemCount
		set firstListItem to item anItem of inputList
		set occurrenceCount to 0
		repeat with anotherItem from 1 to count of items in outputList
			set secondListItem to item anotherItem of outputList
			if firstListItem is secondListItem then set occurrenceCount to occurrenceCount + 1
		end repeat
		if occurrenceCount = 0 then copy firstListItem to end of outputList
	end repeat
	
	return outputList
end removeDuplicateRecords

This could be painfully slow for large sets of records, I really don’t know. My lists have at most maybe 10-20 records, so it’s not significant. The longest lists I ran it on, it took 127 milliseconds, so it’s not stressing me out, :slight_smile: but I’m guessing from that time that it would not scale well to thousands… but at least it works for records.

  • t.spoon

Hi t.spoon.

The problem with the original script (apart from the fact that it no longer opens correctly in Script Editor!) is this line:

if x is in foo's okAddresses then

It tends to get written this way because it works with simple objects like strings and numbers. But the correct formulation when using ‘is in’ or ‘contains’ with a list of items is:

if {x} is in foo's okAddresses then

Notice the braces round ‘x’. The reason for them is that the code’s notionally looking for a section of the list, not, as we think of it, for an item in the list.

set tabGoods to {{|style|:"g185", colorCode:"51", colorName:"black"}, {|style|:"g185", colorCode:"51", colorName:"black"}}

tabGoods contains {|style|:"g185", colorCode:"51", colorName:"black"}
--> false

tabGoods contains {{|style|:"g185", colorCode:"51", colorName:"black"}}
--> true

tabGoods contains {{|style|:"g185", colorCode:"51", colorName:"black"}, {|style|:"g185", colorCode:"51", colorName:"black"}}
--> true

The same’s notionally true with text:

"Hell" is in "Hello"
--> true, not because "Hello"'s a container containing "Hell", but because "Hell"'s a subsection of it.

I think the reason you can get away without the braces when checking for a text or number in the list is that in AppleScript, a single item is automatically coercible to a list containing that item, so we don’t have to think about it. But when the item’s already a list or a record, the coercion to list takes on a different meaning. In these cases, we have to be explicit with the braces. But using braces is actually correct in any case.

Hope this makes sense. :slight_smile:

Edit: Yes. I was right about items being coerced to lists. Here’s a short demo:

set tabGoods to {"g185", "51", "black", "g185", "51", "black"} -- Now a list of texts.

"g185" is in tabGoods
--> true, because "g185" is automatically coerced to {"g185"} (a text to list coercion) for the check.

{|style|:"g185", colorCode:"51", colorName:"black"} is in tabGoods
--> true, because the record is coerced to {"g185", "51", "black"} (a record to list coercion) for the check.
--> This is just to demonstrate that the coercion takes place. A record to list coercion should never be relied upon to produce a list with items in a particular order.
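
Putting the braces into the original handler might look like the sketch below. The handler name removeDuplicates and the property names theItems and uniqueItems are made up for illustration, and the ‘considering case’ block is left out here, so strings are compared with AppleScript’s default case-insensitivity.

```applescript
to removeDuplicates(l)
	-- Script object so the lists can be accessed by reference (the speed trick).
	script foo
		property theItems : l
		property uniqueItems : {}
	end script
	
	repeat with i from 1 to (count foo's theItems)
		set x to item i of foo's theItems
		-- Wrap x in braces so that a record or list is matched as one item.
		if {x} is not in foo's uniqueItems then set end of foo's uniqueItems to x
	end repeat
	return foo's uniqueItems
end removeDuplicates

set tabGoods to {{|style|:"g185", colorCode:"51", colorName:"black"}, {|style|:"g185", colorCode:"51", colorName:"black"}}
removeDuplicates(tabGoods)
--> {{|style|:"g185", colorCode:"51", colorName:"black"}}
```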

Hello Nigel.

That was a brilliant explanation.

It made a lot of sense to me.

Thanks

Some extra to Nigel’s explanation:


{2, 3} is in {1, 2, 3, 4, 5} --> true
{2, 4} is in {1, 2, 3, 4, 5} --> false

The first line returns true and the second false because {2, 3} is a contiguous section of the list while {2, 4} is not, even though each of its values appears in the list.

I didn’t know whether records are coerced to lists before being compared, but I did know that records only compare values. A presumable reason is that a scripting addition, for instance, can interfere with the comparison (read: a user-defined key turns into an enumerated key). Technically there is a difference between a record with user-defined keys and one with enumerated keys. A record with user-defined keys is actually a record containing one key (usrf) whose value is a list holding all the keys and values: the odd indexes are the keys, as normal AppleScript strings, each followed by its value. A record with enumerated keys is not stored that way. So when comparing, it’s better to compare only the values at their associated indexes, which results in the same behavior as coercing to a list before comparing.

To make it easier to understand: a record like the one in Nigel’s example is actually stored as:

Then, when a scripting addition is installed or another script library is loaded into global scope, and colorCode and colorName are enumerated as the codes ccod and cnam respectively, the record would look like:

Both records would be presented the same way in Script Editor, except for some syntax highlighting. If the records were compared including their keys, they would not match. But when only the values are compared, they will.

[offtopic]This is also why it’s important to use pipes around keys in records when using AppleScriptObjC, so you don’t send an enumerated key by accident[/offtopic]

I didn’t know that either. I did know, however, that I could coerce a record to a list on a one-by-one basis. What I didn’t know, or didn’t think of, was that I could coerce it with {} so I could use an “is in” expression. :wink:

Great in-depth look at lists! :slight_smile:

The same principle applies if you’re obliged to concatenate something to a list:

set aList to {"a", "b", "c"}

aList & {|style|:"g185", colorCode:"51", colorName:"black"}
--> {"a", "b", "c", "g185", "51", "black"}

aList & {{|style|:"g185", colorCode:"51", colorName:"black"}}
--> {"a", "b", "c", {|style|:"g185", colorCode:"51", colorName:"black"}}

aList & {1, 2, 3}
--> {"a", "b", "c", 1, 2, 3}

aList & {{1, 2, 3}}
--> {"a", "b", "c", {1, 2, 3}}

Hello.

The concatenation examples were interesting; there we go again with the record. The list example is somewhere I have been. :slight_smile:

It is the “list-compatible” requirement when searching for elements and records (especially records) that has been an “aha” experience for me. But then again, looking at the difference between a list of characters and strings, it is quite natural that an object must be of the same form as the thing in which you want to check for its containment.

set m to {{1, 2, 3, 4}, {5, 6, 7, 8}}

log ({1, 2, 3, 4} is in m) as text
-- false
log ({{1, 2, 3, 4}} is in m) as text
-- true 
-- and this one, so this is a little bit smarter than text item delimiters after all, you can't overstep "item boundaries"
log ({{3, 4, 5, 6}} is in m) as text
-- false

I see a lot of uses for this. Thanks a lot. :slight_smile:


Indeed. So consider this script:

set x to display dialog "display" default answer "answer"
set x to {zz:"zz", a:"a"} & x & {z:"z", aa:"aa"}
x as list
--> {"zz", "a", "OK", "answer", "z", "aa"}

It seems that, somehow, the order that the items have been added is preserved in the order of the resulting list. Do you have any idea how? If I add:

set the clipboard to x

I see:

'Jons'\'pClp'{ '----':{ 'bhit':'utxt'("OK"), 'ttxt':'utxt'("answer"), 'usrf':[ 'utxt'("zz"), 'utxt'("zz"), 'utxt'("a"), 'utxt'("a"), 'utxt'("z"), 'utxt'("z"), 'utxt'("aa"), 'utxt'("aa") ] }, &'subj':null(), &'csig':65536 }

which is what I’d expect, but doesn’t explain the placement of the non-user items in the final list.

Although more relevant to understanding of the use of ‘is in’ and ‘contains’ with records and lists:

set x to display dialog "display" default answer "answer"
set x to {zz:"zz", a:"a"} & x & {z:"z", aa:"aa"}
x contains {button returned:"OK", aa:"aa", z:"z", zz:"zz"}
--> true

Yes, and that makes sense if you assume there’s no order to record items. But the previous scripts suggest there is, at some level.

AppleScript is nothing if not entertaining…

There is some magic going on there. The reason behind it is that a record can only contain enumerated keys and not user-defined keys. Another aspect is that each key is allowed only once: you can’t have the same enumerated key twice in a record. The user-defined keys are therefore collected into one list and filed under a single keyword named ‘usrf’.

That makes a record different in presentation (AppleScript) from its actual data (AppleEvent, aka AERecord). It must be the AppleScript layer of the record that keeps track of the order of the items, since there is no such thing in the AppleEvent tier. I can confirm from writing scripting additions, as you have probably experienced yourself, that the order of the AppleScript record isn’t always the same as the order of the AppleEvent record that comes in. That there is a difference was the only logical explanation I had back then, and it still applies to this weird behavior.

To confirm my point I thought of running a mixed record through the AppleEvent manager, returning the data and seeing what happens. I found that the items were rearranged inside the record, so once a mixed record leaves the AppleScript world its order is lost. What if we do the same with your script:

set x to display dialog "display" default answer "answer"
set x to {zz:"zz", a:"a"} & x & {z:"z", aa:"aa"}

script scriptX
	on run argv
		return item 1 of argv
	end run
end script

run script scriptX with parameters {x} --> a rearranged record

As you see, the order in which you set the items is lost and the items are rearranged.

This is the closest I can get to an answer to your question “Do you have any idea?”. I have no idea what really happens in code, but based on AppleEvent’s transparency and on testing AppleScript code, it is clear that an AppleScript record is not (entirely) the same as an AppleEvent record. An extra, irreversible coercion/transition is made when an AppleScript record enters the world of AppleEvents.

Is it a bug or poor implementation? No, AppleScript and AppleEvents are both clear that the order of values in a record is not guaranteed. As with hash tables and associative arrays in other programming languages, neither the index nor the order in which the items are stored is important.

Edit: Shane showed I’m wrong about a conclusion… maybe it’s time to go to bed.

That was my conclusion, too – something at a lower level.

But Nigel’s later example suggests that order is ignored when you use “is in” with records.

You were fast :)… I found out that ‘is in’ is still safe. So the order of the items is not important when comparing records, but it is when comparing lists containing values. The comparison itself is more than just a record-to-list coercion:

set x to display dialog "display" default answer "answer"
set x to {zz:"zz", a:"a"} & x & {z:"z", aa:"aa"}

script scriptX
	on run argv
		return item 1 of argv
	end run
end script

set results to run script scriptX with parameters {x} --> a rearranged record
({text returned:"answer", z:"z", zz:"zz"} as list) is in (results as list) --false
{text returned:"answer", z:"z", zz:"zz"} is in results --true
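
A self-contained sketch of the same distinction without the dialog, using made-up records and lists. It assumes only AppleScript’s documented rule that the order of a record’s properties plays no part in equality or containment.

```applescript
-- Records: property order is irrelevant to equality and containment.
set sameRecord to ({a:1, b:2} = {b:2, a:1})
set hasPart to ({a:1, b:2, c:3} contains {c:3, a:1})

-- Lists: order matters, and 'contains' looks for a contiguous run of items.
set sameList to ({1, 2} = {2, 1})
set hasRun to ({1, 2, 3} contains {3, 1})

{sameRecord, hasPart, sameList, hasRun} --> {true, true, false, false}
```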

Hello.

So a record in AppleScript is really, at the least, a simulated set of values, if not a set of values internally.

A set is an unordered collection of elements in which no count is kept of similar elements. A constraint of uniqueness of elements is also possible, and then it is called a set of unique values.

Attributes and properties of objects and records often work like this: the last attribute/property of a kind is the one that is used.

It is a good thing that AppleScript treats the record as a record, and not as a list, and arranges the attributes of the record in some order to make comparing them easier when there is a test for likeness/containment. :slight_smile:

Wow, that opened a can of worms I wasn’t expecting.

Thanks everyone, I’ve got a better understanding now, and using {braces} properly for my subroutine results in a faster and more elegant subroutine than my nested repeats.

Oddly, I now realize that many years ago I came across this problem and ended up, by checking my variable values, eventually figuring out to put something in a (seemingly) “extra” set of braces for a comparison, but never dug in to really figure out what was going on and just ran with it. That was probably 10+ years ago and I’d forgotten entirely until I read this thread.

I was checking my variable values along the way this time and saw that the duplicates weren’t being caught because one value was remaining a record while the other was being coerced to a list, but I couldn’t figure out why it was doing that, and it didn’t occur to me that I could fix it with a simple set of braces. I just figured if I forced AppleScript to obtain both items in an identical manner, I would either avoid the coercion, or force an identical coercion, so I wrote it that way, and got the desired result.

What an odd language I write in. I completely understand the thought behind dynamic variable types to make things simpler to the user, and they’re great when they work as expected… but I can’t believe how often my “bug” is an unexpected variable type, and then it’s hidden from me… like having the user choose from a list of numbers, and it returns the resulting number as text… My scripts are full of

(((PathToFolder as text) & "/document name") as POSIX file)

and

if (userChoice as number) is someNumber

and, here’s a good one,

set the keywords of info to {tabChoice, (pathData as list as string), "true"}

Note that I’m forcing the value “true” as text, which is a long story. Sometimes making things simple makes them much more difficult.

There are probably better ways to do all these things; I usually just do the first thing I find that works. I know that can come back to bite me when I lack a deeper understanding of why it worked, because that means I also don’t understand when it’s not going to work. So thanks again for the deeper understanding on this one.

  • t.spoon.