Hash table: A good alternative for records

DJ_Bazzie_Wazzie · September 4, 2012, 10:28pm

Hello all,

For educational reasons I will show and explain how hash tables works in it’s simplest form. There are other, and in some scenarios faster, solutions to skin this cat but that’s not the point of this topic. I would like to keep this topic plain AppleScript and has to work with hash function and the code needs to be easy to understand, any input is welcome that will meet this requirements.

For those who aren’t familiar with hash tables, a small explanation. A hash table will create an hash from it’s key, mostly a string, where the encrypted value meets an index in an array. When you have a list of key values pairs you replace the part of iterating through the list to lookup the key with a small hash function, resulting in a good performing key value pair list.

script hashTable
	on newInstance()
		script hashTableInstance
			property parent : hashTable
			property size : missing value
			property hashList : missing value
			property keyList : missing value
			property valueList : missing value
		end script
	end newInstance
	
	on init()
		set my size to 0
		set my hashList to {}
		repeat 1000 times
			set end of my hashList to {}
		end repeat
		set my keyList to {}
		set my valueList to {}
		return me
	end init
	
	on initWithKeysAndValues(_keys, _values)
		set my keyList to _keys
		set my valueList to _values
		set my size to count my keyList
		set my hashList to {}
		repeat 1000 times
			set end of my hashList to {}
		end repeat
		repeat with x from 1 to my size
			set end of item hashFunction(item x of my keyList) of my hashList to x
		end repeat
		return me
	end initWithKeyAndValues
	
	on setValueforKey(_key, _value)
		set x to hashFunction(_key)
		repeat with node in item x of my hashList
			if id of item node of my keyList = id of _key then
				set item node of my valueList to _value
				return true
			end if
		end repeat
		set my size to (my size) + 1
		set end of my valueList to _value
		set end of my keyList to _key
		set end of item x of my hashList to my size
	end setValueforKey
	
	on valueForKey(_key)
		set x to hashFunction(_key)
		repeat with node in item x of my hashList
			if id of item node of my keyList = id of _key then
				return item node of my valueList
			end if
		end repeat
		return missing value
	end valueForKey
	
	on keyExists(_key)
		considering case
			return my keyList contains _key
		end considering
	end keyExists
	
	on keys()
		return my keyList
	end keys
	
	on setKeys(_keys)
		set _keys to every text of _keys
		if (count _keys) = (count my keyList) then
			set my keyList to _keys
			--need to re-hash
			set my hashList to {}
			repeat 1000 times
				set end of my hashList to {}
			end repeat
			repeat with x from 1 to my size
				set end of item hashFunction(item x of my keyList) of my hashList to x
			end repeat
		end if
	end setKeys
	
	on valueExists(_value)
		return my valueList contains _value
	end valueExists
	
	on values()
		return my valueList
	end values
	
	on setValues(_values)
		if (count _values) = (count my valueList) then set my valueList to _values
	end setValues
	
	on count
		return my size
	end count

	on hashFunction(_key)
		set _hash to count _key
		--Credits to Shane Stanly here, supports single character keys now.
		repeat with char in (id of _key as list)
			set _hash to _hash + char
		end repeat
		return _hash mod 1000 + 1
	end hashFunction
end script

The methods:

newInstance() - class method
create an instance of hashtable
init() --instance method
Initialize the instance, all values and size will be set to 0
initWithKeysAndValues(_keys, _values) --instance method
Initialize the instance, and all given values are set
setValueforKey(_key, _value) - instance method
Insert a key and value into the list; if a key already exists then it will be updated with the value
Keys are case sensitive
valueForKey(_key) - instance method
Return the associated value by key; if a given key doesn’t exists it returns missing value
Keys are case sensitive
keyExists(_key) - instance method
Return a boolean whether a key already exists or not; note this method is case sensitive
keys() instance method
Return a list with all the keys
setKeys(_keys) - instance method
Set new keys for all the values. The keys must be a all of type string and length should match the number of current keys
Problem with big lists is that changing the keys that the whole table needs to be re-hashed
valueExists(_value) - instance method
Return a boolean whether a value is already in the list or not.
values() - instance method
Return a list with all the values
setValues(_values) - instance method
set the values for all keys. Can be usefull when you have a fixed list with a fixed key table
count - instance method
Returns the number of items in the list
hashFunction(_key) - class method
The function that returns the associated item in the hash table

Note: Difference between instance method and class method is that class method should be called from the hashtable object immediatly and instance method should be called from the object returned by the newInstance class method.

USAGE:
First we can create an empty list by using:


set dict to hashTable's newInstance()'s init()

Now we’ve created a instance that is initialized and contains no data. To create a list with values you can use:

set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})

now to lookup an value you can use:


set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
dict's valueForKey("Second")

Which returns “Green” but keep in mind that the key’s are case sensitive. So when I lookup “second” it would return missing value instead of Green. To avoid this behavior you should capitalize or uncapitalize the key string matches and hashes function to make this case insensitive.

To add an value we should use:

set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
dict's setValueforKey("Last", {1, 2, 3, 5, 6, 7, 8})
dict's valueForKey("Last")

As you can see we can insert other object that string only. You can add any kind of basic AppleScript objects you want like string, numbers, integer, reals, records, lists and data.

To update an item we use the same code as inserting one. When an key exists the old associated value will be overwritten, this makes it also impossible to have two of the same keys. This is a feature not a bug :D.

set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
dict's setValueforKey("Second", "Yellow")
dict's valueForKey("Second")

when you don’t want to overwrite you can use keyExists(_key) method to check if a key exists before you want to set/add the key value pair


set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
if not dict's keyExists("Second") then
	dict's setValueforKey("Second", "Yellow")
end if
dict's valueForKey("Second")

You can also set immediatly the values without rehashing which can be usefull for things like profiles (fixed key and value list)


set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
dict's setValues({"Cyan", "Magenta", "Yellow"})
dict's valueForKey("Second")

The same can be applied for keys but remember that the list of keys can only contain strings and it needs to rehash the key table. So as long as dictionary contains less than 3000 items there is no problem/lag but when it’s 10,000 it needs some time to hash.

set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
dict's setKeys({"Color 1", "Color 2", "Color 3"})
dict's valueForKey("Color 3")

At last, and maybe weird is that count command. Some commands can be overwritten with script objects so you can use the count command for the instance:


set dict to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
count of dict
count dict --is also correct
number of dict --can also be used

So I hope this is enough information and having fun with it!

Already some FAQ answers

Why an hash table size of 1000?
The main reason for that is AppleScript seems to be comfortable with arrays containing up to 1000 items. While the lookup for the hash wouldn’t have any problem with a list containing 100,000 items, creating them is so slow you wouldn’t use this script object. So creating a bigger hash table is only overhead and initialization would take longer.

Shouldn’t it be smaller then?
No because the hash function will output an index by it’s key value, the larger the hash table the less collisions we have (multiple indexes in the node) and the faster the item will be returned.

Aren’t there other/better hash tables?
Definitely! There are plenty different solutions but I wanted it to keep simple as possible so the code speaks for itself. When someone knows the trick he can create more complex hash function resulting in a better hash code. However, since 10.10 ASObjC is supported in AppleScript so I would use an NSDictionary instead. Again this is only for educational purpose.

What’s the main advantage of hash tables?
Apart from being an associative array the main advantage of hash tables is that no matter how large a list gets, it should only affect the lookup performance for a tiny bit (read: collisions). When you look up data from a list containing 100 items, a repeat loop will be fast. But when you need to loop through an list of 100,000 items for a couple of times it would be a lot more data handling. So when the list contains 100 or 100,000 the performance difference should be very small.

Does the ‘performance’ advantage also applies to this hash table?
Keep in mind that AppleScript and it’s foundation is all written in C, it’s compared to AppleScript a very high performance programming languages. That implies that self written implementations in plain AppleScript is slower than AppleScript’s own implementations written in C (not using an AppleEvent). So compared to lists and records this is slow, but when you lookup values by iterating in plain AppleScript this performance advantage does apply also in this simple hash table.

So why should I use an hash table when I can’t use it in a way as in other programming languages?
Well like in my first line, this post is for educational purposes. Another thing is that even creating a list containing 10,000 items are slow, you can still lookup an item in a blink of an eye. There are many cases records or other alternatives are better but this hash table can lookup keys for existence and their keys doesn’t need to be set at compile time which is required for records.

Why is it case sensitive?
Some programming languages does support associative arrays with case insensitive keys but I prefer case sensitive and using lower case key names. It’s something personal and also in this context quite easier otherwise I had to convert the keys to lower or uppercase first in the hash function so that the outcome of “KEY” and “key” are the same.

Edit 1: Shane Stanley suggested to change the loop to support single character keys
Edit 2: Nigel suggested to change the local variable name ‘hash’ because it collide with satimage users. Therefor I changed the local variable ‘hash’ into ‘_hash’. I also changed dictionary into dict
Edit 3: Shane pointed me to a translation error I made. The method initWithKeysAndValues is now spelled correctly
Edit 4: Updated some information because some of it doesn’t apply or AppleScript provides better alternatives by itself now.

Adam_Bell · September 5, 2012, 12:00am

Very slick. Thank you for the thorough entry.

Nigel_Garvey · September 5, 2012, 8:35am

Thanks, DJBW.

I haven’t had time to look at this in detail (we’re expecting a visitor today), but immediately obvious problems are:

‘hash’ is a command in the Satimage OSAX and causes the script to error.
‘size’ and ‘count’ are reserved words in AppleScript and shouldn’t be used out of context.
‘dictionary’ is also a reserved word on my system (I haven’t tracked it down) and prevents the script from compiling.

Instances of the complete words ‘hash’, ‘size’, and ‘dictionary’ can be changed to ‘|hash|’, ‘|size|’, and ‘|dictionary|’ respectively if anyone has problems. The ‘count’ intercept handler can and should be replaced with an ordinary handler ” say ‘|count|()’ ” to clarify the code and distinguish between the handler and the times ‘count’ is used normally. The handler could even be dispensed with altogether, since it only executes a three-word line.

DJ_Bazzie_Wazzie · September 5, 2012, 9:28am

First of all thanks for the replies Adam and Nigel!

Nigel, I’ve changed the hash variable name into _hash to avoid collisions with satimage users. I’ve also changed dictionary to dict because on one of my machines it seems to be problem as well (xmltools osax installed).

For count I leave it like that. I’ve always considered this being part of overwriting default commands; semi-sub classing. It executes number of, count of and count so it’s fully supported IMO. Overwriting preserved keywords is no problem as long as these properties are preserved properties like name, id, length, size and contents for example. You can’t use property names like file and weekday and you’re correct about that. That only applies for script objects properties and not locals or globals, for those you should never use preserved keywords IMO.

Shane_Stanley · September 5, 2012, 11:45pm

DJ,

Can I suggest another change? Rather than starting like this:

script hashTable
   on newInstance()
       script hashTableInstance
           property parent : hashTable
           property size : missing value
           property hashList : missing value
           property keyList : missing value
           property valueList : missing value
       end script
   end newInstance

You could start with this:

script hashTable
	property size : missing value
	property hashList : missing value
	property keyList : missing value
	property valueList : missing value
	
	on newInstance()
		return me
	end newInstance

By doing it that way, it becomes more inheritable.

For example, suppose I wanted the methods you provided, except sometimes I want to remove your ban on duplicate keys. You’re not actually enforcing it in most of your handlers, which means I can use them as is. So if you wrote it as above I could effectively subclass your script something like this:

script hashTableWithDupes
	property parent : hashTable
	
	on setValueforKey(_key, _value)
		set x to hashFunction(_key)
		set my size to (my size) + 1
		set end of my valueList to _value
		set end of my keyList to _key
		set end of item x of my hashList to my size
	end setValueforKey
	
	on valueForKey(_key)
		set resultList to {}
		set x to hashFunction(_key)
		repeat with node in item x of my hashList
			if id of item node of my keyList = id of _key then
				set end of resultList to item node of my valueList
			end if
		end repeat
		return resultList
	end valueForKey
	
end script

And thus:

set dict to hashTableWithDupes's newInstance()'s initWithKeysAndValues({"First", "Second", "Second"}, {"Red", "Green", "Blue"})
dict's valueForKey("Second")
--> {"Green", "Blue"}

alastor933 · September 6, 2012, 7:39am

Great stuff, DJ, especially for someone who only recently started tinkering with objects and associative lists. Gives me something to chew on, now that evenings in the garden come to an end.

Don’t spoil the fun by explaining, but: the list ‘hashList’ is (part of) a device to quickly find a key/value, or maybe even get them immediately without having to search. That about right?

Same with associative list. I have since found a specific application (that is, applied the device) where duplicate keys must be allowed (*). Uniqueness of keys is not imposed by the device per sÃ© (right?), but from DJ’s remark I’m guessing that there are some who feel keys must be unique, and others who don’t. I would be grateful for a summary of pros and cons of either approach.

(*) Another scripting aid, this one to support developing Pashua interfaces.

DJ_Bazzie_Wazzie · September 6, 2012, 9:24am

True, but it becomes a singleton and not a real instance.

set theList to {}
set end of theList to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Red", "Green", "Blue"})
set end of theList to hashTable's newInstance()'s initWithKeysAndValues({"First", "Second", "Third"}, {"Cyan", "Magenta", "Cyan"})
valueForKey("Second") of item 1 of theList

When newInstance() returns me (singleton) the code above returns “Magenta” while my original method returns “Green” because there are two instances in the list. You could of course solve this by using a copy command but then you copy the whole script object which is in my opinion is a bit overhead in memory and takes up a lot of time as well.

What I do, your hashTableWithDupes example is great and uses inheritance a lot that way, is copying the newInstance method to the ‘sub’ script and change the parent property. The script will follow the parenting chain without problems and you only need 1 extra action.


script hashTableWithDupes
	property parent : hashTable
	
	on newInstance()
		script hashTableInstance
			property parent : hashTableWithDupes
			property size : missing value
			property hashList : missing value
			property keyList : missing value
			property valueList : missing value
		end script
	end newInstance
	
	on setValueforKey(_key, _value)
		set x to hashFunction(_key)
		set my size to (my size) + 1
		set end of my valueList to _value
		set end of my keyList to _key
		set end of item x of my hashList to my size
	end setValueforKey
	
	on valueForKey(_key)
		set resultList to {}
		set x to hashFunction(_key)
		repeat with node in item x of my hashList
			if id of item node of my keyList = id of _key then
				set end of resultList to item node of my valueList
			end if
		end repeat
		return resultList
	end valueForKey
end script

I think it’s a programmer’s perference what you need and want and yours as mine have both their cons and pros. So I hope you don’t mind leaving it as it is.

Thank you very much for your input Shane

DJ_Bazzie_Wazzie · September 6, 2012, 9:41am

Correct (partly)! There are collisions which you may find out yourself, don’t want to spoil the fun, which needs to be caught. You can play with the hash function and change it, it very easy and straightforward at the moment.

Shane_Stanley · September 6, 2012, 11:26am

Good point about it being a singleton. I guess I just have trouble with the idea of not inheriting properties/ivars.

Shane_Stanley · September 6, 2012, 11:28am

I think it’s more a case that sometimes what you want to do requires one or other approach.

DJ_Bazzie_Wazzie · September 6, 2012, 12:20pm

You’re absolutely right and, either way, I think it’s not a nice solution but the best we can get. This, or your way, is the closest you can get but far away from full OOP. Once used to OOPs like C++, PHP or (AppleScript)Objective-C I’m getting a little mixed up feelings when I’m back in AS with script objects again. Simplicity isn’t always better