search a very long list for an item

Nigel_Garvey · April 8, 2007, 11:17pm

Hi, both.

The binary search is only about three times as fast on my machine, but that’s still pretty good! Thanks, Stefan.

Here’s a slightly optimised version. Besides the obvious script object, the ‘high’ and ‘low’ variables are handled more economically. Also, on the theory that the item’s more likely to be in the list than not, that test’s left to the end, when there’s only one item to test:

BinarySearch(get paragraphs of (do shell script "look a"), "anteater" as Unicode text)

on BinarySearch(TheList, value)
	script o
		property l : TheList
	end script
	
	set low to 1
	set high to (count TheList)
	repeat until (low = high)
		set mid to (low + high) div 2
		if ({value} is in items low thru mid of o's l) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
end BinarySearch

StefanK · April 9, 2007, 8:29am

Very neat, Nigel.

I think, you can often save an extra comparison loop, if you decrement the high value like incrementing the low one.
For this change the repeat condition has also to be adapted

on BinarySearch(theList, value)
	local low, mid, high
	script o
		property l : theList
	end script
	
	set low to 1
	set high to (count theList)
	repeat while (low â‰¤ high)
		set mid to (low + high) div 2
		if ({value} is in items low thru mid of o's l) then
			set high to mid - 1
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
	
end BinarySearch

Nigel_Garvey · April 9, 2007, 9:46am

Hi, Stefan.

Just looking at this quickly, I suspect that it might not find the item if the item’s exactly at a ‘mid’ position. I have to go out now, but I hope to look at this subject again tonight. As with a linear seach, the time taken by a binary search depends on whereabouts in the list the item actually is. For instance, although the binary search is much faster at finding “anteater” in Kel’s list, the linear search seems to have a slight advantage when looking for “aardvark”.

StefanK · April 9, 2007, 10:16am

This is obvious, as long as the position of the item to find is less than the number of loops plus the code overhead of the binary search.
By contrast the binary search is incredibly faster if the position is near to the end of the list.
On average a binary search is always faster than a linear search

Nigel_Garvey · April 9, 2007, 9:09pm

Hi, Stefan.

I think I agree with that. And when the linear search is faster, it’s not by very much, so the binary search is generally a safe bet.

Contrary to my intial guess, your effort to save the extra comparison loop does find items in ‘mid’ positions. But it doesn’t, on average, in my tests, save any time. Sometimes it does fewer loops than my optimisation, sometimes more, sometimes the same number. The totals for the two methods over the first (list length - 1) items seem to be exactly the same, but finding the very last item in the list always takes one more loop with your optimisation than with mine.

Test method: log {low, mid, high} on each loop and call the handler with a logged loop that finds every item in the list. (The items must be unique, of course.) Use similar scripts for the two optimisations, vary the length of the input list, compare the event logs.

on BinarySearch(theList, value)
	script o
		property l : theList
	end script
	
	set low to 1
	set high to (count theList)
	repeat until (low = high)
		set mid to (low + high) div 2
		log {low, mid, high}
		if ({value} is in items low thru mid of o's l) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
end BinarySearch

set theList to {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "m", "o", "p"}
repeat with i from 1 to (count theList)
	log BinarySearch(theList, item i of theList)
end repeat

StefanK · April 10, 2007, 9:00am

You’re right, Nigel, your script is a bit faster.

Comparing directly both scripts with Lotsa (1000 loops and a list of 1000 items)
your version takes 49.066 seconds and mine 49.566 seconds

Vincent · April 10, 2007, 11:22am

For multiple occurences in the list this modification should work:

set theList to {"a", "b", "c", "d", "e", "f", "a", "g", "h", "i", "j", "k", "a", "l", "m", "m", "o", "p"}
MultiBinarySearch(theList, "a", 0)

on MultiBinarySearch(theList, value, start)
	script theScript
		property refList : theList
	end script
	if {value} is not in theScript's refList then return false
	set indexes to {}
	set low to 1
	set high to (count theScript's refList)
	repeat until (low = high)
		set mid to (low + high) div 2
		--log {low, mid, high}
		if ({value} is in items low thru mid of theScript's refList) and ({value} is in items (mid + 1) thru high of theScript's refList) then
			set indexes to indexes & MultiBinarySearch(items (mid + 1) thru high of theScript's refList, value, (mid + 1))
			set high to mid
		else if ({value} is in items low thru mid of theScript's refList) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	if (item low of theScript's refList is value) then
		set beginning of indexes to (low + start)
		return indexes
	end if
end MultiBinarySearch
--result: {1, 14, 8}

(Results are not in chronicle order, though.
Some routine like QuickSort would have to do the rest:
http://bbs.applescript.net/viewtopic.php?id=17340)

mghilissen · April 13, 2007, 7:18pm

To search for multiple occurrences of one value in a list, use this one, rather:


set theValue to "a"
set theList to {"a", "b", "c", "d", "e", "f", "a", "g", "h", "i", "j", "k", "a", "l", "m", "m", "o", "p"}
Qsearch(theList, 1, count theList, theValue)
to Qsearch(array, leftEnd, rightEnd, theValue)
	script A
		property l : array
	end script
	set theIndexes to {}
	set {i, j} to {leftEnd, rightEnd}
	set v to ((leftEnd + rightEnd) div 2) -- pivot in the middle
	repeat while (j > i)
		repeat while (i is not greater than v)
			if item i of A's l = theValue then
				set theIndexes to theIndexes & {i}
				exit repeat
			else
				set i to i + 1
			end if
		end repeat
		repeat while (j is greater than v)
			if item j of A's l = theValue then
				set theIndexes to theIndexes & {j}
				exit repeat
			else
				set j to j - 1
			end if
		end repeat
		if (i is not greater than j) then
			set {i, j} to {i + 1, j - 1}
		end if
	end repeat
	if (leftEnd < j) then Qsearch(A's l, leftEnd, j, theValue)
	if (rightEnd > i) then Qsearch(A's l, i, rightEnd, theValue)
	theIndexes
end Qsearch

-- The Hoare's QuickSort algorithm will put the result in sequential order.

– Result: {1, 13, 7}

Nigel_Garvey · April 13, 2007, 10:08pm

Thanks to Vincent and mghilissen for the idea! This version returns the indices in the right order:

on MultiBinarySearch(theList, value, l, r)
	script o
		property lst : theList
		property indices : {}
		
		on mbsrch(l, r)
			if (item l of my lst is value) then set end of my indices to l
			
			set l to l + 1
			set m to (l + r) div 2
			if (m â‰¥ l) and ({value} is in items l thru m of my lst) then mbsrch(l, m)
			
			set m to m + 1
			if (r â‰¥ m) and ({value} is in items m thru r of my lst) then mbsrch(m, r)
		end mbsrch
	end script
	
	o's mbsrch(l, r)
	
	return o's indices
end MultiBinarySearch

set theList to {"a", "b", "c", "d", "e", "f", "a", "g", "h", "i", "j", "k", "a", "l", "m", "m", "o", "p", "a", "e"}
MultiBinarySearch(theList, "a", 1, (count theList))
--> {1, 7, 13, 19}

[i]Edit: Simplified the code to clarify its working and to remove an “optimisation” that was actually having slightly the opposite effect!

I’ve had a chance to study Vincent’s and mghilissen’s scripts this morning. I hope they won’t mind if I comment here.

Vincent’s script returns the right number of indices, but those returned from lower recursion levels aren’t correct. The value passed to ‘start’ in the recursive calls should be ‘(mid + start)’, not ‘(mid + 1)’.

mghilissen’s script is based on the Quicksort algorithm and is actually a series of converging linear searches that find the first instance of the value, then the last, then the second, then the second-to-last, and so on. For this reason, it’s slightly slower than a single linear search straight through the list. The two recursive calls at the end contribute nothing to the final result but add a terrific amount to the running time.

Binary searches are only effective “ and are only necessary “ because AppleScript’s a “high level” language. The ‘is in’ command actually carries out a linear search itself, but at a lower coding level. This is much faster than using high-level AppleScript commands to carry out the individual details of a linear search. (In much the same way, a person will walk across the room much faster if you tell them to walk across the room than if you give them individual instructions for lifting each leg, shifting their weight forward, and falling in a controlled manner back onto the foot of the current leg. There’s probably a similar allegory to do with Government interference and the National Health Service, but I won’t go into that here. )[/i]

mghilissen · April 14, 2007, 9:21pm

I am absolutely delighted to take lessons from you, Nigel. Only perfect practice makes perfect. Thanks for your insight.

Michael

StefanK · April 14, 2007, 9:59pm

This is awesome, Nigel

Vincent · April 14, 2007, 10:23pm

Ditto!:D:cool:

McUsrII · September 30, 2013, 11:45am

Hello.

I just want to add to this thread, that using binary search, for searching for values in list of lists, is useful too. Well, it is not that easy to ( or you can’t really) find the index of the list that is contained by the list.
So, in such cases where you want to search for items that are contained in a list of lists, you are better off constructing a list with the index values, and use binarysearch on that one.

Yvan_Koenig · September 30, 2013, 2:56pm

I ran the Nigel’s script on a list of 4326 datestrings.
It took less than one second to return the list of 21 values matching the searched key.

So, I immediately added the handler to my scripts library.

KOENIG Yvan (VALLAURIS, France) lundi 30 septembre 2013 16:56:07

Shane_Stanley · October 1, 2013, 2:07am

I wonder if that’s what you really meant to say. Seems to me they’re certainly effective in other languages, although the potential gain – and hence tipping point from advantageous to necessary – is obviously greater in high-level languages like AS.

McUsrII · October 1, 2013, 2:23am

And it saves programmer time too! -In using binarysearch, spares you from a lot l of steps to iterate over while debugging your code, compared to looking up elements one after the other.

ame · October 1, 2013, 8:43am

Hi Nigel,

Great routine!!! Very essential and fastest.

Many thanks for your contribution

Ame

DJ_Bazzie_Wazzie · October 1, 2013, 10:47am

If you look closely to Nigel’s code you’ll see that it’s not a binary search and he takes fully advantage of the fast underlying linear search from AS (contains) using some of the binary search technique. A binary search only works on ordered lists while Nigel’s approach doesn’t need a an ordered list which makes his code great. Instead of using the greater than and less than operators he’s using the contains and not contains operators instead.

Nigel_Garvey · October 1, 2013, 11:22am

It was six and a half years ago. I don’t even remember writing the handler.

But I broadly agree with the statement the context in which it was written ” that is, with regard to a search of a randomly ordered list. In this case, a binary search has repeatedly to ascertain if a value exists in one of two subranges of the list and home in on it there if it does or home in on it in the other subrange if it doesn’t. At the lowest level, the ascertaining can only be done with linear searches of the tested subranges, so a hit result may as well be returned directly from one of these. It just so happens that the several range extractions, subrange re-extractions, and sometimes overlapping low-level linear searches involved in repetitions of the AppleScript ‘. is in items . thru . of .’ are statistically faster than a linear search of the list at the AppleScript level. Obviously the actual time difference in any one case depends on whereabouts in the list the value occurs.

However, when dealing with a list known to be ordered, the criterion for each subrange selection is simply whether the sought value is less than or greater than the value at the current index. This kind of binary search, I agree, is possible and potentially effective in low-level languages. And by “low-level”, I’m thinking “assembler” rather than any intermediate language between this and AppleScript.

Shane_Stanley · October 1, 2013, 11:35am

Yes, I should have done you the courtesy of at least looking at the code :rolleyes: I saw “binary search” and assumed the classical binary search on sorted data.