Approximate string matching: The Levenshtein distance

Nigel_Garvey · June 1, 2015, 2:39pm

And one may want to consider using a ‘considering case’ statement, either round the main repeat in the handler or round the call to the handler itself.
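As a minimal sketch of the second option, the snippet below wraps a handler call in a ‘considering case’ statement. The handler sameCharacter is a hypothetical stand-in for the character test inside the distance handler, not part of any script in this topic.

```applescript
-- Hypothetical helper, standing in for the character comparison in the distance handler.
on sameCharacter(c1, c2)
	return (c1 is c2)
end sameCharacter

-- Text comparisons ignore case by default.
set defaultResult to sameCharacter("a", "A")

-- The 'considering case' attribute carries into the handler called inside the statement.
considering case
	set strictResult to sameCharacter("a", "A")
end considering

{defaultResult, strictResult} -- {true, false}
```

Wrapping the call leaves the handler itself untouched, so the caller can choose case sensitivity per call.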

McUsrII · June 1, 2015, 3:05pm

Absolutely.

DJ_Bazzie_Wazzie · June 1, 2015, 3:34pm

I haven’t timed the implementations, but I think Nigel’s version is the best of all in terms of efficiency. In C, for example, Nigel’s implementation would not be faster (it would be marginally slower due to the extra allocating and freeing of memory), but with inputs of more than 1k its memory usage would be much lower, thanks to the much smaller buffers, while the bandwidth would stay the same.

Nigel_Garvey · June 1, 2015, 4:28pm

It’s possible to modify the rolling two-row version to reuse the same two “row” lists throughout the process:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
-- Same as string1 but with typos (deletion, insertion, alteration and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
		property charList1 : characters of s1
		property charList2 : characters of s2
		property previousRow : missing value
		property currentRow : missing value
	end script
	
	-- Initialise both "row" lists as the first row.
	set o's previousRow to {0} & o's charList1
	repeat with x from 1 to xLen
		set item (x + 1) of o's previousRow to x
	end repeat
	set o's currentRow to o's previousRow's items
	-- Handle the remaining rows in a rolling manner, the two lists alternating as the previous and current rows.
	repeat with y from 1 to yLen
		set item 1 of o's currentRow to y
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of o's previousRow)
			else
				set alternate to (item x of o's previousRow) + 1
			end if
			set min to (item (x + 1) of o's previousRow) + 1
			if deletion < min then set min to deletion
			if alternate < min then set min to alternate
			
			set item (x + 1) of o's currentRow to min
		end repeat
		
		tell o's previousRow
			set o's previousRow to o's currentRow
			set o's currentRow to it
		end tell
	end repeat
	
	return end of o's previousRow
end levenshteinDistance

McUsrII · June 1, 2015, 4:29pm

First, I clocked my version at 1.26 seconds, while your second version took 0.47 seconds. Then I performed the same optimizations as you had done and ended up at 0.3 seconds on my machine. Then Nigel comes along with his version at a whopping 0.27 seconds. All timings are based on 100 iterations over the strings you provided.

In C, I think it is possible to optimize Nigel’s version, for instance by treating the input in chunks and accumulating the results, thereby using chunk-sized arrays and minimizing the allocating and freeing of memory by using static storage.

DJ_Bazzie_Wazzie · June 1, 2015, 6:56pm

What I wrote uses a single memory allocation and a 1D array. With lists in general, as in AppleScript, you can gain performance by using 2D arrays or small lists. Unlike with lists, with arrays it doesn’t matter whether you access the first item or the 1,000,000th, so avoiding 2D arrays gives a performance gain within that language (C). Keeping the arrays small, even if it required no extra steps (which is not possible), gives no performance gain. But as I said, I like Nigel’s code because of its efficient use of resources.

Thanks for the timings.

McUsrII · June 1, 2015, 7:07pm

I have read a whole lot about memory allocation. My point is that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could just preallocate some static arrays, sized to fit the text you read in, and simply reinitialize them for each call. The beauty lies in the iterations: there you can just flip two pointers so that they point to the other array, and access v1 and v0 through those pointers, which is a fast alternative to any copying. Maybe we could have used that trick in AppleScript too, but I doubt it would have had a large positive impact.

DJ_Bazzie_Wazzie · June 1, 2015, 7:28pm

I know. I was maybe ahead of you: I meant that there would be no performance gain from the “v1 and v0” solution either. Swapping pointers, however small its execution time, is still slower than no swapping. Using static arrays, in stack memory instead of on the heap, does bring a lot of performance gains, but as always, using stack memory limits your buffers. It’s an unwritten rule not to use stack memory for this kind of data.

McUsrII · June 1, 2015, 7:40pm

Maybe you were ahead of me, I don’t know. I just had the idea of using static storage, whether by declaring reasonably sized arrays up front or by allocating memory from the heap at program startup. And just for the record, Nigel seems to use the “pointer-flipping” optimization already; I just named them v0 and v1 because I had forgotten what Nigel called them. I mentioned it because you worried about bandwidth and memory allocations. My point is that those two problems can be almost optimized away for the variables the function uses, except during program load and exit. I had no intention whatsoever of allocating the arrays on the stack.

Edit

I too am aware that all static variables are created in the data segment (some segment, anyway), so no matter how big the static variables I declare in a C function are, they will not be allocated on the program stack at run time like automatic variables.

DJ_Bazzie_Wazzie · June 2, 2015, 12:03am

Just for the record, that behaviour depends on the compiler (settings). Statics can be assigned to the data (initialized), bss (uninitialized) and code (both) segments of the executable. Also, it would make no difference whether or not you use statics in this situation, unless you make it recursive. So my reference to stack memory was meant more generally, as a fixed memory space that is not heap memory; all such spaces are bound by the same limitations.

I don’t worry about bandwidth at all :), I think you misunderstood me. My point was that with Nigel’s approach you would use less memory, even without gaining any performance. I said that explicitly because the topic reads (look at the timings) as if the best implementation is the fastest one, regardless of the quality of the code itself. So, without timing any of the solutions, I liked Nigel’s script the most, even if it had been slower (which it is not).

McUsrII · June 2, 2015, 7:32am

Hello.

I agree that Nigel’s second version is also the most readable, so that is a good reason for me to like it too. After all, performance is just a currency that we use to “buy” security and functionality. It is important, but it is not really the most important property of a routine, unless, of course, it really is for that one routine.

Nigel_Garvey · June 3, 2015, 12:28pm

A couple of further possibilities:

Make ‘charList1’ and ‘charList2’ the ‘id’ of the respective strings rather than the ‘characters’. This will make the character comparisons sensitive to everything and speed them up because they’ll be integer rather than text comparisons.
Test the strings (or the ‘id’ lists) for equality before proceeding with the rest of the process. This will greatly speed up the return of 0 for equal strings without noticeably slowing the treatment of unequal strings.

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
-- Same as string1 but with typos (deletion, insertion, alteration and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
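		property charList1 : (id of s1) as list -- Character ids for sensitivity to everything ...
		property charList2 : (id of s2) as list -- ... and speed. ('as list' because the id of a one-character string is a lone integer.)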
		property previousRow : missing value
		property currentRow : missing value
	end script
	
	-- Return 0 straight away if the two strings are equal.
	if (o's charList1 = o's charList2) then return 0
	
	-- Otherwise initialise two "row" lists as the first row of a notional matrix.
	set o's previousRow to {0} & o's charList1
	repeat with x from 1 to xLen
		set item (x + 1) of o's previousRow to x
	end repeat
	set o's currentRow to o's previousRow's items
	-- Handle the remaining rows in a rolling manner, the two lists alternating as previous and current rows.
	repeat with y from 1 to yLen
		set item 1 of o's currentRow to y
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of o's previousRow)
			else
				set alternate to (item x of o's previousRow) + 1
			end if
			set min to (item (x + 1) of o's previousRow) + 1
			if (deletion < min) then set min to deletion
			if (alternate < min) then set min to alternate
			
			set item (x + 1) of o's currentRow to min
		end repeat
		
		tell o's previousRow
			set o's previousRow to o's currentRow
			set o's currentRow to it
		end tell
	end repeat
	
	return end of o's previousRow
end levenshteinDistance
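
The effect of comparing ‘id’ values instead of characters can be seen in isolation. This is only an illustrative sketch; the variable names are made up for the example.

```applescript
-- By default, text comparisons ignore case.
set byCharacter to ("a" is "A") -- true

-- Unicode ids are plain integers, so the comparison distinguishes case.
set byId to ((id of "a") is (id of "A")) -- false: 97 is not 65

-- A multi-character string yields a list of integers.
set idList to id of "Abc" -- {65, 98, 99}

-- A one-character string yields a lone integer, not a one-item list.
set singleId to id of "A" -- 65

{byCharacter, byId, idList, singleId}
```

Because a one-character string gives a bare integer, a handler that indexes into the result may want to coerce it with ‘as list’ first.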

McUsrII · June 3, 2015, 1:29pm

Hello Nigel.

Great ideas for optimizing the handler. I, at least, can see no reason why anybody would want such a handler to be case insensitive. I haven’t timed it, but I wager it is a lot faster.

DJ_Bazzie_Wazzie · June 3, 2015, 1:57pm

Spelling correction, where it is used most, is a mix of case-sensitive and case-insensitive matching. My finished project also had to be case insensitive. However, if you still want it case insensitive, you can convert both strings to the same case (lowercase) before calling this function.
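
One possible way to do that conversion, sketched here with AppleScriptObjC and Foundation’s NSString (which assumes OS X 10.10 or later); the handler name lowercased is hypothetical:

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later.
use framework "Foundation"

-- Hypothetical helper: returns a lowercase copy of theText.
on lowercased(theText)
	set nsText to current application's NSString's stringWithString:theText
	return (nsText's lowercaseString()) as text
end lowercased

set typed to lowercased("BackUp")
set stored to lowercased("BACKUP")

-- With both strings folded to the same case, an id-based comparison sees no difference.
(id of typed) = (id of stored) -- true
```

The two lowercased strings can then be passed to the distance handler in place of the originals.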

McUsrII · June 4, 2015, 2:12pm

Hello.

Here is another idea, regarding the spelling distances of filenames. The idea is that a user types in a filename, or something else that is word sized, and you iterate over the contents of the container of the items and return the one with the least distance.

on spdist for s1 against s2
	-- returns coarsely the differences between two strings
	-- 0 if strings are identical
	-- 1 if two chars are transposed, or one is different
	-- 2 if one char added or deleted
	-- 3 otherwise.
	
	-- It is really meant to solve problems with filenames that are
	-- misspelled by one character.
	-- Ported from "The Unix Programming Environment", p. 211,
	-- by B. W. Kernighan and R. Pike.
	
	script o
		property l : s1's id
		property m : s2's id
	end script
	
	set l1 to length of s1
	set l2 to length of s2
	
	if l1 = l2 then
		set diffs to 0
		set ante to 0
		repeat with i from 1 to l1
			if item i of o's l ≠ item i of o's m then
				if diffs = 0 then
					if ante = 0 then -- first error 
						set ante to 1 -- 1 spelling error.
					else if ante = 1 then
						if item i of o's l = item (i - 1) of o's m and item (i - 1) of o's l = item i of o's m then
							set diffs to 1 -- one full transposition.
						else
							set diffs to 3 -- two spelling errors.
						end if
						set ante to -1
					end if
				else
					set diffs to 3
				end if
			end if
		end repeat
		if diffs = 0 and ante = 1 then set diffs to 1
	else
		if l1 > l2 then
			if (l1 - l2) < 2 then
				set diffs to 2
			else
				set diffs to 3
			end if
		else
			if (l2 - l1) < 2 then
				set diffs to 2
			else
				set diffs to 3
			end if
		end if
	end if
	return diffs
end spdist
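
The search over a container’s items could be sketched as follows. This is a rough illustration under stated assumptions: closestName and editDistance are hypothetical names, the list of names is hard-coded rather than read from a folder, and a compact two-row Levenshtein distance is swapped in for spdist so that the sketch runs on its own.

```applescript
-- Hypothetical helper: returns the candidate closest to what the user typed.
on closestName(typedName, candidateNames)
	set bestName to missing value
	set bestDistance to -1 -- -1 means "no candidate examined yet".
	repeat with candidate in candidateNames
		set thisName to contents of candidate
		set d to editDistance(typedName, thisName)
		if (bestDistance < 0) or (d < bestDistance) then
			set bestDistance to d
			set bestName to thisName
		end if
	end repeat
	return bestName
end closestName

-- Compact two-row Levenshtein distance, included only to keep the sketch self-contained.
on editDistance(s1, s2)
	set ids1 to (id of s1) as list
	set ids2 to (id of s2) as list
	set previousRow to {0}
	repeat with x from 1 to (count ids1)
		set end of previousRow to x
	end repeat
	repeat with y from 1 to (count ids2)
		set currentRow to {y}
		repeat with x from 1 to (count ids1)
			-- Start from the diagonal cell, adding 1 if the characters differ.
			set best to (item x of previousRow)
			if (item x of ids1) is not (item y of ids2) then set best to best + 1
			-- Compare with the cell to the left and the cell above, each plus 1.
			if (item x of currentRow) + 1 < best then set best to (item x of currentRow) + 1
			if (item (x + 1) of previousRow) + 1 < best then set best to (item (x + 1) of previousRow) + 1
			set end of currentRow to best
		end repeat
		set previousRow to currentRow
	end repeat
	return end of previousRow
end editDistance

closestName("levenstein.txt", {"readme.txt", "levenshtein.txt", "notes.txt"}) -- "levenshtein.txt"
```

On ties the first candidate with the smallest distance wins, which may or may not be what a real file picker wants.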

Edit

So I mulled over it a little, and I changed the code to work similarly to the description, almost: a transposition may not be an actual transposition, and a single misspelled character also counts as a transposition. So in the worst case, two independently misspelled characters count as one transposition.

Edit++
It is changed slightly, so it now distinguishes between a “true” transposition and different characters at two different places.

McUsrII · June 11, 2015, 11:14am

Hello.

I just found out that Levenshtein distances may be used for computing queries. I just provide a link for those interested: Levenshtein Automata (2010) | Hacker News. You reach the original link by clicking on it at the top of the page.

DJ_Bazzie_Wazzie · June 11, 2015, 11:52am

Yes, search engines use them too, but in a whole different way.

McUsrII · June 11, 2015, 12:54pm

Well, it is interesting; I have only skimmed that article so far. Here is another one covering edit distances and an optimal algorithm, which I have no time to read at the moment: Computer scientists prove that a 40-year-old algorithm is optimal | Hacker News

Just mentioning it: I have relied many times on the fact that Google uses something like this, when I wonder what an English word means but haven’t got the spelling right. The last time was earlier today: I wondered what “itinary” meant, so I tried Spotlight first, but the spelling was wrong. Google, however, gave me back results for “itinerary” immediately.

Edit

Building a spell checker must be a very interesting project, with regard to the trade-offs of the data structures involved. I see that nearly every one of them uses the Levenshtein distance.

DJ_Bazzie_Wazzie · June 11, 2015, 2:38pm

They do, but they differ a lot in how they are implemented. Almost no spelling checker actually processes entire words; the Levenshtein distance itself is computed in one of the last steps of the spelling check, and it is applied only to the pieces of words that differ. For instance, when matching the typo “backkup” against “backup”, no Levenshtein match is required when the spelling dictionary is a syllable-based b-tree. It is when a typo like “backop” is made that the Levenshtein distance is used, to match “op” against “up” (and all other possible remaining syllables).

While the algorithm is quite heavy to process, implementations are arranged so that the Levenshtein distance is actually computed for a smaller piece of the word, against a dictionary already narrowed down during preprocessing. This makes it possible to apply approximate matching and give suggestions on the fly.
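
A very rough illustration of reducing the work to the differing pieces is to strip the shared leading characters before measuring anything. This is not how a syllable-based dictionary works internally; it is only a sketch, and the handler names differingParts and textAfter are hypothetical.

```applescript
-- Hypothetical helper: returns what is left of each word after their shared prefix.
on differingParts(word1, word2)
	set ids1 to (id of word1) as list
	set ids2 to (id of word2) as list
	set limit to (count ids1)
	if (count ids2) < limit then set limit to (count ids2)
	-- Count how many leading characters the two words share.
	set shared to 0
	repeat while (shared < limit) and ((item (shared + 1) of ids1) = (item (shared + 1) of ids2))
		set shared to shared + 1
	end repeat
	return {rest1:textAfter(word1, shared), rest2:textAfter(word2, shared)}
end differingParts

-- Hypothetical helper: the text following the first n characters, or "" if there is none.
on textAfter(theText, n)
	if n ≥ (count theText) then return ""
	return text (n + 1) thru -1 of theText
end textAfter

differingParts("backop", "backup") -- {rest1:"op", rest2:"up"}
```

Only the remainders then need to go through the distance calculation, which shrinks both dimensions of the notional matrix.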

Since you mentioned Google: it uses a different algorithm from a spelling checker. Because Google uses statistics, the Levenshtein distance is indexed, so it doesn’t need to be computed each time you make a typo. Today’s suggestion can be something else tomorrow for the same search string.

McUsrII · June 11, 2015, 2:53pm

Yes, I am aware of the weighting of search words by Google. I recently saw a layout of the search engine, by the way, and it seemed pretty much like a monster to me. But I guess they did use something like Levenshtein back in the days when they didn’t have statistics for everything.

Spell checking is an interesting subject; one theme is autocompletion, which interests me. There are several open source projects that can be scavenged for that purpose, such as aspell, ispell, and xcalibur for spellchecking of documents.

I’m pretty sure your implementation will be an efficient one, and good luck!