Approximate string matching: The Levenshtein distance

McUsrII · May 31, 2015, 9:42pm

Hello.

I don’t know if you have seen the version of the levensthein distance I posted a while ago over at rosettacode.org. I did write it and posted it because someone needed it. My version is based on the fast matrix version, but I only use the Col column of the matrix, I actually did port it from the C version I found at rosettacode.org directly.

It is interesting to see other takes on the same algorithm.

Personally I prefer a handler named mindist for practical purposes, (and not implemented in AppleScript) that just checks for one edit operation, like change character, delete or insert, (a spelling distance of 1).

# Implementation of the "fast" C-version.
set dist to findLevenshteinDistance for "sunday" against "saturday"
to findLevenshteinDistance for s1 against s2
	script o
		property l : s1
		property m : s2
	end script
	if s1 = s2 then return 0
	set ll to length of s1
	set lm to length of s2
	if ll = 0 then return lm
	if lm = 0 then return ll
 
	set v0 to {}
 
	repeat with i from 1 to (lm + 1)
		set end of v0 to (i - 1)
	end repeat
	set item -1 of v0 to 0
	copy v0 to v1
 
	repeat with i from 1 to ll
		-- calculate v1 (current row distances) from the previous row v0
 
		-- first element of v1 is A[i+1][0]
		--   edit distance is delete (i+1) chars from s to match empty t
		set item 1 of v1 to i
		--  use formula to fill in the rest of the row
		repeat with j from 1 to lm
			if item i of o's l = item j of o's m then
				set cost to 0
			else
				set cost to 1
			end if
			set item (j + 1) of v1 to min3 for ((item j of v1) + 1) against ((item (j + 1) of v0) + 1) by ((item j of v0) + cost)
		end repeat
		copy v1 to v0
	end repeat
	return item (lm + 1) of v1
end findLevenshteinDistance
 
to min3 for anInt against anOther by theThird
	if anInt < anOther then
		if theThird < anInt then
			return theThird
		else
			return anInt
		end if
	else
		if theThird < anOther then
			return theThird
		else
			return anOther
		end if
	end if
end min3

DJ_Bazzie_Wazzie · June 1, 2015, 12:49am

I have had seen the code but I didn’t knew it was yours. Out of curiosity I re-wrote it in AppleScript to see how much performance difference there was with my own C implementation. Because it’s doing fine for single words, I decided to post it here also because it was missing from MS.

Today I submitted another code to rosettacode.org where my approach was quite different compared to the other examples. Sites like rosettacode.org unintentionally invites code golfing which leads to very interesting and sometimes odd at first solutions.

McUsrII · June 1, 2015, 1:07am

Hello.

Now I’m curious: how my AppleScript version came out, if you timed the three versions, and which algorithm you posted at rosetta code.

Edit

I did some timing on my own, and your second version seems to be around 4 times faster than mine. That is before I inline the min3 handler, but, I have decided to let it be like it is, so it is comparable to the C version, at rosetta code.

McUsrII · June 1, 2015, 8:48am

Hello.

It just interested me, to see if I could match yours by applying regular optimization. Now it performs considerably better. I learned something as well.


to findLevenshteinDistance for s1 against s2
	script o
		property l : characters of s1
		property m : characters of s2
		property v0 : {}
		property v1 : missing value
	end script
	if s1 = s2 then return 0
	set ll to length of s1
	set lm to length of s2
	if ll = 0 then return lm
	if lm = 0 then return ll
	
	repeat with i from 1 to (lm + 1)
		set end of o's v0 to (i - 1)
	end repeat
	set item -1 of o's v0 to 0
	set o's v1 to items of o's v0
	
	repeat with i from 1 to ll
		-- calculate v1 (current row distances) from the previous row v0
		
		-- first element of v1 is A[i+1][0]
		-- edit distance is delete (i+1) chars from s to match empty t
		set item 1 of o's v1 to i
		-- use formula to fill in the rest of the row
		repeat with j from 1 to lm
			if item i of o's l = item j of o's m then
				set cost to 0
			else
				set cost to 1
			end if
			if ((item j of o's v1) + 1) < ((item (j + 1) of o's v0) + 1) then
				if ((item j of o's v0) + cost) < ((item j of o's v1) + 1) then
					set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
				else
					set item (j + 1) of o's v1 to ((item j of o's v1) + 1)
				end if
			else
				if ((item j of o's v0) + cost) < ((item (j + 1) of o's v0) + 1) then
					set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
				else
					set item (j + 1) of o's v1 to ((item (j + 1) of o's v0) + 1)
				end if
			end if
		end repeat
		set o's v0 to items of o's v1
	end repeat
	return item (lm + 1) of o's v1
end findLevenshteinDistance

Nigel_Garvey · June 1, 2015, 11:37am

Hi.

I’ve had a go at further optimising DJ’s optimised handler.

The first effort reduces the number of minus-ones needed by using lower loop indices, eliminates the ‘if.else if.else’ tests by having a separate loop for the first row, uses ‘end of’ instead of ‘item y of’, and makes references to the current row more direct by giving it its own script-object property and adding it to the matrix only after it’s filled. I’ve modified a couple of DJ’s variable names because they clash with something on my machine (probably the Satimage OSAX).

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
		property |matrix| : {} -- to improve list access
		property charList1 : characters of s1
		property charList2 : characters of s2
		property currentRow : {}
	end script
	
	-- Create and fill the matrix. Firstly the simple first row.
	set o's currentRow to {}
	repeat with x from 0 to xLen
		set end of o's currentRow to x
	end repeat
	set end of o's |matrix| to o's currentRow
	-- Then the more complex remaining rows.
	repeat with y from 1 to yLen
		set o's currentRow to {y}
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of end of o's |matrix|)
			else
				set alternate to (item x of end of o's |matrix|) + 1
			end if
			set min to (item (x + 1) of end of o's |matrix|) + 1
			if deletion < min then set min to deletion
			if alternate < min then set min to alternate
			
			set end of o's currentRow to min
		end repeat
		
		set end of o's |matrix| to o's currentRow
	end repeat
	
	return end of end of o's |matrix|
end levenshteinDistance

Only the previous row of the matrix is used above during the creation of the current one and only the last item of the last row is returned. This means the matrix isn’t really needed at all, just a rolling system of current and previous rows:

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
		property charList1 : characters of s1
		property charList2 : characters of s2
		property previousRow : {}
		property currentRow : missing value
	end script
	
	-- Set up the first row.
	repeat with x from 0 to xLen
		set end of o's previousRow to x
	end repeat
	-- Handle the remaining rows in a rolling manner.
	repeat with y from 1 to yLen
		set o's currentRow to {y}
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of o's previousRow)
			else
				set alternate to (item x of o's previousRow) + 1
			end if
			set min to (item (x + 1) of o's previousRow) + 1
			if deletion < min then set min to deletion
			if alternate < min then set min to alternate
			
			set end of o's currentRow to min
		end repeat
		
		set o's previousRow to o's currentRow
	end repeat
	
	return end of o's previousRow
end levenshteinDistance

Edit: minor error corrected in the initialisation of the first row in both scripts. :rolleyes:

McUsrII · June 1, 2015, 11:53am

Hello Nigel.

Your last (brilliant) one beats mine with 3 hundreths of a second, which is reason for doing the optimization, in my eyes.
Now, you have adopted the exactly same scheme as my handler is built upon, I have to do a major rework of it to get any better results, so I’ll leave the matters be, as I expect to end up with a blue print of your version anyways.

DJ_Bazzie_Wazzie · June 1, 2015, 12:43pm

Thank you Nigel

Nigel_Garvey · June 1, 2015, 1:49pm

Hi McUsrII.

Yes. Now I’ve had a chance to study yours too, I can see that’s the case.

By the way, I’ve just corrected a minor error in my handlers which was causing the first row in each case to be one item too long. :rolleyes:

McUsrII · June 1, 2015, 2:22pm

Hello Nigel.

It is very hard to beat your version, with respect to time now. And it is an interesting scheme -just needing the contributions of one row/col of a matrix multiplication, because that saves a lot of multiplications and additions, it is not easy to number it, because the strings may be of varying length, but we end up with just a fraction of the number of operations in the order of the square root, so, that is an optimization I’ll look for in other procedures, related to linear algebra and vectors (ColB), maybe graph algorithms as well, but there (at least for transitive closure) you need to consider the whole matrix.

I also took away the “characters of” optimization in the script object, that was a significant optimization in my handler, for speeding up the access of the individual items (characers).

It was all in all very interesting to code up the algorithm, given it works the way it does.

Nigel_Garvey · June 1, 2015, 2:39pm

And one may want to consider using a ‘considering case’ statement, either round the main repeat in the handler or round the call to the handler itself.

McUsrII · June 1, 2015, 3:05pm

Absolutely.

DJ_Bazzie_Wazzie · June 1, 2015, 3:34pm

I haven’t timed the implementations but I think Nigel’s version on efficiency is the best version of all. In C for example, Nigel’s implementation would not be faster (marginally slower due to excessive allocations and freeing of memory) but when used with +1k data his memory usage would be much lower while the bandwidth remained the same due to using a much smaller variable buffer size.

Nigel_Garvey · June 1, 2015, 4:28pm

It’s possible to modify the rolling two-row version to reuse the same two “row” lists throughout the process:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
		property charList1 : characters of s1
		property charList2 : characters of s2
		property previousRow : missing value
		property currentRow : missing value
	end script
	
	-- Intitialise both "row" lists as the first row.
	set o's previousRow to {0} & o's charList1
	repeat with x from 1 to xLen
		set item (x + 1) of o's previousRow to x
	end repeat
	set o's currentRow to o's previousRow's items
	-- Handle the remaining rows in a rolling manner, the two lists alternating as the previous and current rows.
	repeat with y from 1 to yLen
		set item 1 of o's currentRow to y
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of o's previousRow)
			else
				set alternate to (item x of o's previousRow) + 1
			end if
			set min to (item (x + 1) of o's previousRow) + 1
			if deletion < min then set min to deletion
			if alternate < min then set min to alternate
			
			set item (x + 1) of o's currentRow to min
		end repeat
		
		tell o's previousRow
			set o's previousRow to o's currentRow
			set o's currentRow to it
		end tell
	end repeat
	
	return end of o's previousRow
end levenshteinDistance

McUsrII · June 1, 2015, 4:29pm

First, I clocked my version to use 1.26 seconds, while your second version used 0.47 seconds, then I performed the same optimizations as you had done, and I ended up at 0.3 seconds for it on my machine, then Nigel comes a long with his version at a whopping 0.27 seconds; all timings based on 100 iterations over the strings you provided.

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

DJ_Bazzie_Wazzie · June 1, 2015, 6:56pm

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

McUsrII · June 1, 2015, 7:07pm

DJ Bazzie Wazzie:

McUsrII:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you’d just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

DJ_Bazzie_Wazzie · June 1, 2015, 7:28pm

McUsrII:

DJ Bazzie Wazzie:

McUsrII:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you’d just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

I know, I was maybe ahead of you and I meant that there would no performance gain for the “v1 and v0” solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It’s an unwritten rule not to use stack memory for these kind of data.

McUsrII · June 1, 2015, 7:40pm

DJ Bazzie Wazzie:

McUsrII:

DJ Bazzie Wazzie:

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you’d just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

I know, I was maybe ahead of you and I meant that there would no performance gain for the “v1 and v0” solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It’s an unwritten rule not to use stack memory for these kind of data.

Maybe you were ahead of me, I don’t know, I just had the idea of using static storage, whether by declaring reasonably sized arrays up front, or allocating memory from the heap at program startup. And just for the record, Nigel seem to use the “pointer-flipping” optimization already, and I just named them v0 and v1, because I had forgotten what Nigel called them. I mentioned it, because you worried over bandwidth and memory allocations. My point is that those two problems can be almost optimized away with respect to variables that the function uses, save during program load and exit. I had no intention, what so ever of allocating the arrays on the stack.

Edit

I too are aware of the fact that all static variables are created in the data segment? (some segment, anyway), so, no matter how big static variables I declare in a C-function, they will not be allocated on the program stack at run time, like automatic variables.

DJ_Bazzie_Wazzie · June 2, 2015, 12:03am

Just for the record, that behaviour depends on the compiler (settings). Statics can be assigned to the data (initialized), bss(uninitialized) and code(both) segment of the executable. Also it would make no difference whether or not to use statics in this situation, unless you make it recursive. Therefore my reference to stack memory was more general in terms as a fixed memory space that is not heap memory and are all bound to the same limitations.

I don’t worry over bandwidth at all :), I think you misunderstood me. My point was that even when using Nigel’s approach you would use less memory even without gaining any performance. The reason for me to say that explicitly was because the topic looks like (read the timings) that the best implementation is the fastest, without regards to quality of the code itself. So without timing any of the solutions I liked Nigel’s script the most, even if it was slower (which is not).

McUsrII · June 2, 2015, 7:32am

Hello.

I agree in that Nigels second version, is also the most readable, so that is a good reason for me to like it to. After all, performance is just a currency, that we use to “buy” security and functionality for, so it is important, but not the most important property of a routine really, of course, unless it really is, for one routine.