Approximate string matching: The Levenshtein distance

McUsrII · June 1, 2015, 11:53am

Hello Nigel.

Your last (brilliant) one beats mine with 3 hundreths of a second, which is reason for doing the optimization, in my eyes.
Now, you have adopted the exactly same scheme as my handler is built upon, I have to do a major rework of it to get any better results, so I’ll leave the matters be, as I expect to end up with a blue print of your version anyways.

DJ_Bazzie_Wazzie · June 1, 2015, 12:43pm

Thank you Nigel

Nigel_Garvey · June 1, 2015, 1:49pm

Hi McUsrII.

Yes. Now I’ve had a chance to study yours too, I can see that’s the case.

By the way, I’ve just corrected a minor error in my handlers which was causing the first row in each case to be one item too long. :rolleyes:

McUsrII · June 1, 2015, 2:22pm

Hello Nigel.

It is very hard to beat your version, with respect to time now. And it is an interesting scheme -just needing the contributions of one row/col of a matrix multiplication, because that saves a lot of multiplications and additions, it is not easy to number it, because the strings may be of varying length, but we end up with just a fraction of the number of operations in the order of the square root, so, that is an optimization I’ll look for in other procedures, related to linear algebra and vectors (ColB), maybe graph algorithms as well, but there (at least for transitive closure) you need to consider the whole matrix.

I also took away the “characters of” optimization in the script object, that was a significant optimization in my handler, for speeding up the access of the individual items (characers).

It was all in all very interesting to code up the algorithm, given it works the way it does.

Nigel_Garvey · June 1, 2015, 2:39pm

And one may want to consider using a ‘considering case’ statement, either round the main repeat in the handler or round the call to the handler itself.

McUsrII · June 1, 2015, 3:05pm

Absolutely.

DJ_Bazzie_Wazzie · June 1, 2015, 3:34pm

I haven’t timed the implementations but I think Nigel’s version on efficiency is the best version of all. In C for example, Nigel’s implementation would not be faster (marginally slower due to excessive allocations and freeing of memory) but when used with +1k data his memory usage would be much lower while the bandwidth remained the same due to using a much smaller variable buffer size.

Nigel_Garvey · June 1, 2015, 4:28pm

It’s possible to modify the rolling two-row version to reuse the same two “row” lists throughout the process:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
		property charList1 : characters of s1
		property charList2 : characters of s2
		property previousRow : missing value
		property currentRow : missing value
	end script
	
	-- Intitialise both "row" lists as the first row.
	set o's previousRow to {0} & o's charList1
	repeat with x from 1 to xLen
		set item (x + 1) of o's previousRow to x
	end repeat
	set o's currentRow to o's previousRow's items
	-- Handle the remaining rows in a rolling manner, the two lists alternating as the previous and current rows.
	repeat with y from 1 to yLen
		set item 1 of o's currentRow to y
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of o's previousRow)
			else
				set alternate to (item x of o's previousRow) + 1
			end if
			set min to (item (x + 1) of o's previousRow) + 1
			if deletion < min then set min to deletion
			if alternate < min then set min to alternate
			
			set item (x + 1) of o's currentRow to min
		end repeat
		
		tell o's previousRow
			set o's previousRow to o's currentRow
			set o's currentRow to it
		end tell
	end repeat
	
	return end of o's previousRow
end levenshteinDistance

McUsrII · June 1, 2015, 4:29pm

First, I clocked my version to use 1.26 seconds, while your second version used 0.47 seconds, then I performed the same optimizations as you had done, and I ended up at 0.3 seconds for it on my machine, then Nigel comes a long with his version at a whopping 0.27 seconds; all timings based on 100 iterations over the strings you provided.

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

DJ_Bazzie_Wazzie · June 1, 2015, 6:56pm

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

McUsrII · June 1, 2015, 7:07pm

DJ Bazzie Wazzie:

McUsrII:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you’d just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

DJ_Bazzie_Wazzie · June 1, 2015, 7:28pm

McUsrII:

DJ Bazzie Wazzie:

McUsrII:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you’d just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

I know, I was maybe ahead of you and I meant that there would no performance gain for the “v1 and v0” solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It’s an unwritten rule not to use stack memory for these kind of data.

McUsrII · June 1, 2015, 7:40pm

DJ Bazzie Wazzie:

McUsrII:

DJ Bazzie Wazzie:

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you’d just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

I know, I was maybe ahead of you and I meant that there would no performance gain for the “v1 and v0” solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It’s an unwritten rule not to use stack memory for these kind of data.

Maybe you were ahead of me, I don’t know, I just had the idea of using static storage, whether by declaring reasonably sized arrays up front, or allocating memory from the heap at program startup. And just for the record, Nigel seem to use the “pointer-flipping” optimization already, and I just named them v0 and v1, because I had forgotten what Nigel called them. I mentioned it, because you worried over bandwidth and memory allocations. My point is that those two problems can be almost optimized away with respect to variables that the function uses, save during program load and exit. I had no intention, what so ever of allocating the arrays on the stack.

Edit

I too are aware of the fact that all static variables are created in the data segment? (some segment, anyway), so, no matter how big static variables I declare in a C-function, they will not be allocated on the program stack at run time, like automatic variables.

DJ_Bazzie_Wazzie · June 2, 2015, 12:03am

Just for the record, that behaviour depends on the compiler (settings). Statics can be assigned to the data (initialized), bss(uninitialized) and code(both) segment of the executable. Also it would make no difference whether or not to use statics in this situation, unless you make it recursive. Therefore my reference to stack memory was more general in terms as a fixed memory space that is not heap memory and are all bound to the same limitations.

I don’t worry over bandwidth at all :), I think you misunderstood me. My point was that even when using Nigel’s approach you would use less memory even without gaining any performance. The reason for me to say that explicitly was because the topic looks like (read the timings) that the best implementation is the fastest, without regards to quality of the code itself. So without timing any of the solutions I liked Nigel’s script the most, even if it was slower (which is not).

McUsrII · June 2, 2015, 7:32am

Hello.

I agree in that Nigels second version, is also the most readable, so that is a good reason for me to like it to. After all, performance is just a currency, that we use to “buy” security and functionality for, so it is important, but not the most important property of a routine really, of course, unless it really is, for one routine.

Nigel_Garvey · June 3, 2015, 12:28pm

A couple of further possibilities:

Make ‘charList1’ and ‘charList2’ the ‘id’ of the respective strings rather than the ‘characters’. This will make the character comparisons sensitive to everything and speed them up because they’ll be integer rather than text comparisons.
Test the strings (or the ‘id’ lists) for equality before proceeding with the rest of the process. This will greatly speed up the return of 0 for equal strings without noticeably slowing the treatment of unequal strings.

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
		property charList1 : id of s1 -- For everything sensitivity .
		property charList2 : id of s2 -- . and speed.
		property previousRow : missing value
		property currentRow : missing value
	end script
	
	-- Return 0 straight away if the two strings are equal.
	if (o's charList1 = o's charList2) then return 0
	
	-- Otherwise intitialise two "row" lists as the first row of a notional matrix.
	set o's previousRow to {0} & o's charList1
	repeat with x from 1 to xLen
		set item (x + 1) of o's previousRow to x
	end repeat
	set o's currentRow to o's previousRow's items
	-- Handle the remaining rows in a rolling manner, the two lists alternating as previous and current rows.
	repeat with y from 1 to yLen
		set item 1 of o's currentRow to y
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of o's previousRow)
			else
				set alternate to (item x of o's previousRow) + 1
			end if
			set min to (item (x + 1) of o's previousRow) + 1
			if (deletion < min) then set min to deletion
			if (alternate < min) then set min to alternate
			
			set item (x + 1) of o's currentRow to min
		end repeat
		
		tell o's previousRow
			set o's previousRow to o's currentRow
			set o's currentRow to it
		end tell
	end repeat
	
	return end of o's previousRow
end levenshteinDistance

McUsrII · June 3, 2015, 1:29pm

Hello Nigel.

Great ideas for optimizing the handler. I at least can se no reason why anybody would want such a handler to be case insensitive. I haven’t timed it, but I wager it is alot faster.

DJ_Bazzie_Wazzie · June 3, 2015, 1:57pm

Spelling corrections, where it is used most, is a mix of case sensitive and case insensitive. Also my finished project had to be case insensitive. However, when you still want it case insensitive you can convert both strings to the same case (lowercase) before calling this function.

McUsrII · June 4, 2015, 2:12pm

Hello.

Here is another idea, regarding spelling distances of filenames. The idea, is that a user types in a filename, or something else that is word sized, and you can iterate of the contents of the container of the items, and return the one with the least distance.

on spdist for s1 against s2
	-- returns coarsely the differences between two strings
	-- 0 if strings are identical
	-- 1 if two chars are transposed, or one is different
	-- 2 if one char added or deleted
	-- 3 otherwise.
	
	-- it is really meant solve problems with filenames that are
	-- misspelled by one character 
	-- Ported from "The Unix Programming Environment" p.211 
	-- by B.W Kernighan and R. Pike.
	
	script o
		property l : s1's id
		property m : s2's id
	end script
	
	set l1 to length of s1
	set l2 to length of s2
	
	if l1 = l2 then
		set diffs to 0
		set ante to 0
		repeat with i from 1 to l1
			if item i of o's l â‰  item i of o's m then
				if diffs = 0 then
					if ante = 0 then -- first error 
						set ante to 1 -- 1 spelling error.
					else if ante = 1 then
						if item i of o's l = item (i - 1) of o's m and item (i - 1) of o's l = item i of o's m then
							set diffs to 1 -- one full transpostion.
						else
							set diffs to 3 -- two spelling errors.
						end if
						set ante to -1
					end if
				else
					set diffs to 3
				end if
			end if
		end repeat
		if diffs = 0 and ante = 1 then set diffs to 1
	else
		if l1 > l2 then
			if (l1 - l2) < 2 then
				set diffs to 2
			else
				set diffs to 3
			end if
		else
			if (l2 - l1) < 2 then
				set diffs to 2
			else
				set diffs to 3
			end if
		end if
	end if
	return diffs
end spdist

Edit

So I mulled over it a little, and I changed the code to work similiar to the description, almost: A transpositon, may not be an actual transposition, and, a single misspelled character also counts as a transposition. So in the worst case, two independently misspelled characters counts as one transposition.

Edit++
It is changed slightly so it now differs between a “true” transposition, and different characters at two different places.

McUsrII · June 11, 2015, 11:14am

Hello.

I just found out that Levenshtein distances may be used for computing queries. I just provide a link for those interested. Levenshtein Automata (2010) | Hacker News. -You reach the original link by clicking on it at the top of the page.