Friday, July 20, 2018

You are not logged in.

#1 2015-05-30 06:18:54 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Approximate string matching: The Levenshtein distance

Fuzzy searching is a pretty popular term nowadays and when you have a typo in your favourite search engine it will ask you if you didn't meant something else. Or the spell checker in Mac OS X or FF underlines a wrongly typed word with a red dashed line and you can choose the proper word from a contextual menu when clicking on it. Vladimir Levenshtein has thought of an algorithm that will indicate how much edits are minimum required to turn one string into the other. The Levenshtein distance will only use insertion, deletion and alternation as 3 different types of required edits.

Note: AppleScript is a slow language and therefore longer strings will take up time exponentially. The handler can be improved but wrote it in an verbose way so it's clear how it works.

Applescript:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

set percentage to ((100 - 1 / (count string1) * 100 * edits) * 100 as integer) / 100
set msg to "Strings matches for " & percentage & "%, it requires " & edits & " single-character edits to turn one string into the other."

display dialog msg

on levenshteinDistance(s1, s2)
set matrix to {}
set xLen to count s1
set yLen to count s2

script o
property matrixRef : matrix -- to improve list access
end script

-- first create a matrix
repeat with y from 0 to yLen
set theRow to {}
repeat with x from 0 to xLen
if x = 0 then
set end of theRow to y
else if y = 0 then
set end of theRow to x
else
set end of theRow to 0
end if
end repeat
set end of o's matrixRef to theRow
end repeat

-- fill in the matrix
repeat with y from 2 to yLen + 1
repeat with x from 2 to xLen + 1
set charX to character (x - 1) of s1
set charY to character (y - 1) of s2

if charX = charY then
set edit to 0
else
set edit to 1
end if

set deletion to (item (x - 1) of item y of o's matrixRef) + 1
set insertion to (item x of item (y - 1) of o's matrixRef) + 1
set alternate to (item (x - 1) of item (y - 1) of o's matrixRef) + edit

set minimum to insertion

if deletion < minimum then set minimum to deletion
if alternate < minimum then set minimum to alternate

set item x of item y of o's matrixRef to minimum
end repeat
end repeat

-- return results
return item (xLen + 1) of item (yLen + 1) of o's matrixRef
end levenshteinDistance

Extension: The Damerau-Levenshtein is an popular extension which has a transposition as the 4th edit, right letters in the wrong order. Damerau stated that more than 80% of all spelling errors are 4 or these errors including transposition. So a 2 step edit which is a transposition doesn't weigh as 2 edits.

Last edited by DJ Bazzie Wazzie (2015-05-30 07:12:02 am)

Offline

#2 2015-05-30 07:56:26 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Re: Approximate string matching: The Levenshtein distance

Here an more optimized version where creating and filling the matrix is done in one go:

Applescript:

on levenshteinDistance(s1, s2)
set matrix to {}
set xLen to (count s1) + 1
set yLen to (count s2) + 1

script o
property matrixRef : matrix -- to improve list access
property charList1 : characters of s1
property charList2 : characters of s2
end script

-- create and fill the matrix
repeat with y from 1 to yLen
set end of o's matrixRef to {}
repeat with x from 1 to xLen
if x = 1 then
set end of item y of o's matrixRef to (y - 1)
else if y = 1 then
set end of item y of o's matrixRef to (x - 1)
else
set deletion to (item (x - 1) of item y of o's matrixRef) + 1
set alternate to (item (x - 1) of item (y - 1) of o's matrixRef) + ((item (x - 1) of o's charList1 is not item (y - 1) of o's charList2) as integer)
set minimum to (item x of item (y - 1) of o's matrixRef) + 1
if deletion < minimum then set minimum to deletion
if alternate < minimum then set minimum to alternate

set end of item y of o's matrixRef to minimum
end if
end repeat
end repeat
return item (xLen) of item (yLen) of o's matrixRef
end levenshteinDistance

Last edited by DJ Bazzie Wazzie (2015-05-30 08:10:41 am)

Offline

#3 2015-05-31 03:42:50 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

I don't know if you have seen the version of the levensthein distance I posted a while ago over at rosettacode.org. I did write it and posted it because someone needed it. My version is based on the fast matrix version, but I only use the Col column of the matrix, I actually did port it from the C version I found at rosettacode.org directly.

It is interesting to see other takes on the same algorithm.

Personally I prefer a handler named mindist for practical purposes, (and not implemented in AppleScript) that just checks for one edit operation, like change character, delete or insert, (a spelling distance of 1).

Applescript:

# Implementation of the "fast" C-version.
set dist to findLevenshteinDistance for "sunday" against "saturday"
to findLevenshteinDistance for s1 against s2
script o
property l : s1
property m : s2
end script
if s1 = s2 then return 0
set ll to length of s1
set lm to length of s2
if ll = 0 then return lm
if lm = 0 then return ll

set v0 to {}

repeat with i from 1 to (lm + 1)
set end of v0 to (i - 1)
end repeat
set item -1 of v0 to 0
copy v0 to v1

repeat with i from 1 to ll
-- calculate v1 (current row distances) from the previous row v0

-- first element of v1 is A[i+1][0]
-- edit distance is delete (i+1) chars from s to match empty t
set item 1 of v1 to i
-- use formula to fill in the rest of the row
repeat with j from 1 to lm
if item i of o's l = item j of o's m then
set cost to 0
else
set cost to 1
end if
set item (j + 1) of v1 to min3 for ((item j of v1) + 1) against ((item (j + 1) of v0) + 1) by ((item j of v0) + cost)
end repeat
copy v1 to v0
end repeat
return item (lm + 1) of v1
end findLevenshteinDistance

to min3 for anInt against anOther by theThird
if anInt < anOther then
if theThird < anInt then
return theThird
else
return anInt
end if
else
if theThird < anOther then
return theThird
else
return anOther
end if
end if
end min3

Last edited by McUsrII (2015-05-31 03:54:30 pm)

Offline

#4 2015-05-31 06:49:24 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I don't know if you have seen the version of the levensthein distance I posted a while ago over at rosettacode.org.

I have had seen the code but I didn't knew it was yours. Out of curiosity I re-wrote it in AppleScript to see how much performance difference there was with my own C implementation. Because it's doing fine for single words, I decided to post it here also because it was missing from MS.

McUsrII wrote:

It is interesting to see other takes on the same algorithm.

Today I submitted another code to rosettacode.org where my approach was quite different compared to the other examples. Sites like rosettacode.org unintentionally invites code golfing which leads to very interesting and sometimes odd at first solutions.

Last edited by DJ Bazzie Wazzie (2015-05-31 06:49:56 pm)

Offline

#5 2015-05-31 07:07:09 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

Now I'm curious: how my AppleScript version came out, if you timed the three versions, and which algorithm you posted at rosetta code.

Edit

I did some timing on my own, and your second version seems to be around 4 times faster than mine. That is before I inline the min3 handler, but, I have decided to let it be like it is, so it is comparable to the C version, at rosetta code.

Last edited by McUsrII (2015-06-01 02:50:08 am)

Offline

#6 2015-06-01 02:48:58 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

It just interested me, to see if I could match yours by applying regular optimization. Now it performs considerably better. I learned something as well.

Applescript:

to findLevenshteinDistance for s1 against s2
script o
property l : characters of s1
property m : characters of s2
property v0 : {}
property v1 : missing value
end script
if s1 = s2 then return 0
set ll to length of s1
set lm to length of s2
if ll = 0 then return lm
if lm = 0 then return ll

repeat with i from 1 to (lm + 1)
set end of o's v0 to (i - 1)
end repeat
set item -1 of o's v0 to 0
set o's v1 to items of o's v0

repeat with i from 1 to ll
-- calculate v1 (current row distances) from the previous row v0

-- first element of v1 is A[i+1][0]
-- edit distance is delete (i+1) chars from s to match empty t
set item 1 of o's v1 to i
-- use formula to fill in the rest of the row
repeat with j from 1 to lm
if item i of o's l = item j of o's m then
set cost to 0
else
set cost to 1
end if
if ((item j of o's v1) + 1) < ((item (j + 1) of o's v0) + 1) then
if ((item j of o's v0) + cost) < ((item j of o's v1) + 1) then
set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
else
set item (j + 1) of o's v1 to ((item j of o's v1) + 1)
end if
else
if ((item j of o's v0) + cost) < ((item (j + 1) of o's v0) + 1) then
set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
else
set item (j + 1) of o's v1 to ((item (j + 1) of o's v0) + 1)
end if
end if
end repeat
set o's v0 to items of o's v1
end repeat
return item (lm + 1) of o's v1
end findLevenshteinDistance

Offline

#7 2015-06-01 05:37:20 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4604

Re: Approximate string matching: The Levenshtein distance

Hi.

I've had a go at further optimising DJ's optimised handler.

The first effort reduces the number of minus-ones needed by using lower loop indices, eliminates the 'ifâ€¦else ifâ€¦else' tests by having a separate loop for the first row, uses 'end of' instead of 'item y of', and makes references to the current row more direct by giving it its own script-object property and adding it to the matrix only after it's filled. I've modified a couple of DJ's variable names because they clash with something on my machine (probably the Satimage OSAX).

Applescript:

on levenshteinDistance(s1, s2)
set xLen to (count s1)
set yLen to (count s2)

script o
property |matrix| : {} -- to improve list access
property charList1 : characters of s1
property charList2 : characters of s2
property currentRow : {}
end script

-- Create and fill the matrix. Firstly the simple first row.
set o's currentRow to {}
repeat with x from 0 to xLen
set end of o's currentRow to x
end repeat
set end of o's |matrix| to o's currentRow
-- Then the more complex remaining rows.
repeat with y from 1 to yLen
set o's currentRow to {y}
repeat with x from 1 to xLen
set deletion to (item x of o's currentRow) + 1
if (item x of o's charList1 is item y of o's charList2) then
set alternate to (item x of end of o's |matrix|)
else
set alternate to (item x of end of o's |matrix|) + 1
end if
set min to (item (x + 1) of end of o's |matrix|) + 1
if deletion < min then set min to deletion
if alternate < min then set min to alternate

set end of o's currentRow to min
end repeat

set end of o's |matrix| to o's currentRow
end repeat

return end of end of o's |matrix|
end levenshteinDistance

Only the previous row of the matrix is used above during the creation of the current one and only the last item of the last row is returned. This means the matrix isn't really needed at all, just a rolling system of current and previous rows:

Applescript:

on levenshteinDistance(s1, s2)
set xLen to (count s1)
set yLen to (count s2)

script o
property charList1 : characters of s1
property charList2 : characters of s2
property previousRow : {}
property currentRow : missing value
end script

-- Set up the first row.
repeat with x from 0 to xLen
set end of o's previousRow to x
end repeat
-- Handle the remaining rows in a rolling manner.
repeat with y from 1 to yLen
set o's currentRow to {y}
repeat with x from 1 to xLen
set deletion to (item x of o's currentRow) + 1
if (item x of o's charList1 is item y of o's charList2) then
set alternate to (item x of o's previousRow)
else
set alternate to (item x of o's previousRow) + 1
end if
set min to (item (x + 1) of o's previousRow) + 1
if deletion < min then set min to deletion
if alternate < min then set min to alternate

set end of o's currentRow to min
end repeat

set o's previousRow to o's currentRow
end repeat

return end of o's previousRow
end levenshteinDistance

Edit: minor error corrected in the initialisation of the first row in both scripts.

Last edited by Nigel Garvey (2015-06-01 07:42:09 am)

NG

Offline

#8 2015-06-01 05:53:15 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello Nigel.

Your last (brilliant) one beats mine with 3 hundreths of a second, which is reason for doing the optimization, in my eyes.
Now, you have adopted the exactly same scheme as my handler is built upon, I have to do a major rework of it to get any better results, so I'll leave the matters be, as I expect to end up with a blue print of your version anyways.

Last edited by McUsrII (2015-06-01 06:27:02 am)

Offline

#9 2015-06-01 06:43:14 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Thank you Nigel

Offline

#10 2015-06-01 07:49:48 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4604

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

â€¦ you have adopted the exactly same scheme as my handler is built upon â€¦

Hi McUsrII.

Yes. Now I've had a chance to study yours too, I can see that's the case.

By the way, I've just corrected a minor error in my handlers which was causing the first row in each case to be one item too long.

NG

Offline

#11 2015-06-01 08:22:05 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello Nigel.

It is very hard to beat your version, with respect to time now. And  it is an interesting scheme -just needing the contributions of one row/col of a matrix multiplication, because that saves a lot of multiplications and additions, it is not easy to number it, because the strings may be of varying length, but we end up with just a fraction of the number of operations in the order of the square root, so, that is an optimization I'll look for in other procedures, related to linear algebra and vectors (ColB), maybe graph algorithms as well, but there (at least for transitive closure) you need to consider the whole matrix.

I also took away the "characters of" optimization in the script object, that was a significant optimization in my handler, for speeding up the access of the individual items (characers).

It was all in all very interesting to code up the algorithm, given it works the way it does.

Offline

#12 2015-06-01 08:39:30 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4604

Re: Approximate string matching: The Levenshtein distance

And one may want to consider using a 'considering case' statement, either round the main repeat in the handler or round the call to the handler itself.

NG

Offline

#13 2015-06-01 09:05:38 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Absolutely.

Offline

#14 2015-06-01 09:34:53 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

It is very hard to beat your version, with respect to time now. And  it is an interesting scheme...

I haven't timed the implementations but I think Nigel's version on efficiency is the best version of all. In C for example, Nigel's implementation would not be faster (marginally slower due to excessive allocations and freeing of memory) but when used with +1k data his memory usage would be much lower while the bandwidth remained the same due to using a much smaller variable buffer size.

Offline

#15 2015-06-01 10:28:28 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4604

Re: Approximate string matching: The Levenshtein distance

It's possible to modify the rolling two-row version to reuse the same two "row" lists throughout the process:

Applescript:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
set xLen to (count s1)
set yLen to (count s2)

script o
property charList1 : characters of s1
property charList2 : characters of s2
property previousRow : missing value
property currentRow : missing value
end script

-- Intitialise both "row" lists as the first row.
set o's previousRow to {0} & o's charList1
repeat with x from 1 to xLen
set item (x + 1) of o's previousRow to x
end repeat
set o's currentRow to o's previousRow's items
-- Handle the remaining rows in a rolling manner, the two lists alternating as the previous and current rows.
repeat with y from 1 to yLen
set item 1 of o's currentRow to y
repeat with x from 1 to xLen
set deletion to (item x of o's currentRow) + 1
if (item x of o's charList1 is item y of o's charList2) then
set alternate to (item x of o's previousRow)
else
set alternate to (item x of o's previousRow) + 1
end if
set min to (item (x + 1) of o's previousRow) + 1
if deletion < min then set min to deletion
if alternate < min then set min to alternate

set item (x + 1) of o's currentRow to min
end repeat

tell o's previousRow
set o's previousRow to o's currentRow
set o's currentRow to it
end tell
end repeat

return end of o's previousRow
end levenshteinDistance

Last edited by Nigel Garvey (2015-06-01 10:35:45 am)

NG

Offline

#16 2015-06-01 10:29:11 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

DJ Bazzie Wazzie wrote:
McUsrII wrote:

It is very hard to beat your version, with respect to time now. And  it is an interesting scheme...

I haven't timed the implementations but I think Nigel's version on efficiency is the best version of all. In C for example, Nigel's implementation would not be faster (marginally slower due to excessive allocations and freeing of memory) but when used with +1k data his memory usage would be much lower while the bandwidth remained the same due to using a much smaller variable buffer size.

First, I clocked my version to use 1.26 seconds, while your second version used 0.47 seconds, then I performed the same optimizations as you had done, and I ended up at  0.3 seconds for it on my machine, then Nigel comes a long with his version at a whopping 0.27 seconds; all timings based on 100 iterations over the strings you provided.

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

Last edited by McUsrII (2015-06-01 10:30:17 am)

Offline

#17 2015-06-01 12:56:43 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times

Offline

#18 2015-06-01 01:07:55 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

DJ Bazzie Wazzie wrote:
McUsrII wrote:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you'd just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

Last edited by McUsrII (2015-06-01 01:08:42 pm)

Offline

#19 2015-06-01 01:28:48 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:
DJ Bazzie Wazzie wrote:
McUsrII wrote:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you'd just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

I know, I was maybe ahead of you and I meant that there would no performance gain for the "v1 and v0" solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It's an unwritten rule not to use stack memory for these kind of data.

Offline

#20 2015-06-01 01:40:52 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

DJ Bazzie Wazzie wrote:
McUsrII wrote:
DJ Bazzie Wazzie wrote:

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you'd just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

I know, I was maybe ahead of you and I meant that there would no performance gain for the "v1 and v0" solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It's an unwritten rule not to use stack memory for these kind of data.

Maybe you were ahead of me, I don't know, I just had the idea of using static storage, whether by declaring reasonably sized arrays up front, or allocating memory from the heap at program startup. And just for the record, Nigel seem to use the "pointer-flipping" optimization already, and I just named them v0 and v1, because I had forgotten what Nigel called them. I mentioned it, because you worried over bandwidth and memory allocations. My point is that those two problems can be almost optimized away with respect to variables that the function uses, save during program load and exit. I had no intention, what so ever of allocating the arrays on the stack.

Edit

I too are aware of the fact that all static variables are created in the data segment? (some segment, anyway), so, no matter how big static variables I declare in a C-function, they will not be allocated on the program stack at run time, like automatic variables.

Last edited by McUsrII (2015-06-01 01:50:20 pm)

Offline

#21 2015-06-01 06:03:03 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I too are aware of the fact that all static variables are created in the data segment? (some segment, anyway), so, no matter how big static variables I declare in a C-function, they will not be allocated on the program stack at run time, like automatic variables.

Just for the record, that behaviour depends on the compiler (settings). Statics can be assigned to the data (initialized), bss(uninitialized) and code(both) segment of the executable. Also it would make no difference whether or not to use statics in this situation, unless you make it recursive. Therefore my reference to stack memory was more general in terms as a fixed memory space that is not heap memory and are all bound to the same limitations.

I don't worry over bandwidth at all , I think you misunderstood me. My point was that even when using Nigel's approach you would use less memory even without gaining any performance. The reason for me to say that explicitly was because the topic looks like (read the timings) that the best implementation is the fastest, without regards to quality of the code itself. So without timing any of the solutions I liked Nigel's script the most, even if it was slower (which is not).

Last edited by DJ Bazzie Wazzie (2015-06-02 02:25:37 am)

Offline

#22 2015-06-02 01:32:14 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

I agree in that Nigels second version, is also the most readable, so that is a good reason for me to like it to. After all, performance is just a currency, that we use to "buy" security and functionality for, so it is important, but not the most important property of a routine really, of course, unless it really is, for one routine.

Offline

#23 2015-06-03 06:28:59 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4604

Re: Approximate string matching: The Levenshtein distance

A couple of further possibilities:

1. Make 'charList1' and 'charList2' the 'id' of the respective strings rather than the 'characters'. This will make the character comparisons sensitive to everything and speed them up because they'll be integer rather than text comparisons.

2. Test the strings (or the 'id' lists) for equality before proceeding with the rest of the process. This will greatly speed up the return of 0 for equal strings without noticeably slowing the treatment of unequal strings.

Applescript:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
set xLen to (count s1)
set yLen to (count s2)

script o
property charList1 : id of s1 -- For everything sensitivity â€¦
property charList2 : id of s2 -- â€¦ and speed.
property previousRow : missing value
property currentRow : missing value
end script

-- Return 0 straight away if the two strings are equal.
if (o's charList1 = o's charList2) then return 0

-- Otherwise intitialise two "row" lists as the first row of a notional matrix.
set o's previousRow to {0} & o's charList1
repeat with x from 1 to xLen
set item (x + 1) of o's previousRow to x
end repeat
set o's currentRow to o's previousRow's items
-- Handle the remaining rows in a rolling manner, the two lists alternating as previous and current rows.
repeat with y from 1 to yLen
set item 1 of o's currentRow to y
repeat with x from 1 to xLen
set deletion to (item x of o's currentRow) + 1
if (item x of o's charList1 is item y of o's charList2) then
set alternate to (item x of o's previousRow)
else
set alternate to (item x of o's previousRow) + 1
end if
set min to (item (x + 1) of o's previousRow) + 1
if (deletion < min) then set min to deletion
if (alternate < min) then set min to alternate

set item (x + 1) of o's currentRow to min
end repeat

tell o's previousRow
set o's previousRow to o's currentRow
set o's currentRow to it
end tell
end repeat

return end of o's previousRow
end levenshteinDistance

NG

Offline

#24 2015-06-03 07:29:52 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello Nigel.

Great ideas for optimizing the handler. I at least can se no reason why anybody would want such a handler to be case insensitive. I haven't timed it, but I wager it is *alot* faster.

Offline

#25 2015-06-03 07:57:35 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2778
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I at least can se no reason why anybody would want such a handler to be case insensitive.

Spelling corrections, where it is used most, is a mix of case sensitive and case insensitive. Also my finished project had to be case insensitive. However, when you still want it case insensitive you can convert both strings to the same case (lowercase) before calling this function.

Offline