Thursday, November 23, 2017

#1 2015-05-30 07:18:54 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Approximate string matching: The Levenshtein distance

Fuzzy searching is a pretty popular term nowadays and when you have a typo in your favourite search engine it will ask you if you didn't meant something else. Or the spell checker in Mac OS X or FF underlines a wrongly typed word with a red dashed line and you can choose the proper word from a contextual menu when clicking on it. Vladimir Levenshtein has thought of an algorithm that will indicate how much edits are minimum required to turn one string into the other. The Levenshtein distance will only use insertion, deletion and alternation as 3 different types of required edits.

Note: AppleScript is a slow language and therefore longer strings will take up time exponentially. The handler can be improved but wrote it in an verbose way so it's clear how it works.

Applescript:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

set percentage to ((100 - 1 / (count string1) * 100 * edits) * 100 as integer) / 100
set msg to "Strings matches for " & percentage & "%, it requires " & edits & " single-character edits to turn one string into the other."

display dialog msg

on levenshteinDistance(s1, s2)
   set matrix to {}
   set xLen to count s1
   set yLen to count s2
   
   script o
       property matrixRef : matrix -- to improve list access
   end script
   
   -- first create a matrix
   repeat with y from 0 to yLen
       set theRow to {}
       repeat with x from 0 to xLen
           if x = 0 then
               set end of theRow to y
           else if y = 0 then
               set end of theRow to x
           else
               set end of theRow to 0
           end if
       end repeat
       set end of o's matrixRef to theRow
   end repeat
   
   -- fill in the matrix
   repeat with y from 2 to yLen + 1
       repeat with x from 2 to xLen + 1
           set charX to character (x - 1) of s1
           set charY to character (y - 1) of s2
           
           if charX = charY then
               set edit to 0
           else
               set edit to 1
           end if
           
           set deletion to (item (x - 1) of item y of o's matrixRef) + 1
           set insertion to (item x of item (y - 1) of o's matrixRef) + 1
           set alternate to (item (x - 1) of item (y - 1) of o's matrixRef) + edit
           
           set minimum to insertion
           
           if deletion < minimum then set minimum to deletion
           if alternate < minimum then set minimum to alternate
           
           set item x of item y of o's matrixRef to minimum
       end repeat
   end repeat
   
   -- return results
   return item (xLen + 1) of item (yLen + 1) of o's matrixRef
end levenshteinDistance

Extension: The Damerau-Levenshtein is an popular extension which has a transposition as the 4th edit, right letters in the wrong order. Damerau stated that more than 80% of all spelling errors are 4 or these errors including transposition. So a 2 step edit which is a transposition doesn't weigh as 2 edits.

Last edited by DJ Bazzie Wazzie (2015-05-30 08:12:02 am)

Offline

 

#2 2015-05-30 08:56:26 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

Here an more optimized version where creating and filling the matrix is done in one go:

Applescript:

on levenshteinDistance(s1, s2)
   set matrix to {}
   set xLen to (count s1) + 1
   set yLen to (count s2) + 1
   
   script o
       property matrixRef : matrix -- to improve list access
       property charList1 : characters of s1
       property charList2 : characters of s2
   end script
   
   -- create and fill the matrix
   repeat with y from 1 to yLen
       set end of o's matrixRef to {}
       repeat with x from 1 to xLen
           if x = 1 then
               set end of item y of o's matrixRef to (y - 1)
           else if y = 1 then
               set end of item y of o's matrixRef to (x - 1)
           else
               set deletion to (item (x - 1) of item y of o's matrixRef) + 1
               set alternate to (item (x - 1) of item (y - 1) of o's matrixRef) + ((item (x - 1) of o's charList1 is not item (y - 1) of o's charList2) as integer)
               set minimum to (item x of item (y - 1) of o's matrixRef) + 1
               if deletion < minimum then set minimum to deletion
               if alternate < minimum then set minimum to alternate
               
               set end of item y of o's matrixRef to minimum
           end if
       end repeat
   end repeat
   return item (xLen) of item (yLen) of o's matrixRef
end levenshteinDistance

Last edited by DJ Bazzie Wazzie (2015-05-30 09:10:41 am)

Offline

 

#3 2015-05-31 04:42:50 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

I don't know if you have seen the version of the levensthein distance I posted a while ago over at rosettacode.org. I did write it and posted it because someone needed it. My version is based on the fast matrix version, but I only use the Col column of the matrix, I actually did port it from the C version I found at rosettacode.org directly.

It is interesting to see other takes on the same algorithm. smile

Personally I prefer a handler named mindist for practical purposes, (and not implemented in AppleScript) that just checks for one edit operation, like change character, delete or insert, (a spelling distance of 1).

Applescript:

# Implementation of the "fast" C-version.
set dist to findLevenshteinDistance for "sunday" against "saturday"
to findLevenshteinDistance for s1 against s2
   script o
       property l : s1
       property m : s2
   end script
   if s1 = s2 then return 0
   set ll to length of s1
   set lm to length of s2
   if ll = 0 then return lm
   if lm = 0 then return ll

   set v0 to {}

   repeat with i from 1 to (lm + 1)
       set end of v0 to (i - 1)
   end repeat
   set item -1 of v0 to 0
   copy v0 to v1

   repeat with i from 1 to ll
       -- calculate v1 (current row distances) from the previous row v0

       -- first element of v1 is A[i+1][0]
       -- edit distance is delete (i+1) chars from s to match empty t
       set item 1 of v1 to i
       -- use formula to fill in the rest of the row
       repeat with j from 1 to lm
           if item i of o's l = item j of o's m then
               set cost to 0
           else
               set cost to 1
           end if
           set item (j + 1) of v1 to min3 for ((item j of v1) + 1) against ((item (j + 1) of v0) + 1) by ((item j of v0) + cost)
       end repeat
       copy v1 to v0
   end repeat
   return item (lm + 1) of v1
end findLevenshteinDistance

to min3 for anInt against anOther by theThird
   if anInt < anOther then
       if theThird < anInt then
           return theThird
       else
           return anInt
       end if
   else
       if theThird < anOther then
           return theThird
       else
           return anOther
       end if
   end if
end min3

Last edited by McUsrII (2015-05-31 04:54:30 pm)

Offline

 

#4 2015-05-31 07:49:24 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I don't know if you have seen the version of the levensthein distance I posted a while ago over at rosettacode.org.


I have had seen the code but I didn't knew it was yours. Out of curiosity I re-wrote it in AppleScript to see how much performance difference there was with my own C implementation. Because it's doing fine for single words, I decided to post it here also because it was missing from MS.

McUsrII wrote:

It is interesting to see other takes on the same algorithm. smile


Today I submitted another code to rosettacode.org where my approach was quite different compared to the other examples. Sites like rosettacode.org unintentionally invites code golfing which leads to very interesting and sometimes odd at first solutions.

Last edited by DJ Bazzie Wazzie (2015-05-31 07:49:56 pm)

Offline

 

#5 2015-05-31 08:07:09 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

Now I'm curious: how my AppleScript version came out, if you timed the three versions, and which algorithm you posted at rosetta code. smile

Edit

I did some timing on my own, and your second version seems to be around 4 times faster than mine. That is before I inline the min3 handler, but, I have decided to let it be like it is, so it is comparable to the C version, at rosetta code.

Last edited by McUsrII (2015-06-01 03:50:08 am)

Offline

 

#6 2015-06-01 03:48:58 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

It just interested me, to see if I could match yours by applying regular optimization. Now it performs considerably better. I learned something as well. smile

Applescript:


to findLevenshteinDistance for s1 against s2
   script o
       property l : characters of s1
       property m : characters of s2
       property v0 : {}
       property v1 : missing value
   end script
   if s1 = s2 then return 0
   set ll to length of s1
   set lm to length of s2
   if ll = 0 then return lm
   if lm = 0 then return ll
   
   repeat with i from 1 to (lm + 1)
       set end of o's v0 to (i - 1)
   end repeat
   set item -1 of o's v0 to 0
   set o's v1 to items of o's v0
   
   repeat with i from 1 to ll
       -- calculate v1 (current row distances) from the previous row v0
       
       -- first element of v1 is A[i+1][0]
       -- edit distance is delete (i+1) chars from s to match empty t
       set item 1 of o's v1 to i
       -- use formula to fill in the rest of the row
       repeat with j from 1 to lm
           if item i of o's l = item j of o's m then
               set cost to 0
           else
               set cost to 1
           end if
           if ((item j of o's v1) + 1) < ((item (j + 1) of o's v0) + 1) then
               if ((item j of o's v0) + cost) < ((item j of o's v1) + 1) then
                   set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
               else
                   set item (j + 1) of o's v1 to ((item j of o's v1) + 1)
               end if
           else
               if ((item j of o's v0) + cost) < ((item (j + 1) of o's v0) + 1) then
                   set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
               else
                   set item (j + 1) of o's v1 to ((item (j + 1) of o's v0) + 1)
               end if
           end if
       end repeat
       set o's v0 to items of o's v1
   end repeat
   return item (lm + 1) of o's v1
end findLevenshteinDistance

Offline

 

#7 2015-06-01 06:37:20 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4424

Re: Approximate string matching: The Levenshtein distance

Hi.

I've had a go at further optimising DJ's optimised handler.

The first effort reduces the number of minus-ones needed by using lower loop indices, eliminates the 'if…else if…else' tests by having a separate loop for the first row, uses 'end of' instead of 'item y of', and makes references to the current row more direct by giving it its own script-object property and adding it to the matrix only after it's filled. I've modified a couple of DJ's variable names because they clash with something on my machine (probably the Satimage OSAX).

Applescript:

on levenshteinDistance(s1, s2)
   set xLen to (count s1)
   set yLen to (count s2)
   
   script o
       property |matrix| : {} -- to improve list access
       property charList1 : characters of s1
       property charList2 : characters of s2
       property currentRow : {}
   end script
   
   -- Create and fill the matrix. Firstly the simple first row.
   set o's currentRow to {}
   repeat with x from 0 to xLen
       set end of o's currentRow to x
   end repeat
   set end of o's |matrix| to o's currentRow
   -- Then the more complex remaining rows.
   repeat with y from 1 to yLen
       set o's currentRow to {y}
       repeat with x from 1 to xLen
           set deletion to (item x of o's currentRow) + 1
           if (item x of o's charList1 is item y of o's charList2) then
               set alternate to (item x of end of o's |matrix|)
           else
               set alternate to (item x of end of o's |matrix|) + 1
           end if
           set min to (item (x + 1) of end of o's |matrix|) + 1
           if deletion < min then set min to deletion
           if alternate < min then set min to alternate
           
           set end of o's currentRow to min
       end repeat
       
       set end of o's |matrix| to o's currentRow
   end repeat
   
   return end of end of o's |matrix|
end levenshteinDistance

Only the previous row of the matrix is used above during the creation of the current one and only the last item of the last row is returned. This means the matrix isn't really needed at all, just a rolling system of current and previous rows:

Applescript:

on levenshteinDistance(s1, s2)
   set xLen to (count s1)
   set yLen to (count s2)
   
   script o
       property charList1 : characters of s1
       property charList2 : characters of s2
       property previousRow : {}
       property currentRow : missing value
   end script
   
   -- Set up the first row.
   repeat with x from 0 to xLen
       set end of o's previousRow to x
   end repeat
   -- Handle the remaining rows in a rolling manner.
   repeat with y from 1 to yLen
       set o's currentRow to {y}
       repeat with x from 1 to xLen
           set deletion to (item x of o's currentRow) + 1
           if (item x of o's charList1 is item y of o's charList2) then
               set alternate to (item x of o's previousRow)
           else
               set alternate to (item x of o's previousRow) + 1
           end if
           set min to (item (x + 1) of o's previousRow) + 1
           if deletion < min then set min to deletion
           if alternate < min then set min to alternate
           
           set end of o's currentRow to min
       end repeat
       
       set o's previousRow to o's currentRow
   end repeat
   
   return end of o's previousRow
end levenshteinDistance

Edit: minor error corrected in the initialisation of the first row in both scripts.  roll

Last edited by Nigel Garvey (2015-06-01 08:42:09 am)


NG

Offline

 

#8 2015-06-01 06:53:15 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello Nigel.

Your last (brilliant) one beats mine with 3 hundreths of a second, which is reason for doing the optimization, in my eyes.
Now, you have adopted the exactly same scheme as my handler is built upon, I have to do a major rework of it to get any better results, so I'll leave the matters be, as I expect to end up with a blue print of your version anyways. smile

Last edited by McUsrII (2015-06-01 07:27:02 am)

Offline

 

#9 2015-06-01 07:43:14 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

Thank you Nigel cool

Offline

 

#10 2015-06-01 08:49:48 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4424

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

… you have adopted the exactly same scheme as my handler is built upon …


Hi McUsrII.

Yes. Now I've had a chance to study yours too, I can see that's the case.  smile

By the way, I've just corrected a minor error in my handlers which was causing the first row in each case to be one item too long.  roll


NG

Offline

 

#11 2015-06-01 09:22:05 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello Nigel.

It is very hard to beat your version, with respect to time now. smile And  it is an interesting scheme -just needing the contributions of one row/col of a matrix multiplication, because that saves a lot of multiplications and additions, it is not easy to number it, because the strings may be of varying length, but we end up with just a fraction of the number of operations in the order of the square root, so, that is an optimization I'll look for in other procedures, related to linear algebra and vectors (ColB), maybe graph algorithms as well, but there (at least for transitive closure) you need to consider the whole matrix.

I also took away the "characters of" optimization in the script object, that was a significant optimization in my handler, for speeding up the access of the individual items (characers).

It was all in all very interesting to code up the algorithm, given it works the way it does.

Offline

 

#12 2015-06-01 09:39:30 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4424

Re: Approximate string matching: The Levenshtein distance

And one may want to consider using a 'considering case' statement, either round the main repeat in the handler or round the call to the handler itself.


NG

Offline

 

#13 2015-06-01 10:05:38 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Absolutely. big_smile

Offline

 

#14 2015-06-01 10:34:53 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

It is very hard to beat your version, with respect to time now. smile And  it is an interesting scheme...


I haven't timed the implementations but I think Nigel's version on efficiency is the best version of all. In C for example, Nigel's implementation would not be faster (marginally slower due to excessive allocations and freeing of memory) but when used with +1k data his memory usage would be much lower while the bandwidth remained the same due to using a much smaller variable buffer size.

Offline

 

#15 2015-06-01 11:28:28 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4424

Re: Approximate string matching: The Levenshtein distance

It's possible to modify the rolling two-row version to reuse the same two "row" lists throughout the process:

Applescript:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
   set xLen to (count s1)
   set yLen to (count s2)
   
   script o
       property charList1 : characters of s1
       property charList2 : characters of s2
       property previousRow : missing value
       property currentRow : missing value
   end script
   
   -- Intitialise both "row" lists as the first row.
   set o's previousRow to {0} & o's charList1
   repeat with x from 1 to xLen
       set item (x + 1) of o's previousRow to x
   end repeat
   set o's currentRow to o's previousRow's items
   -- Handle the remaining rows in a rolling manner, the two lists alternating as the previous and current rows.
   repeat with y from 1 to yLen
       set item 1 of o's currentRow to y
       repeat with x from 1 to xLen
           set deletion to (item x of o's currentRow) + 1
           if (item x of o's charList1 is item y of o's charList2) then
               set alternate to (item x of o's previousRow)
           else
               set alternate to (item x of o's previousRow) + 1
           end if
           set min to (item (x + 1) of o's previousRow) + 1
           if deletion < min then set min to deletion
           if alternate < min then set min to alternate
           
           set item (x + 1) of o's currentRow to min
       end repeat
       
       tell o's previousRow
           set o's previousRow to o's currentRow
           set o's currentRow to it
       end tell
   end repeat
   
   return end of o's previousRow
end levenshteinDistance

Last edited by Nigel Garvey (2015-06-01 11:35:45 am)


NG

Offline

 

#16 2015-06-01 11:29:11 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

DJ Bazzie Wazzie wrote:
McUsrII wrote:

It is very hard to beat your version, with respect to time now. smile And  it is an interesting scheme...


I haven't timed the implementations but I think Nigel's version on efficiency is the best version of all. In C for example, Nigel's implementation would not be faster (marginally slower due to excessive allocations and freeing of memory) but when used with +1k data his memory usage would be much lower while the bandwidth remained the same due to using a much smaller variable buffer size.


First, I clocked my version to use 1.26 seconds, while your second version used 0.47 seconds, then I performed the same optimizations as you had done, and I ended up at  0.3 seconds for it on my machine, then Nigel comes a long with his version at a whopping 0.27 seconds; all timings based on 100 iterations over the strings you provided.

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage. smile

Last edited by McUsrII (2015-06-01 11:30:17 am)

Offline

 

#17 2015-06-01 01:56:43 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage. smile


What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times smile

Offline

 

#18 2015-06-01 02:07:55 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

DJ Bazzie Wazzie wrote:
McUsrII wrote:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage. smile


What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times smile


I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you'd just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact. smile

Last edited by McUsrII (2015-06-01 02:08:42 pm)

Offline

 

#19 2015-06-01 02:28:48 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:
DJ Bazzie Wazzie wrote:
McUsrII wrote:

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage. smile


What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times smile


I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you'd just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact. smile


I know, I was maybe ahead of you and I meant that there would no performance gain for the "v1 and v0" solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It's an unwritten rule not to use stack memory for these kind of data.

Offline

 

#20 2015-06-01 02:40:52 pm

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

DJ Bazzie Wazzie wrote:
McUsrII wrote:
DJ Bazzie Wazzie wrote:

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn't matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel's code because it's efficiency in used resources.

Thanks for the times smile


I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you'd just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact. smile


I know, I was maybe ahead of you and I meant that there would no performance gain for the "v1 and v0" solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It's an unwritten rule not to use stack memory for these kind of data.


Maybe you were ahead of me, I don't know, I just had the idea of using static storage, whether by declaring reasonably sized arrays up front, or allocating memory from the heap at program startup. And just for the record, Nigel seem to use the "pointer-flipping" optimization already, and I just named them v0 and v1, because I had forgotten what Nigel called them. I mentioned it, because you worried over bandwidth and memory allocations. My point is that those two problems can be almost optimized away with respect to variables that the function uses, save during program load and exit. I had no intention, what so ever of allocating the arrays on the stack. smile

Edit

I too are aware of the fact that all static variables are created in the data segment? (some segment, anyway), so, no matter how big static variables I declare in a C-function, they will not be allocated on the program stack at run time, like automatic variables.

Last edited by McUsrII (2015-06-01 02:50:20 pm)

Offline

 

#21 2015-06-01 07:03:03 pm

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I too are aware of the fact that all static variables are created in the data segment? (some segment, anyway), so, no matter how big static variables I declare in a C-function, they will not be allocated on the program stack at run time, like automatic variables.


Just for the record, that behaviour depends on the compiler (settings). Statics can be assigned to the data (initialized), bss(uninitialized) and code(both) segment of the executable. Also it would make no difference whether or not to use statics in this situation, unless you make it recursive. Therefore my reference to stack memory was more general in terms as a fixed memory space that is not heap memory and are all bound to the same limitations.

I don't worry over bandwidth at all smile, I think you misunderstood me. My point was that even when using Nigel's approach you would use less memory even without gaining any performance. The reason for me to say that explicitly was because the topic looks like (read the timings) that the best implementation is the fastest, without regards to quality of the code itself. So without timing any of the solutions I liked Nigel's script the most, even if it was slower (which is not).

Last edited by DJ Bazzie Wazzie (2015-06-02 03:25:37 am)

Offline

 

#22 2015-06-02 02:32:14 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

I agree in that Nigels second version, is also the most readable, so that is a good reason for me to like it to. After all, performance is just a currency, that we use to "buy" security and functionality for, so it is important, but not the most important property of a routine really, of course, unless it really is, for one routine. smile

Offline

 

#23 2015-06-03 07:28:59 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 4424

Re: Approximate string matching: The Levenshtein distance

A couple of further possibilities:

1. Make 'charList1' and 'charList2' the 'id' of the respective strings rather than the 'characters'. This will make the character comparisons sensitive to everything and speed them up because they'll be integer rather than text comparisons.

2. Test the strings (or the 'id' lists) for equality before proceeding with the rest of the process. This will greatly speed up the return of 0 for equal strings without noticeably slowing the treatment of unequal strings.

Applescript:

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
   set xLen to (count s1)
   set yLen to (count s2)
   
   script o
       property charList1 : id of s1 -- For everything sensitivity …
       property charList2 : id of s2 -- … and speed.
       property previousRow : missing value
       property currentRow : missing value
   end script
   
   -- Return 0 straight away if the two strings are equal.
   if (o's charList1 = o's charList2) then return 0
   
   -- Otherwise intitialise two "row" lists as the first row of a notional matrix.
   set o's previousRow to {0} & o's charList1
   repeat with x from 1 to xLen
       set item (x + 1) of o's previousRow to x
   end repeat
   set o's currentRow to o's previousRow's items
   -- Handle the remaining rows in a rolling manner, the two lists alternating as previous and current rows.
   repeat with y from 1 to yLen
       set item 1 of o's currentRow to y
       repeat with x from 1 to xLen
           set deletion to (item x of o's currentRow) + 1
           if (item x of o's charList1 is item y of o's charList2) then
               set alternate to (item x of o's previousRow)
           else
               set alternate to (item x of o's previousRow) + 1
           end if
           set min to (item (x + 1) of o's previousRow) + 1
           if (deletion < min) then set min to deletion
           if (alternate < min) then set min to alternate
           
           set item (x + 1) of o's currentRow to min
       end repeat
       
       tell o's previousRow
           set o's previousRow to o's currentRow
           set o's currentRow to it
       end tell
   end repeat
   
   return end of o's previousRow
end levenshteinDistance


NG

Offline

 

#24 2015-06-03 08:29:52 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello Nigel.

Great ideas for optimizing the handler. I at least can se no reason why anybody would want such a handler to be case insensitive. I haven't timed it, but I wager it is *alot* faster. smile

Offline

 

#25 2015-06-03 08:57:35 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2724
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I at least can se no reason why anybody would want such a handler to be case insensitive.


Spelling corrections, where it is used most, is a mix of case sensitive and case insensitive. Also my finished project had to be case insensitive. However, when you still want it case insensitive you can convert both strings to the same case (lowercase) before calling this function.

Offline

 

Board footer

Powered by FluxBB

RSS (new topics) RSS (active topics)