Just for the record, that behaviour depends on the compiler (and its settings). Statics can be placed in the data segment (initialized), the bss segment (uninitialized), or the code segment of the executable. Also, it would make no difference whether or not you used statics in this situation, unless you made it recursive. So my reference to stack memory was meant more generally: a fixed memory space that is not heap memory and is bound to the same limitations.
I don’t worry about bandwidth at all :); I think you misunderstood me. My point was that with Nigel’s approach you would use less memory even if you gained no performance. The reason I said that explicitly is that the topic reads (see the timings) as though the best implementation is simply the fastest one, without regard to the quality of the code itself. So without timing any of the solutions, I liked Nigel’s script the most, even if it had been slower (which it isn’t).
I agree that Nigel’s second version is also the most readable, so that is a good reason for me to like it too. After all, performance is just a currency that we use to “buy” security and functionality; it is important, but not really the most important property of a routine. Unless, of course, for one particular routine, it is.
Make ‘charList1’ and ‘charList2’ the ‘id’ of the respective strings rather than their ‘characters’. This will make the character comparisons sensitive to everything (case, diacriticals, and so on) and speed them up, because they’ll be integer rather than text comparisons.
Test the strings (or the ‘id’ lists) for equality before proceeding with the rest of the process. This will greatly speed up the return of 0 for equal strings without noticeably slowing the treatment of unequal strings.
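For reference, the ‘id’ of a string is its list of Unicode code points, which is why comparing the lists is sensitive to everything regardless of any considering/ignoring attributes in force. (One quirk to know about: the ‘id’ of a one-character string is a bare integer rather than a one-item list.)

id of "Abc" --> {65, 98, 99}
string id {65, 98, 99} --> "Abc"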
set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
-- Same as string1 but with typos (a deletion, an insertion, an alteration and a transposition).
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."
set edits to levenshteinDistance(string1, string2)
on levenshteinDistance(s1, s2)
    set xLen to (count s1)
    set yLen to (count s2)
    script o
        property charList1 : id of s1 -- For everything-sensitivity …
        property charList2 : id of s2 -- … and speed.
        property previousRow : missing value
        property currentRow : missing value
    end script
    -- Return 0 straight away if the two strings are equal.
    if (o's charList1 = o's charList2) then return 0
    -- Otherwise initialise two "row" lists as the first row of a notional matrix.
    set o's previousRow to {0} & o's charList1
    repeat with x from 1 to xLen
        set item (x + 1) of o's previousRow to x
    end repeat
    set o's currentRow to o's previousRow's items
    -- Handle the remaining rows in a rolling manner, the two lists alternating as previous and current rows.
    repeat with y from 1 to yLen
        set item 1 of o's currentRow to y
        repeat with x from 1 to xLen
            set deletion to (item x of o's currentRow) + 1
            if (item x of o's charList1 is item y of o's charList2) then
                set alternate to (item x of o's previousRow)
            else
                set alternate to (item x of o's previousRow) + 1
            end if
            set min to (item (x + 1) of o's previousRow) + 1
            if (deletion < min) then set min to deletion
            if (alternate < min) then set min to alternate
            set item (x + 1) of o's currentRow to min
        end repeat
        -- Swap the two row lists for the next iteration.
        tell o's previousRow
            set o's previousRow to o's currentRow
            set o's currentRow to it
        end tell
    end repeat
    return end of o's previousRow
end levenshteinDistance
Great ideas for optimizing the handler. I at least can see no reason why anybody would want such a handler to be case insensitive. I haven’t timed it, but I wager it is a lot faster.
Spelling correction, where this is used most, is a mix of case-sensitive and case-insensitive matching, and my finished project had to be case insensitive. When you do want it case insensitive, you can convert both strings to the same case (lowercase) before calling this handler.
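A minimal sketch of that pre-conversion; ‘toLower’ is my own hypothetical helper, and the shell round-trip is just one convenient way to do it (tr handles ASCII fine but is locale-dependent beyond that):

on toLower(s)
    -- Lowercase via the shell; adequate for ASCII word lists.
    return do shell script "echo " & quoted form of s & " | tr '[:upper:]' '[:lower:]'"
end toLower

set edits to levenshteinDistance(toLower(string1), toLower(string2))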
Here is another idea, regarding spelling distances of filenames. The idea is that a user types in a filename, or something else that is word sized, and you can iterate over the contents of the items’ container and return the item with the least distance (a usage sketch follows the handler below).
on spdist for s1 against s2
    -- Returns, coarsely, the difference between two strings:
    -- 0 if the strings are identical
    -- 1 if two chars are transposed, or one is different
    -- 2 if one char is added or deleted
    -- 3 otherwise.
    -- It is really meant to solve problems with filenames that are
    -- misspelled by one character.
    -- Ported from "The Unix Programming Environment", p. 211,
    -- by B. W. Kernighan and R. Pike.
    script o
        property l : s1's id
        property m : s2's id
    end script
    set l1 to length of s1
    set l2 to length of s2
    if l1 = l2 then
        set diffs to 0
        set ante to 0
        repeat with i from 1 to l1
            if item i of o's l ≠ item i of o's m then
                if diffs = 0 then
                    if ante = 0 then -- First mismatch.
                        set ante to 1 -- One spelling error so far.
                    else if ante = 1 then
                        if item i of o's l = item (i - 1) of o's m and item (i - 1) of o's l = item i of o's m then
                            set diffs to 1 -- One full transposition.
                        else
                            set diffs to 3 -- Two independent spelling errors.
                        end if
                        set ante to -1
                    end if
                else
                    set diffs to 3
                end if
            end if
        end repeat
        if diffs = 0 and ante = 1 then set diffs to 1
    else
        if l1 > l2 then
            if (l1 - l2) < 2 then
                set diffs to 2
            else
                set diffs to 3
            end if
        else
            if (l2 - l1) < 2 then
                set diffs to 2
            else
                set diffs to 3
            end if
        end if
    end if
    return diffs
end spdist
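A usage sketch for the filename idea above (my own, not part of the original post; the typed name and the candidate list are made up, the latter standing in for names gathered from a folder beforehand):

set typedName to "bakcup" -- The user's misspelling.
set candidates to {"backup", "bin", "readme"} -- e.g. the names of a folder's items.
set bestName to missing value
set bestScore to 4 -- One worse than spdist's maximum of 3.
repeat with aName in candidates
    set theScore to (spdist for typedName against (contents of aName))
    if (theScore < bestScore) then
        set bestScore to theScore
        set bestName to contents of aName
    end if
end repeat
bestName --> "backup" (one transposition away, so spdist is 1)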
Edit
So I mulled over it a little, and I changed the code to work similarly to the description, almost: a “transposition” may not be an actual transposition, and a single misspelled character also counts as a transposition. So in the worst case, two independently misspelled characters count as one transposition.
Edit++
It has been changed slightly so that it now distinguishes between a “true” transposition and different characters at two different places.
I just found out that Levenshtein distances may also be used for computing queries. I’ll just provide a link for those interested: Levenshtein Automata (2010) | Hacker News. (You reach the original article by clicking the link at the top of that page.)
Just mentioning it: I use the fact that something like this has been used by Google many times, whenever I wonder what an English word means but haven’t got the spelling right. The last time was earlier today, when I wondered what “itinary” meant. I tried Spotlight first, but the spelling was wrong; using Google, however, gave me back results for “itinerary” immediately.
Edit
Building a spell checker must be a very interesting project with regard to the trade-offs of the data structures involved. I see that nearly every one of them uses the Levenshtein distance.
They are, but they differ a lot in how they are implemented. Almost no spelling checker actually processes entire words; the Levenshtein step is one of the last steps of the check, and it is applied only to the pieces of the words that differ. For instance, when matching the typo “backkup” against “backup”, no Levenshtein match is required when the spelling dictionary is a syllable-based B-tree. It is when a typo like “backop” is made that Levenshtein is used to match “op” against “up” (and against all other possible remaining syllables).
While the algorithm is quite heavy to process, implementations arrange things so that Levenshtein is applied only to a small piece of the word, against a dictionary already narrowed down during preprocessing. This makes it possible to do approximate matching and give suggestions on the fly, as sketched below.
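To illustrate the “only the differing piece” part (this sketch is mine, not from any of the projects mentioned; the handler name and all details are invented): strip the common prefix and suffix of the two words, then run the dynamic programming only over what remains, e.g. “op” against “up” for “backop”/“backup”.

on trimmedDistance(w1, w2)
    -- Assumes non-empty words.
    script o
        -- '{} &' keeps the id of a one-character word as a list.
        property a : {} & (w1's id)
        property b : {} & (w2's id)
        property prev : missing value
        property curr : missing value
    end script
    set n1 to count o's a
    set n2 to count o's b
    -- Strip the common prefix.
    set p to 0
    repeat while (p < n1) and (p < n2) and (item (p + 1) of o's a = item (p + 1) of o's b)
        set p to p + 1
    end repeat
    -- Strip the common suffix, without crossing the prefix.
    set q to 0
    repeat while (q < n1 - p) and (q < n2 - p) and (item (n1 - q) of o's a = item (n2 - q) of o's b)
        set q to q + 1
    end repeat
    -- Run the usual rolling-row DP over the differing cores only.
    set xLen to n1 - p - q
    set yLen to n2 - p - q
    set o's prev to {}
    repeat with x from 0 to xLen
        set end of o's prev to x
    end repeat
    repeat with y from 1 to yLen
        set o's curr to {y}
        repeat with x from 1 to xLen
            set best to item x of o's prev -- Match / substitution.
            if (item (p + x) of o's a ≠ item (p + y) of o's b) then set best to best + 1
            set d to (item (x + 1) of o's prev) + 1 -- Deletion.
            if (d < best) then set best to d
            set i to (item x of o's curr) + 1 -- Insertion.
            if (i < best) then set best to i
            set end of o's curr to best
        end repeat
        set o's prev to o's curr
    end repeat
    return end of o's prev
end trimmedDistance

trimmedDistance("backop", "backup") --> 1
trimmedDistance("backkup", "backup") --> 1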
Since you mentioned Google: it uses a different algorithm than a plain spelling checker does. Because Google works from statistics, the Levenshtein results can be indexed, so nothing needs to be recomputed each time you type a typo; today’s suggestion can be something else tomorrow for the same search string.
Yes, I am aware of Google’s weighting of search words; I recently saw a layout of the search engine, by the way, and it seemed pretty much like a monster to me. But I guess they did use something like Levenshtein back in the days when they didn’t have statistics for everything.
Spell checking is an interesting subject; one theme is autocompletion, which interests me. There are several open source projects that can be scavenged for that purpose, such as aspell, ispell, and xcalibur, for spellchecking of documents.
I’m pretty sure your implementation will be an efficient one, and good luck!