I’d like to have a string similarity comparison in a script I’m running that pulls data from an Excel document and then matches that data with a text list.
The issue I’m having is that a lot of items in one data set do not match their supposed matches in the second data set; for example, “Pete’s Pizza Ltd.” and “Pete’s Pizza Inc.” aren’t matching as a string (nor should they, of course) but I’d like a way to calculate the similarity of each item if a match isn’t found and prompt the user to select the highest percentage match.
I wrote a script that does that; it counts each character in each string then compares the number of incidences of every letter and produces a match percentage. In the above example, it returns a match percentage of 86%; if I try something like “Tom’s Pizza Ltd.” it only returns 73%.
The code I wrote ignores any non-alphabetic data, so it skips punctuation, accented characters, etc. However, I think the code I wrote is probably rather poor, so I’m wondering if anyone knows of a better way to compare string similarity (perhaps something built in?). I’ve searched and haven’t found any AppleScript examples, but I have found a few Java ones.
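For anyone following along, the letter-counting idea described above could be sketched roughly as below. This is not the original script, which wasn't posted. The handler names (letterCounts, letterSimilarity) and the scoring formula (twice the shared letter count divided by the total letter count) are assumptions made for illustration, so the percentages it produces won't necessarily agree with the 86% and 73% quoted above.

```applescript
-- Hypothetical handler: count occurrences of each unaccented letter A-Z, ignoring case.
on letterCounts(theText)
	set counts to {}
	repeat 26 times
		set end of counts to 0
	end repeat
	-- "id of" gives one integer for a single character and a list otherwise, so coerce to a list.
	repeat with codePoint in ((id of theText) as list)
		set n to contents of codePoint
		if n ≥ 97 and n ≤ 122 then set n to n - 32 -- fold a-z onto A-Z
		if n ≥ 65 and n ≤ 90 then
			set idx to n - 64
			set item idx of counts to (item idx of counts) + 1
		end if
	end repeat
	return counts
end letterCounts

-- Hypothetical handler: whole-number percentage based on the letters both strings share.
on letterSimilarity(textA, textB)
	set countsA to letterCounts(textA)
	set countsB to letterCounts(textB)
	set sharedCount to 0
	set totalCount to 0
	repeat with i from 1 to 26
		set countA to item i of countsA
		set countB to item i of countsB
		-- The shared amount for a letter is the smaller of its two counts.
		if countA < countB then
			set sharedCount to sharedCount + countA
		else
			set sharedCount to sharedCount + countB
		end if
		set totalCount to totalCount + countA + countB
	end repeat
	if totalCount = 0 then return 0
	-- Assumed formula: 2 * shared / total, rounded to the nearest whole percent.
	return (200 * sharedCount + totalCount div 2) div totalCount
end letterSimilarity

letterSimilarity("Pete's Pizza Ltd.", "Pete's Pizza Inc.")
```

Because it only counts letters, a measure like this ignores letter order entirely, so anagrams score 100%. That is one reason an order-aware comparison may suit company names better.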
It might be possible to identify matches using regex, but it would depend on the form of the data sets and on how the matches might be expected to differ. For example, using ASObjC:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
-- Say you've got this line from an NSString version of the first data set.
set inputDatum to current application's class "NSString"'s stringWithString:("Pete's Pizza Ltd.")
-- And this is an NSString of the second data set.
set secondDataSet to current application's class "NSString"'s stringWithString:("Pete's Pizza House
Pete's Pizza Inc.
Tom's Pizza Ltd.
Fred's Pizzaria
Pizza the Action")
-- Get a copy of the line without "Ltd." or "Inc." at the end.
set rootName to inputDatum's stringByReplacingOccurrencesOfString:(" (?:Ltd|Inc)\\.?$") withString:("") options:(current application's NSRegularExpressionSearch) range:({0, inputDatum's |length|()})
-- Use it to make a regex pattern which looks for a line consisting of that name and possibly either of those endings!
set hedgeBetRegex to current application's class "NSString"'s stringWithFormat_("(?mi)^%@(?: (?:Ltd|Inc)\\.?)?$", rootName)
-- Search for a match in the second data set.
set matchRange to secondDataSet's rangeOfString:(hedgeBetRegex) options:(current application's NSRegularExpressionSearch) range:({0, secondDataSet's |length|()})
-- If there is one, get the matched text. Otherwise "".
if ((matchRange's |length|) > 0) then
	set match to (secondDataSet's substringWithRange:(matchRange)) as text
else
	set match to ""
end if
Or the same thing with the Satimage OSAX:
-- Say you've got this line from the text of the first data set.
set inputDatum to "Pete's Pizza Ltd."
-- And this is the text of the second data set.
set secondDataSet to "Pete's Pizza House
Pete's Pizza Inc.
Tom's Pizza Ltd.
Fred's Pizzaria
Pizza the Action"
-- Get a copy of the line without "Ltd." or "Inc." at the end.
set rootName to (change " (?:Ltd|Inc)\\.?$" into "" in inputDatum with regexp)
-- Use it to make a regex pattern which looks for a line consisting of that name and possibly either of those endings!
set hedgeBetRegex to "(?mi)^" & rootName & "(?: (?:Ltd|Inc)\\.?)?$"
-- Look for a match in the second data set. If there is one, get the matched text. Otherwise "".
try
	set match to (find text hedgeBetRegex in secondDataSet with regexp and string result)
on error
	set match to ""
end try
Assuming the differences are usually at the end of the name, something like this might be useful:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
set theString to "Pete's Pizza Inc."
set secondDataSet to "Pete's Pizza House
Pete's Pizza Inc.
Tom's Pizza Ltd.
Fred's Pizzaria
Pizza the Action"
set theString to current application's NSString's stringWithString:theString
set longestMatch to ""
set matchCount to 0
repeat with aPar in paragraphs of secondDataSet
	set commPrefix to (theString's commonPrefixWithString:aPar options:(current application's NSCaseInsensitiveSearch)) as text
	if length of commPrefix > matchCount then
		set matchCount to length of commPrefix
		set longestMatch to contents of aPar
	end if
end repeat
return longestMatch
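Since the original question asked for a percentage to show the user, the common-prefix idea could be extended to score every candidate rather than just returning the longest match. The following is only a sketch under assumptions: prefixScore is a made-up handler name, and dividing the prefix length by the length of the longer string is just one of several reasonable ways to turn the prefix into a percentage.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical handler: how much of the longer string the shared prefix covers, as a whole-number percentage.
on prefixScore(textA, textB)
	set stringA to current application's NSString's stringWithString:textA
	set commPrefix to (stringA's commonPrefixWithString:textB options:(current application's NSCaseInsensitiveSearch)) as text
	set longest to length of textA
	if (length of textB) > longest then set longest to length of textB
	if longest = 0 then return 0
	return ((length of commPrefix) * 100) div longest
end prefixScore

set theString to "Pete's Pizza Ltd."
set secondDataSet to "Pete's Pizza House
Pete's Pizza Inc.
Tom's Pizza Ltd.
Fred's Pizzaria
Pizza the Action"

-- Build a list of {candidate, score} pairs for every line of the second data set.
set report to {}
repeat with aPar in paragraphs of secondDataSet
	set end of report to {contents of aPar, prefixScore(theString, contents of aPar)}
end repeat
return report
```

The pairs in report could then be sorted or passed to something like choose from list so the user can pick among the top-scoring candidates. Like the script above, this only helps when the differences are at the end of the name.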