Searching into list quickly

Hey all,

I have a long list of names (>3000).
I want to search if a name is into this list (the true is that the script is a bit more complicated than this; see here https://macscripter.net/viewtopic.php?id=46174).

But anyway…
I am using brute force approach, meaning repeat with every item in the list loop.
(Thanks to DJ Bazzie Wazzie, my script is a bit faster than before … but still too slow)

Questions 1): if I sort the list alphabetically, can I take advantage of this? How?

Questions 2): there is an algorithm that search into half of the list, if your searched item is not there, then split the rest in 2 … and so forth. Can I take advance of this? How?

Any example of both approaches is much appreciated.

Thanks
L.

I think you misunderstood my code in the previous code. There is simply no faster approach than in the previous post and a 100.000 long list is still looked up in < 0.01 seconds. The whole idea was to eliminate the nested loop and getting rid of it.

But to give an correct answer to this topic. You can lookup values in an AppleScript list by using is in comparison:

"tom" is in {"John", "Bill", "Ottoman", "Tom"}

I did used your idea to remove the nested loop (see script below). Thanks a lot for that!
But is is still slow for a >3000 items.
So i was thinking to combining it with some other trick, as I mention in my previous post

Using the ““tom” is in {“John”, “Bill”, “Ottoman”, “Tom”}”, as you are suggesting, it won’t give me the position within the list where the item is located, which I need to retrieve the the number of occurrence of that word.

property p_the_path : missing value
property p_newFileName : missing value
property p_theList2Flat : ""

set theSourceList to (choose file with prompt "Select a file to import the List1:")
open for access theSourceList
set fileContents to (read theSourceList)
close access theSourceList
set theList1 to every paragraph of fileContents

set theSourceList2 to (choose file with prompt "Select a file to import the List2:")
open for access theSourceList2
set fileContents2 to (read theSourceList2)
close access theSourceList2
set theList2 to every paragraph of fileContents2
set p_theList2Flat to (fileContents2) as string -- we cannot use a List, we need a flat text

# let's prepare the name for saving the DeltaList into a file 
tell (info for theSourceList) to set {Nm, Ex} to {name, name extension}
set OnlyName to text 1 thru ((get offset of "." & Ex in Nm) - 1) of Nm

tell (info for theSourceList2) to set {Nm2, Ex2} to {name, name extension}
set OnlyName2 to text 1 thru ((get offset of "." & Ex2 in Nm2) - 1) of Nm2
set newFileName to Nm & "_" & Nm2
set p_newFileName to newFileName

set deltaList to {}
-- set the_totalCountOfList1 to 0 -- we need this to calculte total number of words in List1
set the_totalCountOfList1 to WordsInDeltaList(theList1)
set the_totalCountOfList2 to WordsInDeltaList(theList2)
set the_totalCountOfDeltaList to 0 -- we need this to calculte total number of words in DeltaList

repeat with a from length of theList1 to 1 by -1 -- removing partial duplicate
	set theCurrentListItem to item a of theList1
	set the_offset to offset of " " in theCurrentListItem -- we look for the " " (space) char in each line 
	set the_Word to text 1 thru (the_offset - 2) of theCurrentListItem
	set the_Count to text (the_offset + 1) thru -1 of theCurrentListItem
	--set the_totalCountOfList1 to the_totalCountOfList1 + the_Count -- we are counting the total number of words in order to calculate the similary score
	set the_Count2 to getValueFromTable(the_Word, p_theList2Flat) -- return the total count of the_Word in List 2
	
	set DeltaCount to (the_Count as integer) - (the_Count2 as integer)
	if DeltaCount < 0 then set DeltaCount to -DeltaCount -- we dont want negative values and the word was present in both lists, although more in List2
	if DeltaCount > 0 then
		set item a of theList1 to the_Word & ": " & DeltaCount -- we update the item in List1. The same item in List 2 was deleted in the getValueFromTable subruoutines
	end if
	if DeltaCount = 0 then -- we can delete both from lists. In List2 we do already in the getValueFromTable subruoutines
		set theList1 to remove_item_from_list(theList1, a)
	end if
end repeat

# let's clean theList2 from empty lines
set theList2New to (every paragraph of p_theList2Flat) -- this include empty lines
set theList2NewClean to {}
repeat with a_paragraph in theList2New
	set the_new_string to ""
	set my_string to a_paragraph as string
	-- check for blank lines and skip/discard
	if my_string is not "" then
		set end of theList2NewClean to my_string
	end if
end repeat

set deltaList to theList1 & theList2NewClean -- this NOT include empty lines
set the_totalCountOfDeltaList to WordsInDeltaList(deltaList)

# we need to save de DeltaFile
set the_path to (":Users:ldicroce:Desktop:Coverted_Files:" & p_newFileName)
set p_the_path to the_path
set deltaList to Export_list(deltaList, (length of deltaList))

if not (the_totalCountOfDeltaList = 0) then
	set HowSimilar to (100 - ((the_totalCountOfDeltaList * 100) / (the_totalCountOfList1 + the_totalCountOfList2)))
	display dialog "This files are " & (round_to_decimal_places(HowSimilar, 2) as string) & "% similar."
else -- are identical
	-- do something here
end if

on getValueFromTable(theName, theTable) -- this is the slow part when it works on very similar files, since it enter the internal loop
	set text item delimiters to ""
	set theTable to return & theTable
	set theName to return & theName & ":"
	tell AppleScript
		set oldTIDs to text item delimiters
		set text item delimiters to theName
		set TIs to text items of theTable -- if theName is found we split theTable into 2 items
		set text item delimiters to oldTIDs
	end tell
	
	if (count of TIs) is 2 then -- we found theName only once!
		tell AppleScript -- we use this to remove this line from this list2
			set oldTIDs to text item delimiters
			set text item delimiters to (theName & " " & (word 1 of item 2 of TIs))
			set TI2s to text items of theTable
			set text item delimiters to oldTIDs
		end tell
		try
			set p_theList2Flat to ((lines 2 thru -1 of text item 1 of TI2s) as string) & ((text item 2 of TI2s) as string) -- we remove the line
			# we had to remove the extra empty line that we include each time, that's why we start from Paragraph 2
		on error
			set p_theList2Flat to ((text item 1 of TI2s) as string) & ((text item 2 of TI2s) as string) -- this is necessary if we are processing the last line
		end try
	return word 1 of item 2 of TIs -- return the number of times theName appear in this List
	end if
	if (count of TIs) is 1 then
		-- name is not in table return empty value 
		return 0
	else if (count of TIs) > 2 then -- if this is the case we have splitted in more than 2
		-- name is not unique in table, return value that indicates that there is something wrong
		return false
	end if
end getValueFromTable


on Export_list(the_list, NumItems)
	-- set myExportedList to open for access (choose file name) with write permission
	set myExportedList to open for access file p_the_path with write permission
	
	repeat with itemNum from 1 to NumItems
		write ((item itemNum of the_list as string) & return) to myExportedList 
	end repeat
	close access myExportedList
end Export_list

on remove_item_from_list(the_list, index_to_remove)
	if index_to_remove = 1 then
		set new_list to rest of the_list
	else if index_to_remove = length of the_list then
		set new_list to (items 1 thru -2 of the_list)
	else
		set new_list to (items 1 thru (index_to_remove - 1) of the_list) & (items (index_to_remove + 1) thru -1 of the_list)
	end if
	return new_list
end remove_item_from_list

on WordsInDeltaList(theList)
	set sumCount to 0
	repeat with a from length of theList to 1 by -1 -- removing partial duplicate
		set theCurrentListItem to item a of theList
		-- Process the current list item
		set the_offset to offset of " " in theCurrentListItem -- we look for the " " (space) char in each line 
		-- set the_Word to text 1 thru the_offset of theCurrentListItem
		set the_Count to text (the_offset + 1) thru -1 of theCurrentListItem
		set sumCount to sumCount + (the_Count as integer)
	end repeat
	return sumCount
end WordsInDeltaList

on round_to_decimal_places(the_number_to_round, the_decimal_precision)
	set multiplier to 10 ^ the_decimal_precision
	the_number_to_round * multiplier
	round result
	return result / multiplier
end round_to_decimal_places

Looking at the code I see some improvements that can be made.

Offset uses an external AppleEvents which is relatively time consuming so an custom handler using text item delimiters is actually faster.

on customOffset(str, subStr)
	if subStr is not in str then return 0
	tell AppleScript
		set oldTIDs to text item delimiters
		set text item delimiters to subStr
		set distance to count text item 1 of str
		set text item delimiters to oldTIDs
	end tell
	return distance + 1
end customOffset

But I also noticed that your doing the same loop for different tasks like counting, matching and cleaning empty lines. Can’t this be done in one single go to make things more efficient and eventually faster?

But if you prefer to make it work this way you can use AppleScript Toolbox (AST) to manage lists easier. It has the AST copy list command where you can instruct which items you want to keep or omit by index. So you keep track of all the items that needs to be removed and remove the items in a single command:

set theList to {"1", "2", "3", "4", "5", "6", "7", "8", "9"}
AST copy list theList omitting item {6, 3, 5}
-- Result: {"1", "2", "4", "7", "8", "9"}

The same is for removing empty lines which can be done in a single command using AST:

set theLines to {"John: 1", "Bill: 2", "", "Mark: 4", "David: 5", "", "Peter: 7", "Kyle: 8", "George: 9"}
AST copy list theLines matching regex ".+"
-- Result: {"John: 1", "Bill: 2", "Mark: 4", "David: 5", "Peter: 7", "Kyle: 8", "George: 9"}

Summarizing all values can be done on file contents instead of line by line using AST as well:

set theText to "John: 1
Bill: 2
Mark: 4
David: 5
Peter: 7
Kyle: 8
George: 9"

set theValues to AST find regex ": ([0-9]+)$" in string theText regex group 2 with newline sensitivity
set theTotal to run script join(theValues, "+")
-- Result: 36

on join(lst, del)
	tell AppleScript
		set oldTIDs to text item delimiters
		set text item delimiters to del
		set str to lst as string
		set text item delimiters to oldTIDs
	end tell
	return str
end join

I’ll hope this gets you started

I’m not sure of your example’s logic in the original post.

Source:
Bill: 4 Bill: 4
Coen: 2 Coen:1
Mark: 1 Mark: 3
Steve:4 Stefan: 2
Tom: 3 Tom: 3

Yield:
Coen: 1
Mark: -2
Stefan: 2
Steve: 4

How is Mark’s number -2? Should Coen 1 and 2 both be listed, since they have different values? Perhaps the code is as simple as combining two plain text files (through concatenation) and using Uniq against them.

Source:
Bill: 4
Bill: 4
Coen: 2
Coen:1
Mark: 1
Mark: 3
Steve:4
Stefan: 2
Tom: 3
Tom: 3

do shell script "uniq -u " & (choose file)'s POSIX path's quoted form

Yield:
Coen: 2
Coen:1
Mark: 1
Mark: 3
Steve:4
Stefan: 2

Hey DJ Bazzie Wazzie,

Thanks a lot once again. I will definitely will use your “on customOffset(str, subStr)” subroutines.
And the AST seems very interesting. Is there any documentations/Guide for the different AST commands? I checked your website, but I couldn’t find any.

Ciao.
L.

Hey Marc Anthony,

basically I have 2 PDFs, which I convert into text, then I count how many times each word is present in each PDF (This is basically the list of words I mentioned in my post).

Then I want to “measure” how similar are the 2 original PDFs by counting their differences in their word counts.
If the 2 PDFs are the same the 2 lists of words are identical, and the final list should be empty.
But if they are similar I want to know which words they differs by calculating the difference in the counts.

Tour approach is differs than then one I am developing, but also very interesting and most likely much faster.
Considering that I have 2 files (ListWordCounts_1.txt and ListWordCounts_2.txt) can I use “do shell script” to grab the differences (as I mentioned above)? if yes, how?
Thanks a lot !
L.

It’s probably going to be quicker to do the counting at the same time. Something like this:

use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "AppKit"
use framework "Quartz"

set thePath1 to "/path/to/first file.pdf"
set thePath2 to "/path/to/second file.pdf"
set {countedSet1a, countedSet1b} to my countedSetFromPDFAt:thePath1 minWordLength:4 -- change word length to suit
set {countedSet2a, countedSet2b} to my countedSetFromPDFAt:thePath2 minWordLength:4
-- subtract words in second doc from first
countedSet1a's minusSet:countedSet2b
-- subtract words in first doc from second
countedSet2a's minusSet:countedSet1b
set theList to {} -- holds result
-- loop through what's left in the doc 1 list, which will be words that appear more times in doc 1 than doc 2
set theObjects to countedSet1a's allObjects()
repeat with anObject in theObjects
	set end of theList to (anObject as text) & ": " & (countedSet1a's countForObject:anObject)
end repeat
-- loop through what's left in the doc 2 list, which will be words that appear more times in doc 2 than doc 1
set theObjects to countedSet2a's allObjects()
repeat with anObject in theObjects
	set end of theList to (anObject as text) & ": -" & (countedSet2a's countForObject:anObject)
end repeat
-- convert list to text
set saveTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to {linefeed}
set theEntries to theList as text
set AppleScript's text item delimiters to saveTID
return theEntries

on countedSetFromPDFAt:thePath minWordLength:minLength
	-- get text of PDF
	set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
	set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
	set theText to theDoc's |string|()
	-- break into words
	set theText to theText's stringByReplacingOccurrencesOfString:"\\W+" withString:space options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
	set theWords to theText's componentsSeparatedByString:space
	-- remove words too small
	set thePred to current application's NSPredicate's predicateWithFormat:"length >= %@" argumentArray:{minLength}
	set theWords to theWords's filteredArrayUsingPredicate:thePred
	-- make two identical counted sets; these store words plus the number of times they appear
	set countedSet1 to current application's NSCountedSet's setWithArray:theWords
	set countedSet2 to current application's NSCountedSet's setWithArray:theWords
	return {countedSet1, countedSet2}
end countedSetFromPDFAt:minWordLength:

If you download AST you can read the dictionary by opening the osax with AppleScript Editor.

This is absolutely mind-blowing!

I tried to understand your script (at least part of it), and it seems to me that the final result (return theEntries) is a list of unique words (with -1 or +1 depending if present in PDF_1 or in PDF_2
So we loose those words that are “differentially” present.
Is that correct ? there is any way to extract this “differentially” information as well?
Those are the examples I have used:

PDF1 PDF2 Expected Obtained

Babbb: 1 Babbb: 3 Barnab: 2 (or -2) Salzz: 1
Barnab: 1 Barnab: 3 Bennn: 1 Chris: 1
Bennn: 4 Bennn: 3 Chris: 5 Salzza: -1
Chris: 5 David: 1 Marc: 4 Marc: -1
David: 1 Marc: 4
Salzz: 2 Salzza: 2

Thanks !

No. You’re supposed to run it with the original PDFs where the words actually appear multiple times, not on summaries of the count. As I said, let it do the counting.

Amazing!!!
I only added one more line in the subroutine
set p_WordCounted to the number of items of theWords
to keep count of the total number of words which I need to estimate the similarity between the PDF files.
Anyway this leave me with a double feeling.
Very happy to have solved the problem, but depressed realising how little I know of programming.
Thanks a lot

I wouldn’t measure programming skills based on performance in AppleScript. It’s full of quirks and hacks which requires knowledge that has nothing to do with programming itself. While there is some redundancy, your first code is actually the best when you look at the paradigm of the programming language, non of our examples are better. So don’t be too hard on yourself :slight_smile:

Thanks for the support !!