Find Paragraph with Matching String without Repeat

Jeffkr · September 7, 2022, 4:53pm

Hello all,
I am trying to figure out a way to find the paragraph number of a text file that contains a matching string.
The script below works well, but only for smaller text files. Does anybody know how to find a matching string in a very large text file without using a repeat loop?

Thanks,
-Jeff

set myfile to ((path to desktop as text) & "Test.txt")
set theText to (read file myfile)
set myID to "100658"

repeat with i from 1 to count of paragraphs in theText
	if paragraph i of theText contains myID then
		set myCount to i -- the paragraph number (count of paragraph i would give its character count)
	end if
end repeat

display dialog myCount as string

robertfern · September 7, 2022, 5:00pm

Enjoy

set myfile to ((path to desktop as text) & "Test.txt")
set theText to (read file myfile)
set myID to "100658"

set myCount to count (paragraphs of (text 1 thru (offset of myID in theText) of theText))

display dialog myCount as string

Jeffkr · September 7, 2022, 5:38pm

Unbelievable! Thank you so very much Robert!

wch1zpink · September 8, 2022, 5:52pm

The solution posted by @robertfern is perfectly valid, but if the text file you are searching contains multiple paragraphs with “100658”, it will return only the paragraph number of the first instance of your search term.

The following code will return the paragraph number of every instance of your search term.

set myFile to quoted form of POSIX path of (path to desktop as text) & "Test.txt"
set myID to quoted form of "100658"

set paragraphNumber to do shell script "grep -no " & myID & ¬
	" " & myFile & " |sed -E -e 's/" & ":" & myID & "//g'"

set text item delimiters to ", "
display dialog words of paragraphNumber as text
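
For anyone curious what the shell pipeline in the script above produces before AppleScript reformats it, here is a minimal sketch run directly in a shell. The sample file and its contents are made up for illustration; the grep and sed options are the ones used in the script.

```shell
# Hypothetical sample file standing in for Test.txt.
sample=$(mktemp)
printf '%s\n' 'alpha' 'id 100658' 'beta' 'gamma 100658 delta' > "$sample"

# Stage 1: -n prefixes each match with its line number, -o prints only the matched text.
raw=$(grep -no '100658' "$sample")

# Stage 2: sed removes the ":100658" suffix, leaving one line number per match.
numbers=$(printf '%s\n' "$raw" | sed -E -e 's/:100658//g')

printf 'grep output:\n%s\n' "$raw"
printf 'after sed:\n%s\n' "$numbers"
rm -f "$sample"
```

The AppleScript side then only has to turn the newline-separated numbers into a comma-separated string for display.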

robertfern · September 8, 2022, 9:02pm

Not bad.

Here is a script that will find all instances, but it uses a repeat loop:

set myfile to ((path to desktop as text) & "Test.txt")
set theText to (read file myfile)
set myID to "100658"

set text item delimiters to myID
set textList to text items of theText
set {mycount, myCounts} to {0, {}}
repeat with i from 1 to (count textList) - 1
	set mycount to mycount + (count (paragraphs of (item i of textList)))
	set end of myCounts to mycount
	set mycount to mycount - 1
end repeat
--set mycount to count (paragraphs of (text 1 thru (offset of myID in theText) of theText))
set text item delimiters to ", "
display dialog myCounts as text

peavine · September 9, 2022, 4:35pm

Robert. Your script is ingenious but I wondered how it would fare with a large string. So, I ran a test.

My test string contained 4096 paragraphs and every fourth paragraph matched the substring (i.e. myID). I tested this against:

set myID to "xx"
set textList to paragraphs of theText
set mycount to {}
repeat with i from 1 to (count textList)
	if item i of textList contains myID then set end of mycount to i
end repeat

My script took 408 milliseconds and your script took 40 milliseconds. Very nice.

robertfern · September 9, 2022, 5:06pm

Shocking, I know.

There have been many times in this forum where someone would give a sample script using “do shell script”. There are a lot of places where this turns out to be the fastest, but there is so much overhead in going out to the shell that a native AppleScript solution often ends up being faster.

Sometimes it’s even faster than ASObjC solutions.

Never sure till you try tho.

wch1zpink · September 10, 2022, 5:49pm

I’m not quite sure why you guys are so against my “do shell script” solution. Call me crazy but not only does my solution use less code (making it easier to understand in my opinion), in this instance, the “do shell script” solution seems to be the quickest.

I ran the following test on a text file containing 4320 paragraphs with myID appearing in every fourth paragraph.

robertfern · September 11, 2022, 12:40am

Can I get a copy of your text file for me to test with?

Also, no one is against your version. I misread the previous post and assumed he meant the do shell script version.

It appears yours is the fastest. We just like testing.
Also, I’m of the opinion that all-AppleScript solutions are best, and I only use the shell when no other option exists. That’s just me. Plus it’s fun.

Also, here is a faster version of my script using script objects.

Speed test this for me with your text file.

script M
	property myCounts : missing value
	property textList : missing value
end script
set M's myCounts to {}
set myfile to ((path to desktop as text) & "Test.txt")
set theText to (read file myfile)
set myID to "100658"
set text item delimiters to myID
set M's textList to text items of theText
set mycount to 0
repeat with i from 1 to (count M's textList) - 1
	set mycount to mycount + (count (paragraphs of (item i of M's textList)))
	set end of M's myCounts to mycount
	set mycount to mycount - 1
end repeat
set text item delimiters to ", "
display dialog M's myCounts as text

Here are my timing results with a 4000 paragraph file.

Macmini8,1, macOS Version 11.6.8 (Build 20G730), 100 iterations
          First Run   Total Time   Average
First     0.028       2.794        0.028     ← Script with “Do Shell Script”
Second    0.017       1.822        0.018     ← My script
Ratio (excluding first run): 1.53:1

wch1zpink · September 11, 2022, 2:12am

4320 paragraphs with myID appearing in every fourth paragraph.

After removing the following last two lines from both scripts…

set text item delimiters to ", "
display dialog M's myCounts as text

These were the results. Very, very close.

MacBookPro15,1, macOS Version 12.5.1 (Build 21G83), 300 iterations
          First Run   Total Time   Average
First     0.012       3.310        0.011     ← AppleScript using “do shell script”
Second    0.012       3.705        0.012     ← AppleScript using script objects
Ratio (excluding first run): 1:1.12

peavine · September 11, 2022, 1:03pm

I edited my earlier script to use an implicit script object and I tested all the script suggestions.

SCRIPT - MILLISECONDS
Peavine one - 418
Robert one - 41
Peavine two - 11
wch1zpink - 9
Robert two - 4

My new script:

set theFile to ((path to desktop as text) & "Test.txt")
set theString to paragraphs of (read file theFile)
set theSubstring to "xx"
set theCount to {}

repeat with i from 1 to (count my theString)
	if item i of my theString contains theSubstring then set end of my theCount to i
end repeat
return theCount

The use of script objects to speed list access is undocumented, and some may instead prefer to use the “a reference to” operator, which is documented. It’s a bit slower at 26 milliseconds, though.

set theFile to ((path to desktop as text) & "Test.txt") as alias
set theString to paragraphs of (read theFile)
set theStringRef to a reference to theString
set theCount to {}
set theCountRef to a reference to theCount
set theSubstring to "xx"

repeat with i from 1 to (count theStringRef)
	if item i of theStringRef contains theSubstring then set end of theCountRef to i
end repeat
theCount

A discussion of the use of the “a reference to” operator to speed list access can be found under the index item “large lists” in the AppleScript Language Guide:

https://developer.apple.com/library/archive/documentation/AppleScript/Conceptual/AppleScriptLangGuide/reference/ASLR_classes.html#//apple_ref/doc/uid/TP40000983-CH1g-DontLinkElementID_588

KniazidisR · September 11, 2022, 4:05pm

I have already seen several similar cases where the use of AppleScript text item delimiters outperforms the shell method and even the ASObjC method in speed.

In this case, that is exactly what happens. The “poor use” of the repeat loop is offset by the enormous speed of AppleScript text item delimiters (in @robertfern’s solution).

But the shell version should still win on speed as the number of matches of the desired substring grows. Therefore, in my opinion, the solution from @wch1zpink is the best one in this case, as it is the most efficient and concise.

wch1zpink · September 12, 2022, 4:50pm

I figured you guys might find the following results interesting.

The text file I ran tests against contained only the following 4 lines, repeated over and over:

Here is a script that will find all instances but it uses a repeat loop.
My test string contained 4096 paragraphs and every fourth paragraph matched the substring.
Your script is ingenious but I wondered how it would fare with a large string.
I am trying to figure out a way to find the paragraph number of a text file that contains 100658.

Test.txt with 4400 paragraphs and 100658 in every 4th paragraph

MacBookPro15,1, macOS Version 12.5.1 (Build 21G83), 100 iterations
          First Run   Total Time   Average
First     0.014       1.350        0.014     ← AppleScript using “do shell script” - wch1zpink
Second    0.045       4.631        0.046     ← AppleScript using “script objects” - robertfern
Ratio (excluding first run): 1:3.43

Test.txt with 57148 paragraphs and 100658 in every 4th paragraph

MacBookPro15,1, macOS Version 12.5.1 (Build 21G83), 100 iterations
          First Run   Total Time   Average
First     0.119       8.933        0.089     ← AppleScript using “do shell script” - wch1zpink
Second    0.748       67.303       0.673     ← AppleScript using “script objects” - robertfern
Ratio (excluding first run): 1:7.53

It seems to me as if the larger the list, the more inefficient AppleScript using “script objects” becomes.

These were the two scripts I ran the tests on:

set myFile to quoted form of POSIX path of (path to desktop as text) & "Test.txt"
set myID to quoted form of "100658"

set paragraphNumber to do shell script "grep -no " & myID & ¬
	" " & myFile & " |sed -E -e 's/" & ":" & myID & "//g'"

set text item delimiters to ", "
log words of paragraphNumber as text

AND

script M
	property myCounts : missing value
	property textList : missing value
end script
set M's myCounts to {}
set myfile to ((path to desktop as text) & "Test.txt")
set theText to (read file myfile)
set myID to "100658"
set text item delimiters to myID
set M's textList to text items of theText
set mycount to 0
repeat with i from 1 to (count M's textList) - 1
	set mycount to mycount + (count (paragraphs of (item i of M's textList)))
	set end of M's myCounts to mycount
	set mycount to mycount - 1
end repeat
set text item delimiters to ", "
log M's myCounts as text

Mockman · September 13, 2022, 10:43am

Here is a minor variation on the shell method:

set myFile to quoted form of POSIX path of ((path to desktop as text) & "Test copy.txt")
set myID to quoted form of "100658"

set paragraphNumber to do shell script "grep -Fn " & myID & ¬
	" " & myFile & " | cut -d ':' -f1"
-- resulting Terminal command
-- "grep -Fn '100658' '/Users/<user>/Desktop/Test copy.txt' | cut -d ':' -f1"

display dialog paragraphNumber as text

For the grep command, I made a couple of tweaks. I added the -F option, which makes grep treat the pattern as a fixed string (the same behaviour as fgrep); this is ostensibly quicker as it only handles fixed (i.e. non regex) patterns. Next, I removed the -o option as it is extraneous to getting the line number.

Instead of sed, I used cut to split each result on the colon and keep only the line number, discarding the rest of the line. Cut is simpler and presumably lighter than sed so maybe it will improve the speed.

I also moved the source file name inside the quotes, in case it contains spaces.
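
To see the intermediate output of this variation, here is a minimal sketch run directly in a shell. The directory, file contents and variable names are made up for illustration; only the grep and cut options come from the script above. The file name deliberately contains a space.

```shell
# Hypothetical directory and file; the space in the name is deliberate.
dir=$(mktemp -d)
file="$dir/Test copy.txt"
printf '%s\n' 'alpha' 'id 100658' 'beta' 'gamma 100658 delta' > "$file"

# -F: fixed-string match; -n: prefix each matching line with its line number,
# so each output line looks like "2:id 100658".
matches=$(grep -Fn '100658' "$file")

# cut keeps only the first colon-separated field, which is the line number.
line_numbers=$(printf '%s\n' "$matches" | cut -d ':' -f1)

printf '%s\n' "$line_numbers"
rm -rf "$dir"
```

Quoting "$file" here plays the role that quoted form of plays on the AppleScript side: the whole path, space included, reaches grep as a single argument.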

wch1zpink · September 13, 2022, 9:11pm

Great job dude! Your tweaks and version blew mine out of the water!

Test.txt with 57148 paragraphs and 100658 in every 4th paragraph.

MacBookPro15,1, macOS Version 12.5.1 (Build 21G83), 500 iterations
          First Run   Total Time   Average
First     0.118       40.577       0.081     ← wch1zpink
Second    0.029       13.615       0.027     ← Mockman
Ratio (excluding first run): 2.98:1

Mockman · September 13, 2022, 10:03pm

That’s kind of neat but also a bit surprising. I wonder which change (fgrep or sed/cut) made the bigger difference.

Thanks for doing the timing for it.
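
One possible way to separate the two changes would be to run all four combinations of grep mode and post-processing against the same file. The sketch below only defines the four variants and prints their results so they can be compared; it makes no timing claims. The sample file, the function name variant and the variant labels are all made up for illustration.

```shell
# Hypothetical sample file; a realistic test would use the large Test.txt.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' 'alpha' 'id 100658' 'beta' 'gamma 100658 delta' > "$sample"

# $1 selects the variant; each prints one line number per matching line.
variant() {
  case "$1" in
    regex_sed) grep -no '100658' "$sample" | sed -E -e 's/:100658//g' ;;
    fixed_sed) grep -Fno '100658' "$sample" | sed -E -e 's/:100658//g' ;;
    regex_cut) grep -n '100658' "$sample" | cut -d ':' -f1 ;;
    fixed_cut) grep -Fn '100658' "$sample" | cut -d ':' -f1 ;;
  esac
}

# Print each variant's result on one line so they are easy to compare.
for name in regex_sed fixed_sed regex_cut fixed_cut; do
  printf '%s: %s\n' "$name" "$(variant "$name" | tr '\n' ' ')"
done
```

Each variant call could then be wrapped in a loop and timed (for example with the time keyword in bash or zsh) against the 57148-paragraph file to see whether -F or the switch from sed to cut accounts for more of the difference.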