Append paragraph numbers to text

I need to prepend sequential numbers to every paragraph of some text. One solution is to use basic AppleScript made faster with a script object:

set theFile to (choose file of type "txt")
set numberedList to getNumberedList(theFile)

on getNumberedList(theFile)
	set theText to paragraphs of (read theFile)
	script o
		property theParagraphs : theText
		property numberedParagraphs : {}
	end script
	repeat with i from 1 to (count o's theParagraphs)
		set end of o's numberedParagraphs to (i as text) & ". " & (item i of o's theParagraphs)
	end repeat
	return o's numberedParagraphs
end getNumberedList

I’d prefer to use ASObjC, and this can be done with:

use framework "Foundation"
use scripting additions

set theFile to POSIX path of (choose file of type "txt")
set numberedArray to getNumberedArray(theFile)

on getNumberedArray(theFile)
	set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
	set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
	set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
	set numberedParagraphs to current application's NSMutableArray's new()
	repeat with i from 1 to theArray's |count|()
		set aParagraph to (theArray's objectAtIndex:(i - 1))
		set aNumberedParagraph to current application's NSString's stringWithFormat_("%@. %@", i, aParagraph)
		(numberedParagraphs's addObject:aNumberedParagraph)
	end repeat
	return numberedParagraphs
end getNumberedArray

With a test file containing 1,441 paragraphs, the first script took 42 milliseconds and the second 160 milliseconds. I was able to make the second script about 25 percent faster by modifying the repeat loop in various respects, but that wasn’t enough to approach the timing result of the script object solution. Is there a faster ASObjC solution I’ve not considered? Thanks.

Thanks Fredrik71 for the response. I was hoping for a fast ASObjC solution but perhaps that doesn’t exist. The nl utility is certainly fast; the following only took 8 milliseconds with my test file.

use framework "Foundation"
use scripting additions

set theFile to POSIX path of (choose file of type "txt")
set numberedArray to getNumberedArray(theFile)

on getNumberedArray(theFile)
	set theString to do shell script "nl -b a -w 5 -n rn -s '. ' " & quoted form of theFile
	set theString to current application's NSMutableString's stringWithString:theString
	(theString's replaceOccurrencesOfString:"(?m)^\\h*" withString:"" options:1024 range:{0, theString's |length|()}) -- trim leading spaces if desired
	set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
	set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
end getNumberedArray
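
For anyone unfamiliar with the nl options used above, here is a minimal shell sketch of the same call run outside AppleScript. The function name number_with_nl is made up for illustration, and the sed trim only stands in for the NSMutableString regex replacement. As far as the options go: -b a numbers every line (empty ones included), -n rn with -w 5 right-justifies the number in five columns without leading zeros, and -s sets the separator placed after the number.

```shell
# Hypothetical helper: number every line of stdin the way the nl call above does,
# then strip the padding that -w 5 puts in front of the number.
number_with_nl() {
  nl -b a -w 5 -n rn -s '. ' | sed 's/^[[:space:]]*//'
}

# Three lines, the middle one empty; the empty line still gets a number.
printf 'alpha\n\nbeta\n' | number_with_nl
```

Only the padding in front of the number is trimmed; any indentation inside a paragraph comes after the ". " separator and is left alone.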

Hi peavine.

Here’s an alternative to your ASObjC that’s a bit faster on my Mojave machine but still not as fast as the vanilla code:

use framework "Foundation"
use scripting additions

set theFile to POSIX path of (choose file of type "txt")
set numberedArray to getNumberedArray(theFile)

on getNumberedArray(theFile)
	set theString to current application's NSMutableString's stringWithContentsOfFile:theFile usedEncoding:(missing value) |error|:(missing value)
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:"(?m)^" options:0 |error|:(missing value)
	script o
		property insertionPoints : ((theRegex's matchesInString:theString options:0 range:{0, theString's |length|()})'s valueForKey:"range") as list
	end script
	repeat with i from (count o's insertionPoints) to 1 by -1
		(theString's insertString:((i as text) & ". ") atIndex:(o's insertionPoints's item i's location))
	end repeat
	return (theString as text)'s paragraphs
end getNumberedArray

You could possibly make the vanilla version a tad faster still by putting the edited strings back into the source list instead of building a new one.

Addendum: I’ve discovered by chance this morning that (in Mojave at least) splitting text using the newlineCharacterSet doesn’t allow for the possibility of the text having CRLF line endings. In such cases, it’s split on both the returns and the linefeeds and every alternate item in the resulting array is an empty string. The two fastest ways round this seem to be either to filter the array after the split or to change the line endings before it. I can’t find any consistent time difference between the two, nor does checking if they’re necessary first seem to make any significant difference.
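
The effect can be imitated in the shell. This is not Foundation code, only a rough sketch of the symptom: tr is used so that every return and every linefeed ends a piece on its own, and the function names are invented for the example.

```shell
# Imitation only: turning each CR into its own LF means a CRLF pair
# ends two pieces instead of one, as when splitting on each character.
split_on_each_newline_char() {
  tr '\r' '\n'
}

# Print each piece in brackets so that empty ones are visible.
show_pieces() {
  awk '{ printf "[%s]\n", $0 }'
}

# CRLF input: every second piece comes out empty.
printf 'one\r\ntwo\r\n' | split_on_each_newline_char | show_pieces

# Dropping the empty pieces afterwards mirrors the "length > 0" filter idea.
printf 'one\r\ntwo\r\n' | split_on_each_newline_char | awk 'length($0) > 0' | show_pieces
```

Note that filtering also discards paragraphs that were genuinely empty in the source text, which changes the numbering.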

Filtering any empty strings from the array:

on getNumberedArray(theFile)
	set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
	set theDelimiters to current application's NSCharacterSet's newlineCharacterSet()
	set theArray to theString's componentsSeparatedByCharactersInSet:theDelimiters
	set theFilter to current application's NSPredicate's predicateWithFormat:"length > 0" -- Added.
	set theArray to theArray's filteredArrayUsingPredicate:theFilter -- Added.
	set numberedParagraphs to current application's NSMutableArray's new()
	repeat with i from 1 to theArray's |count|()
		set aParagraph to (theArray's objectAtIndex:(i - 1))
		set aNumberedParagraph to current application's NSString's stringWithFormat_("%@. %@", i, aParagraph)
		(numberedParagraphs's addObject:aNumberedParagraph)
	end repeat
	return numberedParagraphs as list
end getNumberedArray

Ensuring the line endings match those in newlineCharacterSet:

on getNumberedArray(theFile)
	set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
	set theString to theString's stringByReplacingOccurrencesOfString:(return & linefeed) withString:linefeed --Added.
	set theDelimiters to current application's NSCharacterSet's newlineCharacterSet()
	set theArray to theString's componentsSeparatedByCharactersInSet:theDelimiters
	set numberedParagraphs to current application's NSMutableArray's new()
	repeat with i from 1 to theArray's |count|()
		set aParagraph to (theArray's objectAtIndex:(i - 1))
		set aNumberedParagraph to current application's NSString's stringWithFormat_("%@. %@", i, aParagraph)
		(numberedParagraphs's addObject:aNumberedParagraph)
	end repeat
	return numberedParagraphs as list
end getNumberedArray

Thanks, Nigel. Your suggestion is a significant improvement over my ASObjC script. I reran my timing tests and the results were:

Basic AppleScript with script object (peavine) - 46 milliseconds
Basic AppleScript with script object (see below) - 46 milliseconds
ASObjC (peavine) - 180 milliseconds
ASObjC (Nigel) - 127 milliseconds
do shell script with nl utility - 9 milliseconds

The revised basic AppleScript script wasn’t any faster, but it’s a bit more compact, which is a win.

set theFile to (choose file of type "txt")
set numberedList to getNumberedList(theFile)

on getNumberedList(theFile)
	set theText to paragraphs of (read theFile)
	script o
		property theParagraphs : theText
	end script
	repeat with i from 1 to (count o's theParagraphs)
		set theParagraph to (i as text) & ". " & (item i of o's theParagraphs)
		set item i of o's theParagraphs to theParagraph
	end repeat
	return o's theParagraphs
end getNumberedList

I thought I had all bases covered with newlineCharacterSet, but I see now that it doesn’t treat CRLF as a single line ending. Your second suggestion, which uses stringByReplacingOccurrencesOfString, seems a simple and effective fix.

Hello,

As I’ve noticed in the past, coercions inside a repeat loop cost time. So I wrote the following script, which seems to be faster than those above. It is about 4-5 times faster because no coercions are applied inside the repeat loop. The test text has 5473 paragraphs:


set theFile to alias "Apple HD:Users:123:Desktop:TESTfiles:file-sample_500kB.txt"
set numberedList to getNumberedList(theFile)

on getNumberedList(theFile)
	script o
		property theParagraphs : {}
		property numberedParagraphs : {}
	end script
	set o's theParagraphs to paragraphs of (read theFile)
	set aCount to count o's theParagraphs
	-- Generate numbers list in text form
	set o's numberedParagraphs to paragraphs of (do shell script "/usr/bin/jot " & quoted form of (aCount as Unicode text) & " '1' - '1'")
	-- Get numbered paragraphs for the original text
	repeat with i from 1 to aCount
		set item i of o's numberedParagraphs to item i of o's numberedParagraphs & ". " & (item i of o's theParagraphs)
	end repeat
	return o's numberedParagraphs
end getNumberedList

KniazidisR, thanks for the suggestion. I tested your script and the result was 12 milliseconds.

@Fredrik71,

I compared my solution (the “jot” solution) with yours (the “nl” solution) in Script Geek.app (10 runs). Mine was 1.3 times faster for a 500 KB text file (5473 paragraphs). Perhaps the “nl” solution becomes faster on huge text files. The “nl” solution was tested in this form:


use framework "Foundation"
use scripting additions

set theFile to "/Users/123/Desktop/TESTfiles/file-sample_500kB.txt"
set numberedArray to getNumberedArray(theFile)

on getNumberedArray(theFile)
	set theString to do shell script "nl -b a -w 5 -n rn -s '. ' " & quoted form of theFile
	set theString to current application's NSMutableString's stringWithString:theString
	(theString's replaceOccurrencesOfString:"(?m)^\\h*" withString:"" options:1024 range:{0, theString's |length|()}) -- trim leading spaces if desired
	set theDelimiters to (current application's NSCharacterSet's newlineCharacterSet())
	set theArray to (theString's componentsSeparatedByCharactersInSet:theDelimiters)
end getNumberedArray

NOTE: I will test on bigger text files to make sure.

UPDATE (additional testing result):

I tested both scripts with the largest txt file (Bible), which is 4 MB and has 30384 paragraphs. Again in Script Geek.app, 10 runs.

To my amazement, the “nl” solution did not perform better, but much worse - 3.6 times slower than the “jot” solution.

When I added the conversion of the NSString array to a list, the result was even worse - 4.6 times slower than the “jot” solution.

Sed appears to be another good alternative. It took 9 milliseconds with my test file.

set theFile to POSIX path of (choose file)
set numberedLines to do shell script "sed = " & quoted form of theFile & " | sed " & quoted form of "N;s/\\n/\\. /"
set numberedLines to paragraphs of numberedLines

https://edoras.sdsu.edu/doc/sed-oneliners.html

I think this is not just another good solution, but the best of all the solutions proposed here in terms of conciseness and speed.

I just tested it on bible.txt paragraph numbering and found it to be 1.25 times faster than my script.

Wow! I thought I knew a bit about sed, but this is really sneaky. The first sed command outputs a text with the line numbers and paragraphs on alternate lines. The second joins each pair of lines from that text in its pattern space and replaces the joining linefeed with ". ". Brilliant.
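
The two stages can be watched separately in a terminal. A small sketch, with function names made up for the purpose and a two-line sample instead of a file:

```shell
# Stage 1: "sed =" prints each line number on a line of its own,
# followed by the line itself.
number_on_alternate_lines() {
  sed =
}

# Stage 2: N appends the next input line to the pattern space, then the
# linefeed now sitting between number and text is replaced with ". ".
join_number_and_text() {
  sed 'N;s/\n/. /'
}

# Output of stage 1 alone: numbers and text on alternate lines.
printf 'alpha\nbeta\n' | number_on_alternate_lines

# Output of both stages: one numbered line per input line.
printf 'alpha\nbeta\n' | number_on_alternate_lines | join_number_and_text
```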

sed only recognises linefeeds as line separators in its input, so to be able to handle text with LF, CR, or CRLF line endings, the shell script needs a further initial stage. Here, any returns at the ends of what sed thinks are linefeed-delimited lines are zapped and any other returns are replaced with linefeeds:

set theFile to POSIX path of (choose file)
set numberedLines to do shell script "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & ¬
	quoted form of theFile & " | sed =  | sed " & quoted form of "N;s/\\n/\\. /"
set numberedLines to paragraphs of numberedLines
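
Here is the same normalising stage as a stand-alone shell sketch for trying out on mixed input. It is a variation rather than the exact command above: the return character comes from printf instead of the shell’s $'\r' quoting, the second substitution is done with tr, and the function name is invented.

```shell
# Hypothetical helper: convert CRLF and lone CR line endings on stdin to LF.
normalize_line_endings() {
  cr=$(printf '\r')
  # First zap a return at the end of each linefeed-delimited line (CRLF -> LF),
  # then turn any returns that are left (lone CRs) into linefeeds.
  sed "s/${cr}\$//" | tr '\r' '\n'
}

# Mixed input: a CRLF ending, a lone CR, and a plain LF.
printf 'a\r\nb\rc\n' | normalize_line_endings | sed = | sed 'N;s/\n/. /'
```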

Yet another text editing utility! :o

I vaguely remembered having written some kind of counter with sed years ago and managed to find it here last night. It’s actually a one-pass line numbering script, written for the challenge in the context of the topic. It’s not quite as fast as the two-pass method peavine turned up and it’s only good for up to 999 lines, but it was fun working it out. Here it is again, revamped for the current context (including an additional pass at the beginning to standardise the line endings) and capable of going up to 99999.

set theFile to POSIX path of (choose file)

set cmd to "sed -E 's/'$'\\r''$//;s/'$'\\r''/\\'$'\\n''/g' " & ¬
	quoted form of theFile & " | sed -En '
1 {
	# Before editing the first line, create a counter and put it in the hold space. The 50 digits notionally constitute the five rotating wheels of a five-digit decimal meter, the first digit of each ten being the “visible” one. The initial reading is 00001. Set-up: copy the first line of the text to the hold space, replace it in the pattern space with the meter, swap the hold and pattern space contents.
	h
	s/^.*$/01234567890123456789012345678901234567891234567890/
	x
}
# Append the current line to the meter in the hold space and copy the result back to the pattern space.
H
g
# Lose the non-visible digits and substitute “. ” for the linefeed between the meter and the text line.
s/^(.).{9}(.).{9}(.).{9}(.).{9}(.).{9}\\n/\\1\\2\\3\\4\\5. /
# Lose any leading zeros and print what’s left.
s/^0*//p
# Get the meter/line combination again from the hold space.
g
# Left-rotate units digits by one and lose the line text.
s/^(.{40})(.)(.{9}).+/\\1\\3\\2/
# If this makes the “visible” units digit 0, rotate the tens.
/^.{40}0/ {
	s/^(.{30})(.)(.{9})/\\1\\3\\2/
	# If this makes the visible tens digit 0, rotate the hundreds.
	/^.{30}0/ {
		s/^(.{20})(.)(.{9})/\\1\\3\\2/
		# If this makes the visible hundreds digit 0, rotate the thousands.
		/^.{20}0/ {
			s/^(.{10})(.)(.{9})/\\1\\3\\2/
			# If this makes the visible thousands digit 0, rotate the ten thousands.
			/^.{10}0/ s/^(.)(.{9})/\\2\\1/
		}
	}
}
# Overwrite the hold space with the updated meter.
h'"
set numberedLines to paragraphs of (do shell script cmd)
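
The rotation at the heart of the counter is easier to see on a single wheel. A minimal sketch, assuming just one ten-digit wheel instead of the five in the script above; rotate_wheel is a name made up for the example:

```shell
# Rotate one ten-digit wheel left by one place: the first ("visible")
# digit moves to the end and the next digit becomes visible.
rotate_wheel() {
  sed -E 's/^(.)(.{9})/\2\1/'
}

wheel=0123456789
for step in 1 2 3; do
  wheel=$(printf '%s\n' "$wheel" | rotate_wheel)
  # %.1s prints only the first character, i.e. the visible digit.
  printf 'visible digit: %.1s  wheel: %s\n' "$wheel" "$wheel"
done
```

When a rotation brings 0 back to the front, the wheel has completed a full turn, which is the cue in the full script to rotate the next wheel to the left.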