Assuming the source text contains only strings and positive or negative integers:
extractIntegersFromText("One two three -1234 four five 890 six 56 seven eight 989775nine.")
on extractIntegersFromText(theText)
set integersList to paragraphs of ¬
(do shell script "echo " & quoted form of theText & " | grep -o -E '[+-]?[0-9]+'")
set ATID to AppleScript's text item delimiters
set AppleScript's text item delimiters to ","
set integersList to run script ("{" & integersList & "}")
set AppleScript's text item delimiters to ATID
return integersList
end extractIntegersFromText
NOTE:
A negative integer is detected only when the minus sign (-) is immediately followed by digits, with no spaces or other non-digit characters in between.
In general, I don’t: there’s no need to “return” a result when I’m working in JavaScript (and this is just pure JavaScript, no JXA involved).
I guess that you mean: If I want to call this from an AppleScript script, how do I get the result into my script?
Like so:
set scriptCode to "[...'One two three 1234 four five 890 six 56 seven eight 989775nine.'.matchAll(/\\d+/g)].map(x => +x)"
set result to run script scriptCode in "JavaScript"
or so
set txt to "One two three 1234 four five 890 six 56 seven eight 989775nine."
set scriptCode to "[... '" & txt & "'.matchAll(/\\d+/g)].map(x => +x)"
set result to run script scriptCode in "JavaScript"
Explanation:
matchAll with a regular expression (\d+ here) and the global flag set returns an “iterator” of arrays: each array contains an element for each of the submatches. In this case, each inner array contains simply the matched string: [“1234”], [“890”], [“56”], [“989775”]. Kind of a list of lists in AppleScript, but not quite.
We want that as a real array, which is achieved by [... ]. That gives us [[“1234”], [“890”], [“56”], [“989775”]], an array of one-element arrays of strings.
But we want an array of integers, so map builds a new array out of the old one by converting each element to a number, which is achieved by prepending a ‘+’ (that’s lingo, one could also use a method call here). Applied to a one-element array, ‘+’ first turns it into its string form (“1234”) and then into the number.
Note that one has to use \\d+ here instead of \d+ as is customary in JavaScript. That’s because it is part of an AppleScript string, which needs an escaped backslash to preserve it. There’s no need to explicitly return anything, since the result of the JavaScript is the result of the last expression. And JavaScript arrays are automagically converted to AppleScript lists, so that the result in AS is {1234, 890, 56, 989775}.
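To make the steps in the explanation easier to follow, here is the same one-liner taken apart in plain JavaScript. This is only a sketch for illustration: the variable names (text, iterator, matches, strings, integers) are mine and not part of the scripts above, and it uses \d+ with a single backslash because it is not embedded in an AppleScript string.

```javascript
// Sample text from the thread.
const text = "One two three 1234 four five 890 six 56 seven eight 989775nine.";

// Step 1: matchAll with a global regex returns an iterator, not an array.
const iterator = text.matchAll(/\d+/g);

// Step 2: spreading the iterator gives a real array of match arrays.
// Each match array holds the matched string at index 0.
const matches = [...iterator];
const strings = matches.map(m => m[0]);

// Step 3: unary plus coerces each match array to its string form
// and then to a number, exactly as in the one-liner above.
const integers = matches.map(x => +x);

console.log(strings);  // [ '1234', '890', '56', '989775' ]
console.log(integers); // [ 1234, 890, 56, 989775 ]
```

If the coercion in step 3 feels too implicit, Number(m[0]) does the same job and says what it means.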
The difference between the two scripts is only in where the text is built: The first script makes it part of the JavaScript string, the second one sets it in AppleScript and then concatenates it into the JavaScript code. In this case, the single quotes must be added before and after the txt parameter.
JavaScript knows of three ways to quote strings: Single quote, double quote and back quote (for interpolated strings). That makes it fairly easy to build JS code in AS: Include all in double quotes and use single or back quotes in the JavaScript code. Also, quoted form of is not needed here.
This is similar to KniazidisR’s script at the top, but uses sed instead of grep + AS.
extractIntegersFromText("One 1.2 three 1234 four,
five 890 six 56 seven 1.23E+4 eight 989775nine.
Another line.")
on extractIntegersFromText(theText)
set integersList to ¬
(do shell script "echo " & quoted form of theText & " | sed -En '
s/[0-9]+[.,][0-9]+([Ee][+-]?[0-9]+)?|[^0-9]+/,/g ; # Replace reals and non-digit runs with commas.
H ; # Append this line to the hold space.
$ { # If this is the last line:
g ; # Retrieve the text from the hold space.
s/\\n//g ; # Zap the linefeed(s).
s/,,*/,/g ; # Replace any runs of commas with single instances.
s/^,?/{/; # Lose any spare comma at the beginning and prepend {.
s/,?$/}/p ; # Ditto at the end, append }, and print.
} ;'")
return (run script integersList)
end extractIntegersFromText
use framework "Foundation"
use scripting additions
set theString to "One two three 1234 four five 890 six 56 seven eight 989775nine."
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to "\\D+" -- non-digit characters
(theString's replaceOccurrencesOfString:thePattern withString:" " options:1024 range:{0, theString's |length|()})
set thePattern to "^\\s+|\\s+$" -- white space characters
(theString's replaceOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()})
return ((theString's componentsSeparatedByString:space)'s valueForKey:"integerValue") as list
--> {1234, 890, 56, 989775}
It’s a bit slower, although in this case the difference is less than a millisecond (0.4 versus 0.7 milliseconds with the Foundation framework in memory):
use framework "Foundation"
use scripting additions
set theString to "One two three 1234 four five 890 six 56 seven eight 989775nine."
set theString to current application's NSString's stringWithString:theString
set thePattern to "\\d+"
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
set theRanges to (regexResults's valueForKey:"range")
set theMatches to current application's NSMutableArray's new()
repeat with aRange in theRanges
(theMatches's addObject:(theString's substringWithRange:aRange))
end repeat
return (theMatches's valueForKey:"integerValue") as list
set {ptids, integersList, text item delimiters} to {text item delimiters, paragraphs of (do shell script "echo " & quoted form of theText & " | grep -o -E '[0-9]+'"), ","}
set {integersList, text item delimiters} to {run script ("{" & integersList & "}"), ptids}
And here’s another AS way to skin this cat.
on extractIntegersFromText2(theText)
set text item delimiters to characters of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
set {theText, text item delimiters} to {text items of theText, ""}
set {theText, text item delimiters} to {theText as text, ","}
run script ("{" & (words of (theText as text)) as text) & "}"
end extractIntegersFromText2
Can any of the methods posted to date handle negative integers?
By modifying the regular expression like so -?\d+ that should be easy: an optional minus sign followed by one or more digits. Or even [-+]?\d+ to accommodate an optional plus sign as well.
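A quick sketch of the signed pattern in plain JavaScript, in case it helps. The function name extractSignedIntegers and the sample text are made up for this illustration; only the [-+]?\d+ pattern comes from the suggestion above.

```javascript
// Hypothetical helper (not from the thread): extract signed integers from text.
function extractSignedIntegers(text) {
  // [-+]? is an optional sign, \d+ is one or more digits.
  // The sign only counts when it sits directly in front of the digits.
  return [...text.matchAll(/[-+]?\d+/g)].map(m => Number(m[0]));
}

console.log(extractSignedIntegers("One -1234 two +890 three 56 four - 7 five 3-4"));
// [ -1234, 890, 56, 7, 3, -4 ]
```

Two things to note in the output: “- 7” gives 7 because of the space after the minus sign, and “3-4” gives 3 and -4 because the regex cannot tell a hyphen between two numbers from a sign.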
How does that work with a string like “123 446 $22”? (The code gives an error here, probably because of a non-matching parenthesis on the run script line).
Also, your first variant puts out a trailing comma.
I see (not that I’d care about the performance here).
The logic imposed by NSRegularExpression is really a bit contrived – the right thing would be to return the matched string in the matchesInString result as well. Instead of doing that, they force the programmer to extract the matches manually.
I tried to increase the size of the test string, but both KniazidisR’s and Nigel’s scripts returned a stack overflow error. I don’t know the reason for this, but it might have something to do with the test script.
The test script with Nigel’s suggestion:
use framework "Foundation"
use scripting additions
-- untimed code
set theString to "One two three 1234 four five 890 six 56 seven eight 989775nine." & linefeed
repeat 10 times
set theString to theString & theString
end repeat
-- start time
set startTime to current application's CACurrentMediaTime()
-- timed code
set theIntegers to extractIntegersFromText(theString)
on extractIntegersFromText(theText)
set integersList to ¬
(do shell script "echo " & quoted form of theText & " | sed -En '
s/[0-9]+\\.[0-9]+([Ee][+-]?[0-9]+)?|[^0-9]+/,/g ; # Replace reals and non-digit runs with commas.
H ; # Append this line to the hold space.
$ { # If this is the last line:
g ; # Retrieve the text from the hold space.
s/\\n//g ; # Zap the linefeed(s).
s/,,*/,/g ; # Replace any runs of commas with single instances.
s/^,?/{/; # Lose any spare comma at the beginning and prepend {.
s/,?$/}/p ; # Ditto at the end, append }, and print.
} ;'")
return (run script integersList)
end extractIntegersFromText
-- elapsed time
set elapsedTime to (current application's CACurrentMediaTime()) - startTime
set numberFormatter to current application's NSNumberFormatter's new()
if elapsedTime > 1 then
numberFormatter's setFormat:"0.000"
set elapsedTime to ((numberFormatter's stringFromNumber:elapsedTime) as text) & " seconds"
else
(numberFormatter's setFormat:"0")
set elapsedTime to ((numberFormatter's stringFromNumber:(elapsedTime * 1000)) as text) & " milliseconds"
end if
-- result
elapsedTime --> 71 milliseconds
# count paragraphs of theString --> 1025
# count theIntegers --> 4096
BTW, I didn’t test the JavaScript suggestions because they won’t run in the test script. I’m sure they would be as fast or faster than the ASObjC solution.
“Words of” disregards most punctuation but not $. Add any desired non-numeric characters to the tids definition. I have a version that discovers all non-numeric input characters and adds those to the tids, but I didn’t post that version as I was just showing another approach.
on extractIntegersFromText(theText)
set theAlphabet to characters of " ABCDEFGHIJKLMNOPQRSTUVWXYZ"
set text item delimiters to {"1", "2", "3", "4", "5", "6", "7", "8", "9", "0"}
set {textCopy, text item delimiters} to {(text items of theText), ""}
set {textCopy, text item delimiters} to {textCopy as text, theAlphabet}
set {textCopy, text item delimiters} to {text items of textCopy, ""}
set text item delimiters to theAlphabet & (textCopy as text)
set {theText, text item delimiters} to {text items of theText, ","}
set theText to theText as text
run script "{" & ((words of (theText as text)) as text) & "}"
end extractIntegersFromText
I don’t see a trailing comma generated from the two-line code. What input are you using that does?
I think there’s a limit to how much text can be included in the text of a ‘do shell script’ command, but I don’t remember how much that is now. In both KniazidisR’s script and mine the text to be parsed is included in the command text, so if it’s one of your extreme length efforts, that might be the reason. If the text is saved to a file and read from there by the running shell script, it can be much longer.
This is a minor variation on @peavine’s first ASObjC script above. The differences are that the first regex pattern, like my sed script, also weeds out any “reals” represented in the text, which would otherwise be identified as two separate integers, and the second pattern contains an additional search term to catch any extra internal spaces caused by the first. As in the other scripts, it’s assumed that none of the numbers contain grouping separators.
use framework "Foundation"
use scripting additions
set theString to "One 1.2 three 1234 four five 890 six 56 seven eight 989775nine."
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to "\\D++|\\d++[.,]\\d++(?:[Ee][+-]?\\d++)?" -- non-digit characters and reals
(theString's replaceOccurrencesOfString:thePattern withString:" " options:1024 range:{0, theString's |length|()})
set thePattern to "(?<=^| ) ++| ++$" -- spaces
(theString's replaceOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()})
return ((theString's componentsSeparatedByString:space)'s valueForKey:"integerValue") as list
--> {1234, 890, 56, 989775}
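For comparison, the same two-pass idea (blank out reals and non-digit runs, then split on spaces) can be sketched in plain JavaScript. This is an adaptation, not a translation: JavaScript regular expressions have no possessive quantifiers, so the ++ of the ICU pattern becomes a plain +, and the real-number alternative is tried first. The function name extractIntegersSkippingReals is hypothetical.

```javascript
// Hypothetical helper mirroring the two-pass approach of the ASObjC script.
function extractIntegersSkippingReals(text) {
  // Pass 1: replace reals (1.2, 1,5, 1.23E+4) and runs of non-digits
  // with a single space each.
  const realsOrNonDigits = /\d+[.,]\d+(?:[Ee][+-]?\d+)?|\D+/g;
  const spaced = text.replace(realsOrNonDigits, " ");

  // Pass 2: drop leading/trailing spaces, split on runs of spaces,
  // discard the empty string left by an all-non-digit input, and convert.
  return spaced.trim().split(/ +/).filter(s => s !== "").map(Number);
}

console.log(extractIntegersSkippingReals(
  "One 1.2 three 1234 four five 890 six 56 seven 1.23E+4 eight 989775nine."
));
// [ 1234, 890, 56, 989775 ]
```

As with the scripts above, this assumes that none of the numbers contain grouping separators: something like “1,234” would be taken for a real and dropped.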
I ran KniazidisR’s script with a 111 kb file (using ‘cat’ rather than ‘echo’ to ingest it) and it resulted in a stack overflow error (-2706) at this point:
run script ("{" & integersList & "}")
In the log history, the long list that is generated includes the entirety of the text’s numerals.
When I run the script with a smaller text of 83 kb, the error does not occur. FWIW, both texts are csvs from sports stats sites… a full season’s NFL schedule and a full season of hitting stats for baseball, so lots of numbers to work with. When saving the long list, the resulting text files are 77 kb (from 111) and 13 kb (from 83).
I know ARG_MAX affects approximately how many bytes (not necessarily how many characters) can be used in the shell script text in a ‘do shell script’ command, but I’ve no idea if it applies to ‘run script’ as well.