You might be better off using Perl, which has these operators for regexes.
That’s the way I went on this, far less aggravation in the end.
The OP has decided on a solution and doesn’t need any more suggestions. However, I wanted to write a solution using basic AppleScript and decided to post it here FWIW. With test strings containing 33 and 1025 paragraphs, the timing results were 20 and 332 milliseconds, although the latter result could probably be reduced 90 percent or more by using script objects.
set theString to "this is a sentence. this is a sentence. this is a sentence.
this is a sentence. this is a sentence. this is a sentence.
this is a sentence. this is a sentence. this is a sentence."
set capitalizedString to getCapitalizedString(theString)
on getCapitalizedString(theString)
set theParagraphs to paragraphs of theString
set {TID, text item delimiters} to {text item delimiters, {". "}}
repeat with aParagraph in theParagraphs
set theSentences to text items of aParagraph
repeat with aSentence in theSentences
try
set theOffset to offset of (character 1 of aSentence) in "abcdefghijklmnopqrstuvwxyz"
set theCapitalizedCharacter to character theOffset of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
set contents of aSentence to theCapitalizedCharacter & text 2 thru -1 of aSentence
end try
end repeat
set contents of aParagraph to theSentences as text
end repeat
set text item delimiters to linefeed
set capitalizedString to theParagraphs as text
set text item delimiters to TID
return capitalizedString as text
end getCapitalizedString
Modern sed can be case-insensitive.
GNU sed, and (I think) the sed on macOS after Catalina, support the case-insensitive I flag.
(I’m stuck on Mojave and cannot verify that sed was updated, but I’ve been so informed.)
#!/usr/bin/env bash
STR='What a wonderful world it would be...'
gsed -E 's!WONDERFUL!WONDERFUL!I' <<< "$STR"
Of course this doesn’t help with the OP’s uppercase/lowercase issues.
I would stick with Perl for this job.
FWIW, I rewrote my basic AppleScript with ASObjC. With 50 or fewer paragraphs, the timing result was under 20 milliseconds, but with 1025 paragraphs the result was 440 milliseconds.
use framework "Foundation"
use scripting additions
set theString to "this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. this is a sentence."
set capitalizedString to getCapitalizedString(theString)
on getCapitalizedString(theString)
set theString to current application's NSString's stringWithString:theString
set theParagraphs to (theString's componentsSeparatedByString:linefeed)
repeat with aParagraph in theParagraphs
set theSentences to (aParagraph's componentsSeparatedByString:". ")
repeat with aSentence in theSentences
set theWords to (aSentence's componentsSeparatedByString:" ")'s mutableCopy()
set firstWord to (theWords's objectAtIndex:0)'s capitalizedString()
(theWords's replaceObjectAtIndex:0 withObject:firstWord)
set contents of aSentence to (theWords's componentsJoinedByString:(" "))
end repeat
set contents of aParagraph to (theSentences's componentsJoinedByString:". ")
end repeat
return ((theParagraphs's componentsJoinedByString:linefeed) as text)
end getCapitalizedString
To lowercase the entire string except the first letter of the first word of each sentence, delete existing line 1 below and replace it with new line 2 below:
set theString to current application's NSString's stringWithString:theString
set theString to (current application's NSString's stringWithString:theString)'s lowercaseString()
I am stunned and humbled by your solutions. I truly have soooo much to learn. Your examples are really helpful. Thanks
This topic reminded me of a script I wrote some time ago which allows case-change codes to be used in replacement templates with ASObjC’s ICU regex. I’ve tidied it up and have just posted it in MacScripter’s Code Exchange forum.
Thanks for that I’ll have to sit down tonight and take a look at it.
Thanks Jeff
Nigel’s script provides a comprehensive NSRegularExpression solution to making case changes. The following is an NSRegularExpression solution specific to the problem posed in this post, using a slightly different approach from peavine’s. matchObjs is coded as a property to speed up execution of the repeat loop.
use framework "Foundation"
use scripting additions
property matchObjs : missing value
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(\\.\\s+[[:lower:]])" options:0 |error|:(missing value))
set my matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in my matchObjs
set currRange to currMatchObj's range()
set currSubstringObj to (strObj's substringWithRange:currRange)
(strObj's replaceCharactersInRange:currRange withString:(currSubstringObj's uppercaseString()))
end repeat
set selectedText to strObj as text
I’d suggest using [:lower:] instead of [a-z] because the former also works for accented characters. The latter will only match ASCII lowercase characters.
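To illustrate the difference, here is a minimal JavaScript sketch. It is not the ICU syntax used in the AppleScript above: JavaScript has no [:lower:], so the Unicode property escape \p{Ll} (with the u flag) stands in for it as an assumption of rough equivalence. The sample string is made up for the demonstration.

```javascript
// Sample text with an accented lowercase letter (U+00E9) starting a sentence.
const sample = "fin. \u00e9lan vital. ok";

// ASCII-only class: the accented letter after the first period is not matched.
const asciiOnly = sample.replace(
  /(\.\s+)([a-z])/g,
  (match, sep, ch) => sep + ch.toUpperCase()
);

// Unicode-aware class: \p{Ll} matches any lowercase letter, accented or not.
const unicodeAware = sample.replace(
  /(\.\s+)(\p{Ll})/gu,
  (match, sep, ch) => sep + ch.toUpperCase()
);

console.log(asciiOnly);    // only the ASCII "o" is capitalized
console.log(unicodeAware); // both sentence starts are capitalized
```

With the ASCII class the accented sentence start is silently skipped, which is the failure mode the suggestion is meant to avoid.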
Thank you for that helpful suggestion, chrillek. I made the change.
I was made aware of a mistake in my previous NSRegularExpression solution, namely that while it capitalizes the first letter following a period and space characters, it doesn’t make the rest of the sentence lowercase, as the poster requested. The following modified NSRegularExpression solution corrects that problem.
The key to its functionality is the regular expression pattern
(?:^\s*|\.\s+)([^.])([^.]*)
As a whole, the pattern matches a single sentence, beginning with the period preceding the sentence (or the start of the input string) and extending to but not including the period at the end of the sentence. The regular expression pattern’s first component
(?:^\s*|\.\s+)
is a non-capturing group that matches either zero or more whitespace characters at the start of the input string, or a period followed by one or more whitespace characters (spaces, tabs or line breaks). The second component
([^.])
is a capturing group that matches the first non-period character, i.e., the first character of the sentence. It corresponds to its match object’s rangeAtIndex:1. The third component
([^.]*)
is another capturing group that matches any subsequent number of non-period characters, i.e., the remaining characters of the sentence. It corresponds to its match object’s rangeAtIndex:2.
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set my matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in my matchObjs
set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:(sentenceRemainingChars's lowercaseString()))
end repeat
set selectedText to strObj as text
Using a modified version of peavine’s example,
" this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. t.
u. this is a Sentence. this is a sentence."
becomes
" This is a sentence. This is a sentence. This is a sentence.
This is a sentence. This is a sentence. T.
U. This is a sentence. This is a sentence."
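For readers who don’t use ASObjC, the same three-part pattern can be sketched in JavaScript. This is only an illustration under a couple of assumptions: the function name sentenceCase is hypothetical, and the leading group is made capturing here so that the replacement function can put the period and whitespace back.

```javascript
// Leading separator (start of input or period + whitespace),
// first character of the sentence, remaining characters of the sentence.
const SENTENCE = /(^\s*|\.\s+)([^.])([^.]*)/g;

// Hypothetical helper: uppercase the first character, lowercase the rest.
function sentenceCase(text) {
  return text.replace(
    SENTENCE,
    (match, lead, first, rest) => lead + first.toUpperCase() + rest.toLowerCase()
  );
}

const demo = sentenceCase("this is a Sentence. THIS is A sentencE.\nu. ok.");
console.log(demo);
```

Because the replacement function returns a new string, no ranges into the original string have to be tracked.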
This will inevitably lowercase proper names like Thomas or Susan and acronyms like UK and CPU, not to mention what it does to words like AppleScript. It might therefore be better to limit the code to uppercasing only the first letter.
chrillek, I agree completely. I posted the second version only to offer a possible solution to the question as it was asked.
My take on it in pure JavaScript (no JXA):
const testStr = ` this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. t.
u. this is a Sentence. this is a sentence.`;
(() => {
const RE = new RegExp("(^\\s*|[.!?]\\s+)([^.]{2,})","gms")
const result = testStr.replaceAll(RE, upperCase);
console.log(result);
})()
function upperCase(match, p1,p2) {
const uc = p2.substring(0,1).toLocaleUpperCase() + p2.substring(1);
return `${p1}${uc}`
}
Output
This is a sentence. This is a Sentence. This is a sentence.
This is a sentence. This is a Sentence. t.
u. This is a Sentence. This is a sentence.
That’s only an example, not necessarily the best one, to show how one can use a function in a call to replace/replaceAll to perform additional stuff. Here, it is uppercasing the first letter of the 2nd capturing group.
Differing from @bmose, I have to use a capturing group for the start of the string so I can output it again. I didn’t bother to lowercase uppercased words in the sentence, because that would potentially lead to too many mistakes, as mentioned here: Upper & Lower Case Sed Regex Problems - #18 by chrillek
Thank you for the JavaScript demo. I don’t know JavaScript but get the overall gist of what the script is doing.
Incidentally, when modifying text via NSRegularExpression as in my two examples above, I generally process the match objects returned by NSRegularExpression’s matchesInString:options:range: in reverse order so that any interim changes made to the NSMutableString object don’t break the ranges of match objects that are yet to be processed. I didn’t do so above because I didn’t expect that problem to arise in the specific examples shown. But generally I would recommend doing so. Thus, instead of:
set matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in matchObjs
-- [make changes to the mutable string object (strObj) using the current match object (currMatchObj)]
end repeat
I would recommend the following:
set matchObjsReversed to ((regexObj's matchesInString:strObj options:0 range:strRange) as list)'s reverse
repeat with currMatchObj in matchObjsReversed
-- [make changes to the mutable string object (strObj) using the current match object (currMatchObj)]
end repeat
Out of curiosity: are these match objects references/pointers into the original string? In Perl/JavaScript (and I suppose other languages with RE support as well) they are simply independent copies so that you can do with them whatever you want without harming other matches (or the original string).
Match objects are NSTextCheckingResult objects, which contain only range information and result type (an enum whose value = 1024 = NSTextCheckingTypeRegularExpression type, signifying that the match is a regular expression match). The range information consists of locations and lengths of matching substrings within the string at which you can make changes to the string. If you make a change that alters the length of the matching substring, and the next match’s ranges are to the right of that change, then the latter ranges will no longer point to the correct locations in the string. By processing the matches in reverse order, any changes that you make to a substring will not affect subsequent matches, because they point to locations to the left of the changes made in the string.
The following example is similar to the previous ones, except that it deletes all characters after the first character of a sentence. Matches are processed in reverse order:
set selectedText to "this is A SENtence. THIS is A sentencE.
this is A SENtence. THIS is A sentencE."
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set matchObjs to ((regexObj's matchesInString:strObj options:0 range:strRange) as list)'s reverse -- REVERSE ORDER
repeat with currMatchObj in matchObjs
set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:"") -- DELETES ALL BUT THE FIRST CHARACTER IN THE CURRENT SENTENCE
end repeat
set selectedText to strObj as text
-->
"T. T.
T. T."
If an attempt is made to process matches in forward order, a range out-of-bounds error occurs:
set selectedText to "this is A SENtence. THIS is A sentencE.
this is A SENtence. THIS is A sentencE."
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list -- FORWARD ORDER
repeat with currMatchObj in matchObjs
set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:"") -- DELETES ALL BUT THE FIRST CHARACTER IN THE CURRENT SENTENCE
end repeat
set selectedText to strObj as text
-->
-- ERROR: -[__NSCFString substringWithRange:]: Range {41, 17} out of bounds; string length 45
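The same effect can be reproduced outside Foundation. The following JavaScript sketch mimics the situation by first collecting ranges and only then editing the string. The helper names restRanges and deleteRanges are hypothetical, and the explicit bounds check is an assumption added to imitate the Foundation error (JavaScript’s slice would not complain by itself).

```javascript
// Collect {start, length} for the "rest of sentence" part of each match,
// analogous to rangeAtIndex:2 in the AppleScript examples.
function restRanges(text) {
  const re = /(^\s*|\.\s+)([^.])([^.]*)/g;
  const ranges = [];
  for (const m of text.matchAll(re)) {
    const start = m.index + m[1].length + m[2].length;
    ranges.push({ start, length: m[3].length });
  }
  return ranges;
}

// Delete each stored range in the order given, without recomputing anything.
function deleteRanges(text, ranges) {
  let out = text;
  for (const { start, length } of ranges) {
    if (start + length > out.length) {
      throw new RangeError(`Range {${start}, ${length}} out of bounds; string length ${out.length}`);
    }
    out = out.slice(0, start) + out.slice(start + length);
  }
  return out;
}

const sample = "this is A SENtence. THIS is A sentencE.";
const ranges = restRanges(sample);

// Reverse order: later ranges are edited first, so earlier ones stay valid.
const reversed = deleteRanges(sample, [...ranges].reverse());

// Forward order: the first deletion shifts the text under the second range.
let forwardError = null;
try {
  deleteRanges(sample, ranges);
} catch (e) {
  forwardError = e;
}

console.log(reversed);
console.log(forwardError && forwardError.message);
```

The ranges are plain numbers computed against the original text, so any edit that shortens the string invalidates every stored range to its right.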
Thanks for the explanation. That’s indeed a very different concept from the ones I’m used to where the original string is not modified and the start/end values for the matches thus don’t change.
Why not awk?
set teststr to "5 is a number. this is testing the code. it shoould
capitalize both \".\" & newline and \".\" and space. not to worry. we are testing.
a is character.
\"a\" is lowercase:-)
. starting line with \".\""
log capitalize(teststr)
on capitalize(s as string)
set awkprog to "'
BEGIN{
p=1
}
{
if(p) $0=toupper(substr($0,1,1)) substr($0,2)
s=\"\"
while(off=index($0, \". \")){
s=s substr($0,1, off+1) toupper(substr($0,off+2,1))
$0=substr($0,off+3)
}
print s $0
if(substr($NF,length($NF),1)==\".\") p=1
else p=0
}'"
return (do shell script "echo " & quoted form of s & "| awk " & awkprog)
end capitalize