Upper & Lower Case Sed Regex Problems

chrillek · March 27, 2023, 2:27pm

You might be better off using Perl which has these operators for regexes.

jpottsx1 · March 27, 2023, 2:33pm

That’s the way I went on this, far less aggravation in the end.

peavine · March 27, 2023, 9:38pm

The OP has decided on a solution and doesn’t need any more suggestions. However, I wanted to write a solution using basic AppleScript and decided to post it here FWIW. With test strings containing 33 and 1025 paragraphs, the timing results were 20 and 332 milliseconds, although the latter result could probably be reduced 90 percent or more by using script objects.

set theString to "this is a sentence. this is a sentence. this is a sentence.
this is a sentence. this is a sentence. this is a sentence.
this is a sentence. this is a sentence. this is a sentence."

set capitalizedString to getCapitalizedString(theString)

on getCapitalizedString(theString)
	set theParagraphs to paragraphs of theString
	set {TID, text item delimiters} to {text item delimiters, {". "}}
	repeat with aParagraph in theParagraphs
		set theSentences to text items of aParagraph
		repeat with aSentence in theSentences
			try
				set theOffset to offset of (character 1 of aSentence) in "abcdefghijklmnopqrstuvwxyz"
				set theCapitalizedCharacter to character theOffset of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
				set contents of aSentence to theCapitalizedCharacter & text 2 thru -1 of aSentence
			end try
		end repeat
		set contents of aParagraph to theSentences as text
	end repeat
	set text item delimiters to linefeed
	set capitalizedString to theParagraphs as text
	set text item delimiters to TID
	return capitalizedString as text
end getCapitalizedString

ccstone · March 28, 2023, 1:57am

Modern sed can be case-insensitive.

GNU sed, or sed on macOS after Catalina (I think), supports the case-insensitive I flag.

(I’m stuck on Mojave and cannot verify that sed was updated, but I’ve been so informed.)

#!/usr/bin/env bash

STR='What a wonderful world it would be...'
gsed -E 's!WONDERFUL!WONDERFUL!I' <<< "$STR"

Of course this doesn’t help with the OP’s uppercase/lowercase issues.

I would stick with Perl for this job.

peavine · March 28, 2023, 2:20am

FWIW, I rewrote my basic AppleScript with ASObjC. With 50 or fewer paragraphs, the timing result was under 20 milliseconds, but with 1025 paragraphs the result was 440 milliseconds.

use framework "Foundation"
use scripting additions

set theString to "this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. this is a sentence."

set capitalizedString to getCapitalizedString(theString)

on getCapitalizedString(theString)
	set theString to current application's NSString's stringWithString:theString
	set theParagraphs to (theString's componentsSeparatedByString:linefeed)
	repeat with aParagraph in theParagraphs
		set theSentences to (aParagraph's componentsSeparatedByString:". ")
		repeat with aSentence in theSentences
			set theWords to (aSentence's componentsSeparatedByString:" ")'s mutableCopy()
			set firstWord to (theWords's objectAtIndex:0)'s capitalizedString()
			(theWords's replaceObjectAtIndex:0 withObject:firstWord)
			set contents of aSentence to (theWords's componentsJoinedByString:(" "))
		end repeat
		set contents of aParagraph to (theSentences's componentsJoinedByString:". ")
	end repeat
	return ((theParagraphs's componentsJoinedByString:linefeed) as text)
end getCapitalizedString

To lowercase the entire string except the first letter of the first word of each sentence, replace the first line below (the existing line in the handler) with the second:

set theString to current application's NSString's stringWithString:theString
set theString to (current application's NSString's stringWithString:theString)'s lowercaseString()

jpottsx1 · March 28, 2023, 12:44pm

I am stunned and humbled by your solutions. I truly have soooo much to learn. Your examples are really helpful. Thanks

Nigel_Garvey · March 28, 2023, 2:14pm

This topic reminded me of a script I wrote some time ago which allows case-change codes to be used in replacement templates with ASObjC’s ICU regex. I’ve tidied it up and have just posted it in MacScripter’s Code Exchange forum.

jpottsx1 · March 31, 2023, 5:47pm

Thanks for that. I’ll have to sit down tonight and take a look at it.

Thanks Jeff

bmose · March 31, 2023, 6:30pm

Nigel’s script provides a comprehensive NSRegularExpression solution for making case changes. The following is an NSRegularExpression solution specific to the problem posed in this thread, using a slightly different approach from peavine’s. It assumes selectedText already holds the text to process. matchObjs is coded as a property to speed up execution of the repeat loop.

use framework "Foundation"
use scripting additions

property matchObjs : missing value

tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(\\.\\s+[[:lower:]])" options:0 |error|:(missing value))
set my matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in my matchObjs
	set currRange to currMatchObj's range()
	set currSubstringObj to (strObj's substringWithRange:currRange)
	(strObj's replaceCharactersInRange:currRange withString:(currSubstringObj's uppercaseString()))
end repeat

set selectedText to strObj as text

chrillek · March 31, 2023, 8:33pm

I’d suggest using [:lower:] instead of [a-z] because the former works also for accented characters. The latter will only match ASCII lowercase characters.
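
For anyone who wants to see the difference concretely, here is a minimal sketch in JavaScript. JavaScript regexes have no POSIX [:lower:] class, so the Unicode property \p{Ll} (lowercase letter) stands in for it; that substitution is an assumption of this sketch, not a statement about how NSRegularExpression implements [:lower:]. The name isLower is made up for the example.

```javascript
// [a-z] covers only the 26 ASCII lowercase letters.
const asciiLower = /^[a-z]$/;
// \p{Ll} (requires the u flag) covers Unicode lowercase letters. It is used
// here as a stand-in for [:lower:], which JavaScript regexes do not support.
const unicodeLower = /^\p{Ll}$/u;

// Hypothetical helper: report how each character class treats one character.
function isLower(ch) {
  return { ascii: asciiLower.test(ch), unicode: unicodeLower.test(ch) };
}

console.log(isLower("e"));      // { ascii: true, unicode: true }
console.log(isLower("\u00e9")); // é: { ascii: false, unicode: true }
console.log(isLower("E"));      // { ascii: false, unicode: false }
```

So a sentence starting with "é" would be skipped by an [a-z] pattern but caught by a Unicode-aware one.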

bmose · March 31, 2023, 9:00pm

Thank you for that helpful suggestion, chrillek. I made the change.

bmose · April 1, 2023, 5:29am

I was made aware of a mistake in my previous NSRegularExpression solution, namely that while it capitalizes the first letter following a period and space characters, it doesn’t make the rest of the sentence lowercase, as the poster requested. The following modified NSRegularExpression solution corrects that problem.

The key to its functionality is the regular expression pattern

(?:^\s*|\.\s+)([^.])([^.]*)

As a whole, the pattern matches a single sentence, beginning with the period preceding the sentence (or the start of the input string) and extending to but not including the period at the end of the sentence. The regular expression pattern’s first component

(?:^\s*|\.\s+)

is a non-capturing group that matches either zero or more spaces at the start of the input string, or a period followed by one or more spaces. The second component

([^.])

is a capturing group that matches the first non-period character, i.e., the first character of the sentence. It corresponds to its match object’s rangeAtIndex:1. The third component

([^.]*)

is another capturing group that matches any subsequent number of non-period characters, i.e., the remaining characters of the sentence. It corresponds to its match object’s rangeAtIndex:2.

tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set my matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in my matchObjs
	set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
	set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
	(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
	(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:(sentenceRemainingChars's lowercaseString()))
end repeat
set selectedText to strObj as text

Using a modified version of peavine’s example,

" this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. t.
u. this is a Sentence. this is a sentence."

becomes

" This is a sentence. This is a sentence. This is a sentence.
This is a sentence. This is a sentence. T.
U. This is a sentence. This is a sentence."
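
To see which part of the text each group captures, the same pattern can be run in JavaScript, which (as far as I know) treats these particular constructs the same way on plain ASCII input. This is only an illustrative sketch: listGroups is a hypothetical helper, and the location/length pairs are computed by hand to mimic what rangeAtIndex:1 and rangeAtIndex:2 would report.

```javascript
// Same pattern as above; the g flag lets matchAll walk every sentence.
const sentenceRE = /(?:^\s*|\.\s+)([^.])([^.]*)/g;

// Hypothetical helper: for each match, describe group 1 (first character)
// and group 2 (rest of the sentence) as text plus location and length.
function listGroups(text) {
  return [...text.matchAll(sentenceRE)].map((m) => {
    // The two groups sit at the end of the match, after the non-capturing prefix.
    const firstLoc = m.index + m[0].length - m[1].length - m[2].length;
    return {
      first: { text: m[1], location: firstLoc, length: m[1].length },
      rest: { text: m[2], location: firstLoc + m[1].length, length: m[2].length },
    };
  });
}

for (const g of listGroups(" one two. Three. t.")) {
  console.log(g.first, g.rest);
}
```

Note that for the one-letter sentence "t." the second group is empty (length 0), which is why "t." and "u." still get capitalized in the output above.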

chrillek · April 1, 2023, 6:30am

This will inevitably lowercase proper names like Thomas or Susan and acronyms like UK and CPU, not to mention what it does to mixed-case words like AppleScript. It might therefore be better to limit the code to uppercasing only the first letter.
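
To make that concrete, here is a small JavaScript sketch contrasting the two behaviours. Both function names are invented for the example, and the sentence-start pattern is a simplifying assumption (start of text, or one of . ! ? followed by whitespace).

```javascript
// Uppercase only the first letter of each sentence; leave everything else alone.
function capitalizeOnly(text) {
  return text.replace(/(^\s*|[.!?]\s+)(\p{Ll})/gu,
    (match, lead, ch) => lead + ch.toUpperCase());
}

// Uppercase the first letter AND lowercase the rest of each sentence.
function capitalizeAndLower(text) {
  return text.replace(/(^\s*|[.!?]\s+)([^.!?])([^.!?]*)/g,
    (match, lead, first, rest) => lead + first.toUpperCase() + rest.toLowerCase());
}

const sample = "the CPU is fast. my friend Thomas lives in the UK.";
console.log(capitalizeOnly(sample));     // The CPU is fast. My friend Thomas lives in the UK.
console.log(capitalizeAndLower(sample)); // The cpu is fast. My friend thomas lives in the uk.
```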

bmose · April 1, 2023, 8:08am

chrillek, I agree completely. I posted the second version only to offer a possible solution to the question as it was asked.

chrillek · April 1, 2023, 3:08pm

My take on it in pure JavaScript (no JXA):

const testStr = ` this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. t.
u. this is a Sentence. this is a sentence.`;
(() => {
  const RE = new RegExp("(^\\s*|[.!?]\\s+)([^.]{2,})","gms")
  const result = testStr.replaceAll(RE, upperCase);
  console.log(result);
})()

function upperCase(match, p1,p2) {
	const uc = p2.substring(0,1).toLocaleUpperCase() + p2.substring(1);
	return `${p1}${uc}`
}

Output

 This is a sentence. This is a Sentence. This is a sentence.
This is a sentence. This is a Sentence. t.
u. This is a Sentence. This is a sentence.

That’s only an example, not necessarily the best one, to show how one can use a function in a call to replace/replaceAll to perform additional stuff. Here, it is uppercasing the first letter of the 2nd capturing group.

Unlike @bmose, I have to use a capturing group for the start of the string so I can output it again. I didn’t bother to lowercase capitalized words within the sentence, because that would potentially lead to too many mistakes, as mentioned here: Upper & Lower Case Sed Regex Problems - #18 by chrillek

bmose · April 1, 2023, 11:32pm

Thank you for the JavaScript demo. I don’t know JavaScript but get the overall gist of what the script is doing.

Incidentally, when modifying text via NSRegularExpression as in my two examples above, I generally process the match objects returned by NSRegularExpression’s matchesInString:options:range: in reverse order, so that any interim changes made to the NSMutableString object don’t invalidate the ranges of match objects that are yet to be processed. I didn’t do so above because every replacement there has the same length as the text it replaces, so the ranges stay valid. But generally I would recommend doing so. Thus, instead of:

set matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in matchObjs
	-- [make changes to the mutable string object (strObj) using the current match object (currMatchObj)]
end repeat

I would recommend the following:

set matchObjsReversed to ((regexObj's matchesInString:strObj options:0 range:strRange) as list)'s reverse
repeat with currMatchObj in matchObjsReversed
	-- [make changes to the mutable string object (strObj) using the current match object (currMatchObj)]
end repeat

chrillek · April 2, 2023, 8:10am

Out of curiosity: are these match objects references/pointers into the original string? In Perl/JavaScript (and I suppose other languages with RE support as well) they are simply independent copies, so you can do with them whatever you want without harming other matches (or the original string).

bmose · April 2, 2023, 9:17am

Match objects are NSTextCheckingResult objects, which contain only range information and result type (an enum whose value = 1024 = NSTextCheckingTypeRegularExpression type, signifying that the match is a regular expression match). The range information consists of locations and lengths of matching substrings within the string at which you can make changes to the string. If you make a change that alters the length of the matching substring, and the next match’s ranges are to the right of that change, then the latter ranges will no longer point to the correct locations in the string. By processing the matches in reverse order, any changes that you make to a substring will not affect subsequent matches, because they point to locations to the left of the changes made in the string.
The following example is similar to the previous ones, except that it deletes all characters after the first character of a sentence. Matches are processed in reverse order:

set selectedText to "this is A SENtence. THIS is A sentencE.
this is A SENtence. THIS is A sentencE."
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set matchObjs to ((regexObj's matchesInString:strObj options:0 range:strRange) as list)'s reverse -- REVERSE ORDER
repeat with currMatchObj in matchObjs
	set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
	set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
	(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
	(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:"") -- DELETES ALL BUT THE FIRST CHARACTER IN THE CURRENT SENTENCE
end repeat
set selectedText to strObj as text
-->
"T. T.
T. T."

If an attempt is made to process matches in forward order, a range out-of-bounds error occurs:

set selectedText to "this is A SENtence. THIS is A sentencE.
this is A SENtence. THIS is A sentencE."
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list -- FORWARD ORDER
repeat with currMatchObj in matchObjs
	set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
	set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
	(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
	(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:"") -- DELETES ALL BUT THE FIRST CHARACTER IN THE CURRENT SENTENCE
end repeat
set selectedText to strObj as text
-->
-- ERROR:  -[__NSCFString substringWithRange:]: Range {41, 17} out of bounds; string length 45
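
The same drift can be reproduced outside Foundation. Below is a minimal JavaScript sketch that records match ranges first and then edits the string by range, as the AppleScript above does. findRanges and replaceRanges are hypothetical helpers, not part of any API. One difference to keep in mind: JavaScript's slice silently clamps out-of-range indexes, so the forward pass produces a wrong result instead of raising an error the way NSString does.

```javascript
// Record {start, length} for every match up front, much like the range
// information carried by the match objects.
function findRanges(text, pattern) {
  return [...text.matchAll(pattern)].map((m) => ({ start: m.index, length: m[0].length }));
}

// Replace each recorded range with `replacement`, visiting ranges in the order given.
function replaceRanges(text, ranges, replacement) {
  let result = text;
  for (const { start, length } of ranges) {
    result = result.slice(0, start) + replacement + result.slice(start + length);
  }
  return result;
}

const text = "aaa bbb ccc";
const ranges = findRanges(text, /[a-z]+/g);

// Forward: the first edit shortens the string, so later ranges point at the wrong places.
console.log(replaceRanges(text, ranges, "X"));                // X bbXccX
// Reverse: each edit only shifts text to its right, which has already been processed.
console.log(replaceRanges(text, [...ranges].reverse(), "X")); // X X X
```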

chrillek · April 2, 2023, 10:19am

Thanks for the explanation. That’s indeed a very different concept from the ones I’m used to where the original string is not modified and the start/end values for the matches thus don’t change.

Hallenstal · April 4, 2023, 11:01am

Why not awk?

set teststr to "5 is a number. this is testing the code. it shoould 
capitalize both \".\" & newline and  \".\" and space. not to worry. we are testing.
a is character.  
\"a\" is lowercase:-)
. starting line with \".\""

log capitalize(teststr)

on capitalize(s as string)
	set awkprog to "'
BEGIN{
    p=1
}
{
    if(p) $0=toupper(substr($0,1,1)) substr($0,2)
    s=\"\"
    while(off=index($0, \". \")){
        s=s substr($0,1, off+1) toupper(substr($0,off+2,1))
        $0=substr($0,off+3)    
    }
    print s $0
    if(substr($NF,length($NF),1)==\".\") p=1
    else p=0
}'"
	return (do shell script "echo  " & quoted form of s & "| awk " & awkprog)
end capitalize