Find and replace 'n' consecutive all-caps words in string

applesource · March 28, 2025, 4:53pm

I have body text in documents where I need to identify and replace sections written in all caps.

I want to write a func that takes a string (str) and integer (threshold) and replaces consecutive all-caps words with lowercase when threshold is met (meaning ‘threshold’ number of consecutive words in all-caps).

I’m currently using this func which I found in a Macscripter thread for regex search/replace, which works great:

on replacePattern:thePattern inString:theString usingThis:theTemplate
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set theResult to theRegEx's stringByReplacingMatchesInString:theString options:0 range:{location:0, |length|:length of theString} withTemplate:theTemplate
	return theResult as text
end replacePattern:inString:usingThis:

I use BBEdit to test my regex expressions before coding them in Applescript.

But, I’m at a loss to construct a regex that can match a random number of consecutive all-caps words, let alone replace them with lowercase. This is despite Google searches on stackoverflow and elsewhere with code examples. Not sure if this is a problem with BBEdit’s version of grep, or the grep expression itself.

Any help here would be appreciated.

peavine · March 28, 2025, 7:36pm

applesource. I’m a bit confused by your request, so the following is FWIW. The regex pattern ([A-Z]{2,}) works on two or more consecutive uppercase letters in the range of A to Z. This can be modified to enforce word boundaries and to work with different letters (and probably languages) and numbers of letters. I don’t know anything about BBEdit, if that’s what you’re asking about.

The timing result with a string that contained 1863 words was 50 milliseconds. This assumes that the Foundation framework is in memory, which would normally be the case.

use framework "Foundation"
use scripting additions

set theString to "This is SOME teXT and some IS in Uppercase"
set editedString to getEditedString(theString) -->"This is some text and some is in Uppercase"

on getEditedString(theString)
	set editedString to current application's NSMutableString's stringWithString:theString
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"[A-Z]{2,}" options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:editedString options:0 range:{location:0, |length|:editedString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	repeat with aRange in theRanges
		set lowercaseSubstring to (editedString's substringWithRange:aRange)'s lowercaseString()
		(editedString's replaceCharactersInRange:aRange withString:lowercaseSubstring)
	end repeat
	return editedString as text
end getEditedString

VikingOSX · March 28, 2025, 8:07pm

And just as confused, here is a script that demotes n consecutive capital letters in a word, using the ICU regex flavour that Objective-C and Swift use but BBEdit does not.

Given the string “this is a stuTTering CAPitals exampLE.”

use framework "Foundation"
use scripting additions

property ca : current application

set s to ca's NSMutableString's stringWithString:"this is a stuTTering CAPitals exampLE."
set CapsInString to 2 as integer
set thePat to "[A-Z]{" & CapsInString & "}(?<![A-Z]{" & CapsInString + 1 & "})(?![A-Z])"
log (thePat) as text
display dialog (my replacePattern:thePat inString:s withConsecutiveCaps:CapsInString)
return


on replacePattern:thePattern inString:theString withConsecutiveCaps:n
	set s to ca's NSMutableString's stringWithString:theString
	set theRegEx to ca's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set srange to ca's NSMakeRange(0, theString's |length|())
	set matches to theRegEx's matchesInString:theString options:0 range:srange
	if (count of matches) > 0 then
		repeat with amatch in matches
			set matchedText to (s's substringWithRange:(amatch's range))
			log (count of (matchedText as text)) as integer
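			-- NSMutableString has no two-argument replaceOccurrencesOfString:withString:, so replace the matched range directly
			(s's replaceCharactersInRange:(amatch's rangeAtIndex:0) withString:(matchedText's localizedLowercaseString()))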
			
		end repeat
		return (s) as text
	else
		return "No string match"
	end if
end replacePattern:inString:withConsecutiveCaps:

Will display a string with two consecutive capitals replaced by lowercase.


applesource · March 28, 2025, 8:48pm

Thanks, will check out these examples.

But, as my OP says:

But, I’m at a loss to construct a regex that can match a random number of consecutive all-caps words,

So the ‘threshold’ var mentioned in my func denotes the minimum number of consecutive all-caps words required before all words in that group are converted to lowercase. I.e., if threshold = 10, find a group of at least 10 consecutive all-caps words before transforming.

paulskinner · March 28, 2025, 9:57pm

I love a puzzle. This is a puzzle where the puzzle itself has to be puzzled out.

I read this as "I want to lowercase ONLY CONSECUTIVE ALL-CAPS WORDS in a given string WHERE the number of CONSECUTIVE ALL-CAPS WORDS in the string is greater than a threshold integer value, leaving all other text unchanged, including ALL-CAPS words that do not reach the consecutive-all-caps-word threshold." Whew!

Un-Optimized code…

--3/28/25 https://www.macscripter.net/u/paulskinner/
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set theString to "This SENTENCE has THREE all caps WORDS IN A row"
set theThreshold to 3
Lowercase_Strings_Whose_All_Caps_Words_Exceed_Count(theString, theThreshold)



on Lowercase_Strings_Whose_All_Caps_Words_Exceed_Count(theString, theThreshold)
	set wordsToLowercase to {}
	set ConsecutiveAllCapsWordCount to 0
	set wordsOfTheString to words of theString
	--splitting on any lowercase letter leaves an all-caps word as a single, unchanged text item
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to characters of "abcdefghijklmnopqrstuvwxyz"
	repeat with i from 1 to length of wordsOfTheString
		set currentWord to (item i of wordsOfTheString) as text
		considering case
			set testWord to item 1 of (text items of currentWord)
		end considering
		if testWord is currentWord then
			set ConsecutiveAllCapsWordCount to ConsecutiveAllCapsWordCount + 1
		else
			set ConsecutiveAllCapsWordCount to 0
		end if
		if ConsecutiveAllCapsWordCount = theThreshold then
			--the run has just reached the threshold: mark its last theThreshold words
			repeat with x from theThreshold to 1 by -1
				set the end of wordsToLowercase to (-1 * (x - i)) + 1
			end repeat
		else if ConsecutiveAllCapsWordCount > theThreshold then
			--the run continues past the threshold: mark this word too
			set the end of wordsToLowercase to i
		end if
	end repeat
	--rebuild the string from its words (original spacing and punctuation are not preserved)
	set outputWords to {}
	repeat with i from 1 to length of wordsOfTheString
		set currentWord to (item i of wordsOfTheString) as text
		if wordsToLowercase contains i then
			set the end of outputWords to String_Case_Lower(currentWord)
		else
			set the end of outputWords to currentWord
		end if
	end repeat
	set AppleScript's text item delimiters to space
	set theOutputString to outputWords as text
	--restore the delimiters so later code is not affected
	set AppleScript's text item delimiters to savedDelimiters
	return theOutputString
end Lowercase_Strings_Whose_All_Caps_Words_Exceed_Count

on String_Case_Lower(sourceText)
	try
		return (current application's NSString's stringWithString:sourceText)'s lowercaseString() as text
	on error errorText number errornumber partial result errorResults from errorObject to errorExpectedType
		error "<String_Case_Lower>" & errorText number errornumber partial result errorResults from errorObject to errorExpectedType
	end try
end String_Case_Lower

I’ll leave my first-draft handler name as-is. Initially I assumed the entire string should be lowercased if the threshold was hit, then I re-read the post and concluded differently.

peavine · March 28, 2025, 10:10pm

This is my implementation of Paul’s understanding. The regex pattern may need refinement depending on applesource’s exact requirements.

use framework "Foundation"
use scripting additions

set theString to "This SENTENCE has FOUR all caps WORDS IN A ROW."
set editedString to getEditedString(theString, 4) -->"This SENTENCE has FOUR all caps words in a row."

on getEditedString(theString, thresholdCount)
	set theString to current application's NSMutableString's stringWithString:theString
	set thePattern to "(?:\\b[A-Z]+\\b[\\h,.]){" & thresholdCount & ",}"
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	repeat with aRange in theRanges
		set aSubstring to (theString's substringWithRange:aRange)'s lowercaseString()
		(theString's replaceCharactersInRange:aRange withString:aSubstring)
	end repeat
	return theString as text
end getEditedString

Nigel_Garvey · March 30, 2025, 9:54am

Sorry to be late to the party. I’ve been away for a couple of days.

As peavine’s hinted, it’s not possible to come up with a definitive pattern without knowing what’s meant by “consecutive all-caps words” — ie. what’s allowed between the “words” in the way of white space and punctuation.

Just for fun, the following modification of peavine’s pattern also works with diacritical characters and where the last matching word is at the very end of the string. All non-word characters between words are treated as word separators except for “@”, which (in this exercise) is effectively treated as a lower-case character.

set thePattern to "(?:(?<!@)\\b\\p{Upper}++\\b(?!@)[^\\w@]*+){" & thresholdCount & ",}"

applesource · March 30, 2025, 4:46pm

Yeah, I should have provided a concrete example of the body text BEFORE and AFTER transformation, along with # of all-caps words required.

applesource · March 30, 2025, 4:55pm

Thank you to everyone who contributed to this thread.

I could have written my own purely iterative code, or hybrid code with a partial regex solution, but…

Nigel’s code is exactly what I was looking for: a regex one-liner. Works great in BBEdit (haven’t coded it in AS yet). Though I’ll need to modify it to handle a common habit of the authors of these archived documents, which go back to the ’70s: putting a random number of ‘.’ (2-5) between UC words to represent a pseudo-ellipsis.

It’s been a long time since I 1st learned regex (in my preferred scripting language at the time-Perl) running on 68k Macs

Glad to have the assistance.

applesource · March 30, 2025, 5:15pm

One more thing…

The only thing that should halt a match, once the minimum # of all-caps words is found, is ANY lower-case letter (a-z). So any non-letter characters between words should also be captured, up until a lower-case letter is found.

I’m going to attempt to modify Nigel’s regex myself to accomplish this.

Nigel_Garvey · March 31, 2025, 10:47am

Be aware that BBEdit uses a “PCRE-based grep engine” (according to its manual) whereas the “Foundation” framework’s NSRegularExpression class in ASObjC uses the ICU flavour. They seem to be mostly the same, but there are one or two minor differences. What works in BBEdit may not necessarily work in ASObjC, and vice versa.

peavine · March 31, 2025, 1:03pm

Nigel. I noticed a few minor items in your regex pattern that might be attributable to the above and wondered if that was the case. First, I couldn’t find an Upper Unicode category in the ICU or Unicode documentation, although your regex pattern seemed to work fine in a script. Secondly, the ICU documentation showed the Unicode category as a character class (e.g. [\p{Letter}]). However, once again, your regex pattern seemed to work fine in a script.

Anyways, just for practice, I rewrote my earlier regex pattern to include a few of your suggestions and to meet applesource’s latest request. I used the ASObjC script that I posted above as a test vehicle (I don’t have BBEdit).

use framework "Foundation"
use scripting additions

set theString to "There are FOUR all caps WORDS - IN ; A : ROW FOllowED by a period."

set newString to getNewString(theString, 4) -->"There are FOUR all caps words - in ; a : row followED by a period."

on getNewString(theString, thresholdCount)
	set theString to current application's NSMutableString's stringWithString:theString
	set thePattern to "(?:\\b\\p{Uppercase}+\\b\\W+){" & thresholdCount & ",}.*?\\p{Lowercase}"
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	repeat with aRange in theRanges
		set aSubstring to (theString's substringWithRange:aRange)'s lowercaseString()
		(theString's replaceCharactersInRange:aRange withString:aSubstring)
	end repeat
	return theString as text
end getNewString

BTW, there’s a question of what should happen if the minimum number of all-caps words is not followed by a lowercase letter (that is, the run continues to the end of the string). In this situation, my script does not lowercase those all-caps words. This is easily changed.

Nigel_Garvey · March 31, 2025, 3:24pm

Hi peavine.

You’ll see I edited my modification of your pattern a couple of times after posting it.

I originally used the Unix-like set [[:upper:]], with which I was familiar from shell scripting. Then, since this sort of thing isn’t mentioned in the Apple documentation (although it is in the Unicode blurb), I looked to see if there was a \p{} equivalent and found the Uppercase category you’ve used above. Later, I experimented to see if any abbreviations worked and it turned out that Upper does, so I shortened it to this.

I thought you’d queried my use of [^\w] instead of \W, but I can’t see it now. Maybe you edited it out of your post. But anyway, in this respect, I was experimenting with not ignoring a particular non-word character — in this case “@” — in or between upper-case words. My first attempt was [\W&&[^@]], meaning a character that’s both a non-word character and not “@”. Then I realised it would be simpler and shorter to use [^\w@], ie. a character that’s neither a word character nor “@”. The logic’s exactly the same either way. I don’t know which executes faster. Probably the latter. Also, the former doesn’t work in BBEdit.
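One way to check that the two character classes behave the same under NSRegularExpression is to count their matches in a string that contains “@”. The following is only a sketch: countMatches and the test string are made up for illustration, and it assumes ICU accepts the && intersection syntax as described above.

```applescript
use framework "Foundation"
use scripting additions

-- countMatches is a hypothetical helper: it returns how many times thePattern matches in theString
on countMatches(thePattern, theString)
	set theString to current application's NSString's stringWithString:theString
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	return (theRegEx's numberOfMatchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}) as integer
end countMatches

set testString to "a@b, c!"
-- If the classes are equivalent, both counts are the same: the comma, the space and the "!" match, the "@" does not
{countMatches("[\\W&&[^@]]", testString), countMatches("[^\\w@]", testString)}
```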

peavine · March 31, 2025, 3:34pm

Nigel. Thanks for the information.

I did raise the issue quoted above, but I then realized you were negating two characters in the character class, so I removed that part of my post. Also, upon rechecking, \p is shown in the ICU documentation as a metacharacter, so I was just wrong about the possible need to include it in a character class.

Nigel_Garvey · April 1, 2025, 12:49pm

It turned out that applesource actually wanted a regex pattern, but I thought I’d have a go at a vanilla-ish AppleScript solution too. Like Paul’s it contains a single line of ASObjC code, but here it’s used to create a list of capitalised versions of all the words (used for the comparisons) and a lower-cased version of the entire string (used as a source of replacement text).

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

set theString to "This SENTENCE has THREE all caps WORDS IN A row. This SENTENCE HAS THREE all CAPS WORDS in a ROW."
set threshold to 3
set theString to my stringByLowercasingSubstringsOf:theString whoseMinimumAllCapsWordCountIs:threshold

on stringByLowercasingSubstringsOf:theString whoseMinimumAllCapsWordCountIs:threshold
	set theWords to theString's words
	tell (current application's NSString's stringWithString:(theString)) to set {upperWords, lowerString} to {(its uppercaseString() as text)'s words, its lowercaseString() as text}
	set i to 1
	considering case
		repeat with j from 1 to (count theWords)
			if (theWords's item j ≠ upperWords's item j) then
				if (j - i ≥ threshold) then set theString to replaceText(theString, theString's text from word i to word (j - 1), lowerString's text from word i to word (j - 1))
				set i to j + 1
			end if
		end repeat
		if (j + 1 - i ≥ threshold) then set theString to replaceText(theString, theString's text from word i to word j, lowerString's text from word i to word j)
	end considering
	return theString
end stringByLowercasingSubstringsOf:whoseMinimumAllCapsWordCountIs:

on replaceText(txt, find, replace)
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to find
	set tis to txt's text items
	set AppleScript's text item delimiters to replace
	set txt to tis as text
	set AppleScript's text item delimiters to astid
	return txt
end replaceText

peavine · April 1, 2025, 3:08pm

I frequently use a shortcut to test regex patterns, and it caused me to find an error (although of no impact) in my regex pattern that complies with applesource’s updated request. The error occurred because the pattern included the lowercase letter at the end of the match. The solution was a positive lookahead.
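The revised pattern itself is only in the attached shortcut, so the following is a guess at what a lookahead version might look like, not peavine’s actual fix. The handler name getNewString2 is made up, and the \z alternative is an added assumption that also covers the case where the all-caps run reaches the end of the string. It assumes single-line input, since “.” does not match line breaks by default.

```applescript
use framework "Foundation"
use scripting additions

-- getNewString2 is a hypothetical name; only the pattern differs from the getNewString handler posted earlier
on getNewString2(theString, thresholdCount)
	set theString to current application's NSMutableString's stringWithString:theString
	-- The lookahead ends the match just before the first lowercase letter, or at the end of the string (\z)
	set thePattern to "(?:\\b\\p{Uppercase}+\\b\\W+){" & thresholdCount & ",}.*?(?=\\p{Lowercase}|\\z)"
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	repeat with aRange in theRanges
		set aSubstring to (theString's substringWithRange:aRange)'s lowercaseString()
		(theString's replaceCharactersInRange:aRange withString:aSubstring)
	end repeat
	return theString as text
end getNewString2

getNewString2("The last FOUR WORDS ARE CAPS.", 4) -->"The last four words are caps."
```

Because the last word still has to be followed by at least one non-word character (\W+), a run that ends the string without any trailing punctuation would need that part of the pattern loosened as well.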

Regex Pattern Tester.shortcut (21.9 KB)

peavine · April 1, 2025, 9:37pm

Just for practice, I had earlier written a shortcut solution, but it didn’t work reliably in every circumstance. I found a fix for that, and the revised shortcut is included below. It’s called as a Quick Action and works on text selected in just about any app.

Lowercase Selected Text.shortcut (22.6 KB)

applesource · April 2, 2025, 4:47am

Anyone know if Obj C supports the control string \L, which changes the match it’s applied to to lowercase?

Using the replacePattern func from my OP, I should be able to find and replace uppercase words matching the pattern in one line, like this

set allCapsPattern to "((?:(?<!@)\\b\\p{Upper}++\\b(?!@)[^\\w@]*+){10,})"

set newText to my (replacePattern:allCapsPattern inString:str usingThis:"\\L$1")

Instead of transforming the text, it adds an ‘L’ to the beginning of the string.

This method works fine in BBEdit; what’s the trick in Obj C?

Nigel_Garvey · April 2, 2025, 8:55am

It doesn’t. But there’s an alternative to your handler here which allows case-change notation to be used. The template format in this case is “$L1” rather than “\\L$1”. However, your example above has the entire pattern as a capture group. You could instead omit the outer parentheses and use the replacement template “$L0”.
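Since the template cannot change case, another option is to lowercase each match in the script itself, as peavine’s handlers earlier in the thread do. This is a minimal sketch of that idea and not the linked handler: the handler name is made up, it always lowercases the whole match (the equivalent of $0), and the threshold of 3 is only for the example.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical handler: lowercases every whole match of thePattern instead of relying on a template
on lowercaseMatchesOfPattern:thePattern inString:theString
	set mutableString to current application's NSMutableString's stringWithString:theString
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set theMatches to theRegEx's matchesInString:mutableString options:0 range:{location:0, |length|:mutableString's |length|()}
	-- Work from the last match back to the first, so a change in length cannot shift ranges still to be processed
	repeat with i from (theMatches's |count|()) to 1 by -1
		set aRange to ((theMatches's objectAtIndex:(i - 1))'s rangeAtIndex:0)
		set lowercased to (mutableString's substringWithRange:aRange)'s lowercaseString()
		(mutableString's replaceCharactersInRange:aRange withString:lowercased)
	end repeat
	return mutableString as text
end lowercaseMatchesOfPattern:inString:

set allCapsPattern to "(?:(?<!@)\\b\\p{Upper}++\\b(?!@)[^\\w@]*+){3,}"
my lowercaseMatchesOfPattern:allCapsPattern inString:"Only THESE THREE WORDS change, not THIS one."
-->"Only these three words change, not THIS one."
```

No capture group is needed here, so the outer parentheses from the pattern above are omitted.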

applesource · April 2, 2025, 9:57pm

Nice, thanks.

I get a twofer here as I’ve been trying to find decent Title Case code.

Got it.