Find and replace 'n' consecutive all-caps words in string

I have body text in documents where I need to identify and replace sections with all-caps.

I want to write a func that takes a string (str) and integer (threshold) and replaces consecutive all-caps words with lowercase when threshold is met (meaning ‘threshold’ number of consecutive words in all-caps).

I’m currently using this func which I found in a Macscripter thread for regex search/replace, which works great:

on replacePattern:thePattern inString:theString usingThis:theTemplate
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set theResult to theRegEx's stringByReplacingMatchesInString:theString options:0 range:{location:0, |length|:length of theString} withTemplate:theTemplate
	return theResult as text
end replacePattern:inString:usingThis:

I use BBEdit to test my regex expressions before coding them in Applescript.

But, I’m at a loss to construct a regex that can match a random number of consecutive all-caps words, let alone replace them with lowercase. This is despite Google searches on stackoverflow and elsewhere with code examples. Not sure if this is a problem with BBEdit’s version of grep, or the grep expression itself.

Any help here would be appreciated.

applesource. I’m a bit confused by your request, so the following is FWIW. The regex pattern ([A-Z]{2,}) works on two or more consecutive uppercase letters in the range of A to Z. This can be modified to enforce word boundaries and to work with different letters (and probably languages) and numbers of letters. I don’t know anything about BBEdit, if that’s what you’re asking about.

The timing result with a string that contained 1863 words was 50 milliseconds. This assumes that the Foundation framework is in memory, which would normally be the case.

use framework "Foundation"
use scripting additions

set theString to "This is SOME teXT and some IS in Uppercase"
set editedString to getEditedString(theString) -->"This is some text and some is in Uppercase"

on getEditedString(theString)
	set editedString to current application's NSMutableString's stringWithString:theString
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"[A-Z]{2,}" options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:editedString options:0 range:{location:0, |length|:editedString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	repeat with aRange in theRanges
		set lowercaseSubstring to (editedString's substringWithRange:aRange)'s lowercaseString()
		(editedString's replaceCharactersInRange:aRange withString:lowercaseSubstring)
	end repeat
	return editedString as text
end getEditedString
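
As a rough illustration of the word-boundary modification mentioned above, here is a minimal sketch. The handler name lowercaseWholeCapsWords is made up for this example, and the pattern assumes plain A-Z letters only. Wrapping the pattern in \b means that only whole words of two or more capitals match, so the "XT" in "teXT" is left alone.

```applescript
use framework "Foundation"
use scripting additions

set theString to "This is SOME teXT and some IS in Uppercase"
set editedString to lowercaseWholeCapsWords(theString) -->"This is some teXT and some is in Uppercase"

-- Hypothetical handler name; same approach as getEditedString above, with \b added to the pattern
on lowercaseWholeCapsWords(theString)
	set editedString to current application's NSMutableString's stringWithString:theString
	-- \b on both sides restricts matches to whole words made of two or more capitals
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"\\b[A-Z]{2,}\\b" options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:editedString options:0 range:{location:0, |length|:editedString's |length|()}
	repeat with aMatch in regexResults
		set aRange to aMatch's range()
		-- Lowercasing A-Z does not change the length, so later ranges stay valid
		set lowercaseSubstring to (editedString's substringWithRange:aRange)'s lowercaseString()
		(editedString's replaceCharactersInRange:aRange withString:lowercaseSubstring)
	end repeat
	return editedString as text
end lowercaseWholeCapsWords
```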

Just as confused, here is a script that demotes n consecutive capital letters in a word, using the ICU regex flavor that Objective-C and Swift use but BBEdit does not.

The test string is “this is a stuTTering CAPitals exampLE.”

use framework "Foundation"
use scripting additions

property ca : current application

set s to ca's NSMutableString's stringWithString:"this is a stuTTering CAPitals exampLE."
set CapsInString to 2 as integer
set thePat to "[A-Z]{" & CapsInString & "}(?<![A-Z]{" & CapsInString + 1 & "})(?![A-Z])"
log (thePat) as text
display dialog (my replacePattern:thePat inString:s withConsecutiveCaps:CapsInString)
return


on replacePattern:thePattern inString:theString withConsecutiveCaps:n
	set s to ca's NSMutableString's stringWithString:theString
	set theRegEx to ca's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set srange to ca's NSMakeRange(0, s's |length|())
	set matches to theRegEx's matchesInString:s options:0 range:srange
	if (count of matches) > 0 then
		repeat with amatch in matches
			-- Lowercase only the matched range. Replacing every occurrence of the matched
			-- text would also alter identical letter pairs inside longer runs of capitals.
			set matchRange to amatch's range()
			set matchedText to (s's substringWithRange:matchRange)
			(s's replaceCharactersInRange:matchRange withString:(matchedText's localizedLowercaseString()))
		end repeat
		return (s) as text
	else
		return "No string match"
	end if
end replacePattern:inString:withConsecutiveCaps:

This will display the string with each run of exactly two consecutive capitals replaced by lowercase.


Thanks, will check out these examples.

But, as my OP says:

But, I’m at a loss to construct a regex that can match a random number of consecutive all-caps words,

So the ‘threshold’ var mentioned in my func denotes the minimum number of consecutive all-caps words required before all words in that group are converted to lowercase. E.g., if threshold = 10:

  • find a group of at least 10 consecutive all-caps words before transforming.

I love a puzzle. This is a puzzle where the puzzle itself has to be puzzled out.

I read this as: "I want to lowercase ONLY CONSECUTIVE ALL-CAPS WORDS in a given string WHERE the number of CONSECUTIVE ALL-CAPS WORDS in the string reaches a threshold integer value, leaving all other text unchanged, including ALL-CAPS words that do not reach the consecutive-all-caps-word threshold." Whew!

Un-Optimized code…

--3/28/25 https://www.macscripter.net/u/paulskinner/
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set theString to "This SENTENCE has THREE all caps WORDS IN A row"
set theThreshold to 3
Lowercase_Strings_Whose_All_Caps_Words_Exceed_Count(theString, theThreshold)



on Lowercase_Strings_Whose_All_Caps_Words_Exceed_Count(theString, theThreshold)
	set wordsToLowercase to {}
	set ConsecutiveAllCapsWordCount to 0
	set wordsOfTheString to words of theString
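	--save the delimiters so they can be restored after the scan
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to characters of "abcdefghijklmnopqrstuvwxyz"
	repeat with i from 1 to length of wordsOfTheString
		set currentWord to (item i of wordsOfTheString) as text
		considering case
			--a word with no lowercase letters has a single text item equal to itself
			set testWord to item 1 of (text items of currentWord)
		end considering
		if testWord is currentWord then
			set ConsecutiveAllCapsWordCount to ConsecutiveAllCapsWordCount + 1
		else
			set ConsecutiveAllCapsWordCount to 0
		end if
		if ConsecutiveAllCapsWordCount = theThreshold then
			--mark the last theThreshold words for lowercasing
			repeat with x from theThreshold to 1 by -1
				set the end of wordsToLowercase to i - x + 1
			end repeat
		else if ConsecutiveAllCapsWordCount > theThreshold then
			--the run continues past the threshold, so mark this word too
			set the end of wordsToLowercase to i
		end if
	end repeat
	set AppleScript's text item delimiters to savedDelimiters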
	set theOutputString to ""
	repeat with i from 1 to length of wordsOfTheString
		set currentWord to (item i of wordsOfTheString) as text
		if wordsToLowercase contains i then
			set theOutputString to theOutputString & " " & String_Case_Lower(currentWord)
		else
			set theOutputString to theOutputString & " " & currentWord
		end if
	end repeat
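	--drop the leading space added by the loop above
	if theOutputString is not "" then set theOutputString to text 2 thru -1 of theOutputString
	return theOutputString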
end Lowercase_Strings_Whose_All_Caps_Words_Exceed_Count

on String_Case_Lower(sourceText)
	try
		return (current application's NSString's stringWithString:sourceText)'s lowercaseString() as text
	on error errorText number errornumber partial result errorResults from errorObject to errorExpectedType
		error "<String_Case_Lower>" & errorText number errornumber partial result errorResults from errorObject to errorExpectedType
	end try
end String_Case_Lower

I’ll leave my first-draft handler name as-is. Initially I assumed the entire string should be lowercased if the threshold was hit; then I re-read the post and concluded differently.

This is my implementation of Paul’s understanding. The regex pattern may need refinement depending on applesource’s exact requirements.

use framework "Foundation"
use scripting additions

set theString to "This SENTENCE has FOUR all caps WORDS IN A ROW."
set editedString to getEditedString(theString, 4) -->"This SENTENCE has FOUR all caps words in a row."

on getEditedString(theString, thresholdCount)
	set theString to current application's NSMutableString's stringWithString:theString
	set thePattern to "(?:\\b[A-Z]+\\b[\\h,.]){" & thresholdCount & ",}"
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	repeat with aRange in theRanges
		set aSubstring to (theString's substringWithRange:aRange)'s lowercaseString()
		(theString's replaceCharactersInRange:aRange withString:aSubstring)
	end repeat
	return theString as text
end getEditedString

Sorry to be late to the party. I’ve been away for a couple of days.

As peavine’s hinted, it’s not possible to come up with a definitive pattern without knowing what’s meant by “consecutive all-caps words” — ie. what’s allowed between the “words” in the way of white space and punctuation.

Just for fun, the following modification of peavine’s pattern also works with diacritical characters and where the last matching word is at the very end of the string. All non-word characters between words are treated as word separators except for “@”, which (in this exercise) is effectively treated as a lower-case character. :slightly_smiling_face:

set thePattern to "(?:(?<!@)\\b\\p{Upper}++\\b(?!@)[^\\w@]*+){" & thresholdCount & ",}"

:laughing:

Yeah, I should have provided a concrete example of the body text BEFORE and AFTER transformation, along with # of all-caps words required.

Thank you to everyone who contributed to this thread.

I could have written my own purely iterative code, or hybrid code with a partial regex solution, but…

Nigel’s code is exactly what I was looking for: a regex one-liner. Works great in BBEdit (haven’t coded it in AS yet). Though I’ll need to modify it to handle a common habit of the authors of these archived documents, which go back to the ’70s: putting a random number of ‘.’ characters (2-5) between UC words to represent a pseudo-ellipsis :roll_eyes:

It’s been a long time since I 1st learned regex (in my preferred scripting language at the time-Perl) running on 68k Macs :grimacing:

Glad to have the assistance.

One more thing…

The only thing that should halt a match, once the minimum # of all-caps words is found, is ANY lower-case letter (a-z). So, any non-letter characters between words should also be captured up until a lower-case letter is found.

I’m going to attempt to modify Nigel’s regex myself to accomplish this.

Be aware that BBEdit uses a “PCRE-based grep engine” (according to its manual) whereas the “Foundation” framework’s NSRegularExpression class in ASObjC uses the ICU flavour. They seem to be mostly the same, but there are one or two minor differences. What works in BBEdit may not necessarily work in ASObjC, and vice versa.

Nigel. I noticed a few minor items in your regex pattern that might be attributable to the above and wondered if that was the case. First, I couldn’t find an Upper Unicode category in the ICU or Unicode documentation, although your regex pattern seemed to work fine in a script. Secondly, the ICU documentation showed the Unicode category as a character class (e.g. [\p{Letter}]). However, once again, your regex pattern seemed to work fine in a script.

Anyway, just for practice, I rewrote my earlier regex pattern to include a few of your suggestions and to meet applesource’s latest request. I used the ASObjC script that I posted above as a test vehicle (I don’t have BBEdit).

use framework "Foundation"
use scripting additions

set theString to "There are FOUR all caps WORDS - IN ; A : ROW FOllowED by a period."

set newString to getNewString(theString, 4) -->"There are FOUR all caps words - in ; a : row followED by a period."

on getNewString(theString, thresholdCount)
	set theString to current application's NSMutableString's stringWithString:theString
	set thePattern to "(?:\\b[\\p{Uppercase}]+\\b\\W+){" & thresholdCount & ",}.*?[\\p{Lowercase}]"
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set theRanges to (regexResults's valueForKey:"range")
	repeat with aRange in theRanges
		set aSubstring to (theString's substringWithRange:aRange)'s lowercaseString()
		(theString's replaceCharactersInRange:aRange withString:aSubstring)
	end repeat
	return theString as text
end getNewString

BTW, there’s a question of what should happen if the minimum number of all-caps words is not followed by a lowercase letter (at the end of the string). In this situation, my script does not lowercase the immediately preceding minimum number of all-caps words. This is easily changed.
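
One possible way to make that change, offered only as a sketch: the handler name lowercaseCapsRuns and the test string are made up for this example. Using \W* in place of \W+ lets the last all-caps word sit at the very end of the string, and making the trailing part optional means the run no longer has to be followed by a lowercase letter.

```applescript
use framework "Foundation"
use scripting additions

set theString to "The LAST FOUR WORDS ARE CAPS"
set newString to lowercaseCapsRuns(theString, 4) -->"The last four words are caps"

-- Hypothetical variant of getNewString that also matches a run ending the string
on lowercaseCapsRuns(theString, thresholdCount)
	set theString to current application's NSMutableString's stringWithString:theString
	-- \W* (not \W+) allows the final word to end the string; the optional tail
	-- still extends the match to the first lowercase letter when there is one
	set thePattern to "(?:\\b\\p{Uppercase}+\\b\\W*){" & thresholdCount & ",}(?:.*?\\p{Lowercase})?"
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegEx's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	repeat with aMatch in regexResults
		set aRange to aMatch's range()
		set aSubstring to (theString's substringWithRange:aRange)'s lowercaseString()
		(theString's replaceCharactersInRange:aRange withString:aSubstring)
	end repeat
	return theString as text
end lowercaseCapsRuns
```

With the test string from the script above and a threshold of 4, this variant should lowercase the same span as the original pattern, since that run is followed by a lowercase letter anyway.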

Hi peavine.

You’ll see I edited my modification of your pattern a couple of times after posting it. :wink:

I originally used the Unix-like set [[:upper:]], with which I was familiar from shell scripting. Then, since this sort of thing isn’t mentioned in the Apple documentation (although it is in the Unicode blurb), I looked to see if there was a \p{} equivalent and found the Uppercase category you’ve used above. Later, I experimented to see if any abbreviations worked and it turned out that Upper does, so I shortened it to this.

I thought you’d queried my use of [^\w] instead of \W, but I can’t see it now. Maybe you edited it out of your post. But anyway, in this respect, I was experimenting with not ignoring a particular non-word character — in this case “@” — in or between upper-case words. My first attempt was [\W&&[^@]], meaning a character that’s both a non-word character and not “@”. Then I realised it would be simpler and shorter to use [^\w@], ie. a character that’s neither a word character nor “@”. The logic’s exactly the same either way. I don’t know which executes faster. Probably the latter. Also, the former doesn’t work in BBEdit.
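
For anyone who wants to check these equivalences on their own system, here is a small sketch that counts matches for each spelling with NSRegularExpression. The sample text and the handler name countMatches are made up for this example. If the spellings really are equivalent in ICU, the counts within each list should agree. A pattern that fails to compile returns missing value instead of a count. As far as I can tell, Upper is listed as the short alias of the Uppercase property in the Unicode property aliases, which would explain why the abbreviation works.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical sample text mixing cases, diacriticals, punctuation and "@"
set sampleText to "ÉCOLE Été ÀB cd, E@F - G!"

-- Three spellings of "an upper-case character"
set upperCounts to {}
repeat with aPattern in {"[[:upper:]]", "\\p{Uppercase}", "\\p{Upper}"}
	set end of upperCounts to countMatches(aPattern as text, sampleText)
end repeat

-- Two spellings of "a non-word character other than @"
set separatorCounts to {}
repeat with aPattern in {"[\\W&&[^@]]", "[^\\w@]"}
	set end of separatorCounts to countMatches(aPattern as text, sampleText)
end repeat

return {upperCounts, separatorCounts}

on countMatches(thePattern, theText)
	set theString to current application's NSString's stringWithString:theText
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	-- A pattern the engine rejects yields missing value rather than a regex object
	if theRegEx is missing value then return missing value
	return (theRegEx's numberOfMatchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}) as integer
end countMatches
```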


Nigel. Thanks for the information.

I did raise the issue quoted above, but I then realized you were negating two characters in the character class, so I removed that part of my post. Also, upon rechecking, \p is shown in the ICU documentation as a metacharacter, so I was just wrong about the possible need to include it in a character class.