AppleScript to delete any non-highlighted word in a Word doc

Your script took 2 minutes on a 5087 word doc and returned 380 words.

Robert’s took 19 seconds and returned 446 words.

The document actually has 480 words highlighted in bright green, BUT

  1. There are many lines of dialogue that start with a dash + space.
    Once those are removed, the count drops to 406.

Note: The texts I work on are in French, where exclamation and question marks are preceded by a non-breaking space, so I think the script still counts those as words, which is fine by me. I use the word counts for billing (I’m a translator) and do want that type of punctuation included, as it is part of the translation.

So for my needs, for this particular document, 406 should be the count. Technically, the actual word count (without “!” and “?”) is 379.

Going back to that alternate approach I mentioned, I think deleting all non-highlighted text (or alternatively all text not highlighted in a specific color, if not too intensive) would make for a less complicated script, wouldn’t it?

You can edit my script to add more specific 1-character words to be skipped

Yep. In a small sample text, the count is accurate, but on the large one from earlier, there was a discrepancy. My guess is that the large text (not written by me) used a different dash (minus) character than the one listed in the script. There might be other characters I didn’t take into account. I’ll do more testing 🙂
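The dash problem described above could be handled by skipping several dash variants instead of a single one. What follows is only a minimal sketch, assuming that a "word" is any whitespace-separated token; the handler name `countBillableWords` and the list of dash characters are hypothetical and are not taken from the script under discussion. Punctuation set off by a non-breaking space still counts as a word, which matches the billing use case.

```applescript
-- Hypothetical helper: counts whitespace-separated tokens, skipping lone dashes.
-- Assumption: a "word" is anything between separators, including "!" and "?".
on countBillableWords(theText)
	-- Separators: space, non-breaking space (id 160), tab, return, linefeed
	set separators to {space, character id 160, tab, return, linefeed}
	-- Dash variants that may open a line of dialogue:
	-- hyphen-minus, en dash (8211), em dash (8212), minus sign (8722)
	set dashCharacters to {"-", character id 8211, character id 8212, character id 8722}
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to separators
	set tokens to text items of theText
	set AppleScript's text item delimiters to savedDelimiters
	set wordCount to 0
	repeat with aToken in tokens
		set tokenText to contents of aToken
		-- Ignore empty tokens (consecutive separators) and lone dashes
		if (tokenText is not "") and (tokenText is not in dashCharacters) then
			set wordCount to wordCount + 1
		end if
	end repeat
	return wordCount
end countBillableWords
```

Any other one-character token that should not be billed could be appended to `dashCharacters` in the same way.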

Just a quick follow-up to say there was a much easier solution than scripting for my alternate approach. I just recorded a macro in Word to delete all non-highlighted text (find and replace with a space) and assigned that macro a keyboard shortcut. Boom, instant deletion of all non-highlighted text on a 9000 word doc.

Processing RTF is easier and better than controlling Microsoft Word.
RTF is a child of Word: it is a subset of the Word document format.
We can export a Word file to RTF with the “save as” command and “format rtf”.

Now we can detect color domains, each of which accepts a range of color values.
The “red” domain does not mean only {65535,0,0}; {65535,10,10} or {65000,0,0} can also be picked up as red.

http://piyocast.com/as/wp-content/uploads/2018/09/Color-Domain.pdf
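As a rough illustration of the idea (not the method from the linked PDF), a color can be accepted as "red" when each channel lies within some tolerance of a reference color. The handler name `isNearColor` and the tolerance of 1000 used in the usage note are assumptions chosen for the example.

```applescript
-- Hypothetical helper: true when every channel of rgbColor is within
-- `tolerance` of the matching channel of referenceColor (0-65535 per channel).
on isNearColor(rgbColor, referenceColor, tolerance)
	repeat with i from 1 to 3
		set difference to (item i of rgbColor) - (item i of referenceColor)
		-- Absolute value, computed by hand
		if difference < 0 then set difference to -difference
		if difference > tolerance then return false
	end repeat
	return true
end isNearColor
```

With a tolerance of 1000, `isNearColor({65000, 0, 0}, {65535, 0, 0}, 1000)` accepts the color as red, while `isNearColor({0, 65535, 0}, {65535, 0, 0}, 1000)` rejects it. A real color-domain test would more likely work on hue ranges than on per-channel distances.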

Deleting the words in a given color domain is feasible. No one would hesitate if it were ordered as a paid job.

Going back to what I think the original question was, and assuming that the Word file is .docx and is saved somewhere writable, this will extract a list of the highlight colours and the words they are applied to from a 10,000-word Word file in about 4 seconds.

It unzips the .docx file and reads in the XML version of the text, then uses text item delimiters to isolate the highlights, sorts them into a list, deletes the temp folder containing the unzipped files and puts the list on the clipboard. I’m sure the XML could be analysed in a much cleverer way but this does seem to work!

It doesn’t count the words but that would be easy enough to add.

(With apologies to whomever I lifted the bubble sort routine from.)

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

tell application "Microsoft Word"
	set DocPath to (full name of document 1)
	set DocFolder to (path of document 1)
end tell

set WordSource to (quoted form of (POSIX path of DocPath))
set OpenedArchivePath to ((quoted form of (POSIX path of DocFolder)) & "/temp/")

do shell script ("unzip " & WordSource & " -d " & OpenedArchivePath)

delay 0.5

set DocumentText to (read file (DocFolder & ":temp:word:document.xml") from 1 to eof)

delay 0.5

tell application "System Events"
	try
		delete folder (DocFolder & ":temp")
	end try
end tell

set t to DocumentText

set l to {}
set FinalList to {}

set AppleScript's text item delimiters to {"<w:highlight w:val="}

set i to (text items 2 thru -1 of t)

set AppleScript's text item delimiters to {"</w:t>"}

repeat with v from 1 to count of items of i
	set end of l to (text item 1 of (item v of i))
end repeat

set AppleScript's text item delimiters to {"/></w:rPr><w:t>"}

repeat with p from 1 to count of items of l
	set NextHighlight to (item p of l)
	set end of FinalList to ((text item 1 of NextHighlight) & tab & (text item 2 of NextHighlight))
end repeat

set FinalList to (BubbleSort(FinalList))
set AppleScript's text item delimiters to {return}
set FinalList to every item of FinalList as text

set the clipboard to FinalList
display dialog FinalList buttons {"Cancel", "OK"} default button "OK"

on BubbleSort(theList)
	if class of theList is list then
		set theSize to length of theList
		repeat with i from 1 to theSize
			repeat with j from 2 to (theSize - i + 1)
				if ((item (j - 1) of theList) > (item j of theList)) then
					set temp to (item (j - 1) of theList)
					set (item (j - 1) of theList) to (item j of theList)
					set (item j of theList) to temp
				end if
			end repeat
		end repeat
		return theList
	else
		return false
	end if
end BubbleSort
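The word count mentioned as "easy enough to add" could be sketched roughly as follows. It assumes entries shaped like the ones built above, that is, a colour name, a tab, then the highlighted text; the handler name `countWordsPerColour` is hypothetical, and AppleScript's own `words` is used, so punctuation handling follows whatever AppleScript considers a word.

```applescript
-- Hypothetical helper: entries are "colour<tab>text" strings.
-- Returns a list of {colour, wordCount} pairs in order of first appearance.
on countWordsPerColour(entries)
	set totals to {}
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {tab}
	repeat with anEntry in entries
		set entryText to contents of anEntry
		set colourName to text item 1 of entryText
		set nWords to 0
		-- Entries without a tab carry no text, so they count as zero words
		if (count of text items of entryText) > 1 then
			set nWords to count of words of (text item 2 of entryText)
		end if
		-- Add to the running total for this colour, or start a new one
		set found to false
		repeat with aTotal in totals
			if item 1 of aTotal is colourName then
				set item 2 of aTotal to (item 2 of aTotal) + nWords
				set found to true
				exit repeat
			end if
		end repeat
		if not found then set end of totals to {colourName, nWords}
	end repeat
	set AppleScript's text item delimiters to savedDelimiters
	return totals
end countWordsPerColour
```

Called as `countWordsPerColour(FinalList)` before the list is flattened to text, it would give one total per highlight colour.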

I merged your script with mine to create a super-fast counter (near instantaneous).

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

property hColors : {"Auto", "Black", "Blue", "Turquoise", "Bright Green", "Pink", "Red", "Yellow", "White", "Dark Blue", "Teal", "Green", "Violet", "Dark Red", "Dark Yellow", "Gray 50", "Gray 25", "unknown"}

on run
	local DocPath, DocFolder, DocumentText, WordColor, nHighlightedWords
	set myColor to choose from list hColors with title "Windows Highlight Colors" with prompt "Please choose a highlight color..."
	if class of myColor is boolean then return -- user chose 'Cancel'
	set WordColor to item 1 of myColor
	set nHighlightedWords to 0
	tell application "Microsoft Word"
		set DocPath to (full name of document 1)
		set DocFolder to (path of document 1)
	end tell
	
	set WordSource to (quoted form of (POSIX path of DocPath))
	set OpenedArchivePath to ((quoted form of (POSIX path of DocFolder)) & "/temp/")
	
	do shell script ("unzip " & WordSource & " -d " & OpenedArchivePath)
	set DocumentText to (read file (DocFolder & ":temp:word:document.xml") from 1 to eof)
	tell application "System Events"
		try
			delete alias (DocFolder & ":temp:")
		end try
	end tell
	
	set text item delimiters to {"<w:highlight w:val=\"" & WordColor & "\""}
	set DocumentText to (rest of text items of DocumentText)
	set text item delimiters to {"</w:t>"}
	
	repeat with i from 1 to count DocumentText
		set item i of DocumentText to (text item 1 of (item i of DocumentText))
	end repeat
	
	set text item delimiters to {"<w:t xml:space=\"preserve\">", "<w:t>"} -- {"/></w:rPr><w:t xml:space=\"preserve\">", "/></w:rPr><w:t>"}
	repeat with NextHighlight in DocumentText
		set nHighlightedWords to nHighlightedWords + (count (words of text item 2 of NextHighlight))
	end repeat
	activate me
	display alert "# of words with highlight color \"" & myColor & "\" is " & nHighlightedWords giving up after 4
	return nHighlightedWords
end run

Here is an even better version that doesn’t create a temp folder

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

property hColors : {"Auto", "Black", "Blue", "Turquoise", "Bright Green", "Pink", "Red", "Yellow", "White", "Dark Blue", "Teal", "Green", "Violet", "Dark Red", "Dark Yellow", "Gray 50", "Gray 25", "unknown"}

on run
	local DocPath, DocumentText, WordSource, WordColor, nHighlightedWords
	set myColor to choose from list hColors with title "Windows Highlight Colors" with prompt "Please choose a highlight color..."
	if class of myColor is boolean then return -- user chose 'Cancel'
	set WordColor to item 1 of myColor
	set nHighlightedWords to 0
	tell application "Microsoft Word"
		set DocPath to (full name of document 1)
	end tell
	set WordSource to (quoted form of (POSIX path of DocPath))
	set DocumentText to do shell script ("unzip -p " & WordSource & " -x /word/document.xml")
	set text item delimiters to {"<w:highlight w:val=\"" & WordColor & "\""}
	set DocumentText to (rest of text items of DocumentText)
	set text item delimiters to {"<w:t xml:space=\"preserve\">", "<w:t>", "</w:t>"}
	repeat with NextHighlight in DocumentText
		try
			set nHighlightedWords to nHighlightedWords + (count (words of text item 2 of NextHighlight))
		end try
	end repeat
	activate me
	display alert "# of words with highlight color \"" & myColor & "\" is " & nHighlightedWords giving up after 4
	return nHighlightedWords
end run

Hi guys,

Thanks for still trying. Robert, I tried your last 2 scripts with a document that contained yellow, bright green and turquoise highlights. All returned a count of 0 except for yellow, which returned 65 in the first script and 56 in the second (the correct count is actually 51).

All the other colors correctly returned 0, except for green that generated the following script error:

Can’t get text item 2 of “/><w:lang w:val="fr-FR"/></w:rPr></w:pPr><w:r w:rsidRPr="002F48D2"><w:rPr><w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/><w:sz w:val="32"/>”.

Can I get a copy of that document?

Just sent you a message with a link.

Unzipping the document.xml file straight into a variable is much cleaner – I did think that might be possible but I couldn’t find the syntax.

I wonder if the XML of 80sTherapy’s real file is a bit more complex than our examples?
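The error quoted above suggests one likely complication: a `<w:highlight>` can sit in the paragraph-mark properties (`<w:pPr>`) with no text after it, and inside a run it can be followed by other properties such as `<w:lang>` before `</w:rPr>`. A more defensive sketch would work run by run instead of splitting on the highlight tag. The handler name `highlightedRuns` is hypothetical, and the sketch assumes that runs are not nested and that text lives only in plain `<w:t>` or `<w:t xml:space="preserve">` elements.

```applescript
-- Hypothetical helper: returns {{colour, text}, ...} for each run with a highlight.
on highlightedRuns(xmlText)
	set results to {}
	set savedDelimiters to AppleScript's text item delimiters
	considering case
		-- Split at run starts; "<w:r>" and "<w:r " do not match "<w:rPr>" or "<w:rFonts"
		set AppleScript's text item delimiters to {"<w:r>", "<w:r "}
		set runChunks to rest of text items of xmlText
		repeat with aChunk in runChunks
			-- Keep only the run itself, dropping whatever follows its closing tag
			set AppleScript's text item delimiters to {"</w:r>"}
			set runXML to text item 1 of (contents of aChunk)
			set AppleScript's text item delimiters to {"<w:highlight w:val=\""}
			set highlightParts to text items of runXML
			if (count of highlightParts) > 1 then
				-- The colour name runs up to the closing quote of w:val
				set AppleScript's text item delimiters to {"\""}
				set colourName to text item 1 of (item 2 of highlightParts)
				-- Collect the contents of every <w:t> element in the run
				set AppleScript's text item delimiters to {"<w:t>", "<w:t xml:space=\"preserve\">"}
				set textParts to rest of text items of runXML
				set AppleScript's text item delimiters to {"</w:t>"}
				set runText to ""
				repeat with aPart in textParts
					set runText to runText & (text item 1 of (contents of aPart))
				end repeat
				set end of results to {colourName, runText}
			end if
		end repeat
	end considering
	set AppleScript's text item delimiters to savedDelimiters
	return results
end highlightedRuns
```

Highlights that belong only to a paragraph mark are skipped because they never fall inside a run. XML entities such as `&amp;` are left undecoded in this sketch.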

Ignoring for the moment whether there are complications to the XML that are tripping us up, here is a version that makes use of robertfern’s unzip syntax then shows the user a dialog listing all the highlight colors found with a count of words and an option to save either the full list or just what’s in the dialog to the clipboard.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

tell application "Microsoft Word"
	set DocPath to (full name of document 1)
end tell
set WordSource to (quoted form of (POSIX path of DocPath))
set DocumentText to do shell script ("unzip -p " & WordSource & " -x /word/document.xml")

set t to DocumentText

set l to {}
set FinalList to {}

set AppleScript's text item delimiters to {"<w:highlight w:val="}
set i to (text items 2 thru -1 of t)
set AppleScript's text item delimiters to {"</w:t>"}
repeat with v from 1 to count of items of i
	set end of l to (text item 1 of (item v of i))
end repeat

set ColorsInUse to {}
repeat with p in l
	set AppleScript's text item delimiters to {"/></w:rPr><w:t>"}
	set NextHighlight to p
	set TI1 to (text item 1 of NextHighlight)
	set TI2 to (text item 2 of NextHighlight)
	set AppleScript's text item delimiters to {"\""}
	set TI1 to (text item 2 of TI1)
	if ColorsInUse does not contain TI1 then
		set end of ColorsInUse to TI1
	end if
	set end of FinalList to (TI1 & tab & TI2)
end repeat

set ColorsInUseWithWords to {}
repeat with f from 1 to count of items of ColorsInUse
	set s to (item f of ColorsInUse)
	set end of ColorsInUseWithWords to {s, {}, 0}
end repeat

set AppleScript's text item delimiters to {tab}
repeat with f from 1 to count of items of ColorsInUse
	set s to (item f of ColorsInUse)
	repeat with g in FinalList
		if s = ((text item 1) of g) then
			set end of item 2 of item f of ColorsInUseWithWords to ((text item 2) of g)
		end if
	end repeat
end repeat

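repeat with w in ColorsInUseWithWords
	-- Count the words in each highlighted run rather than the number of runs
	set n to 0
	repeat with r in (item 2 of w)
		set n to n + (count of words of (contents of r))
	end repeat
	set (item 3 of w) to n
end repeat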

set ReportText to ""
repeat with w in ColorsInUseWithWords
	set ReportText to ReportText & ((item 1 of w) & tab & ((item 3 of w) as text) & return)
end repeat

set AppleScript's text item delimiters to {" "}

set FullText to ""
repeat with w in ColorsInUseWithWords
	set FullText to FullText & ((item 1 of w) & tab & ((item 3 of w) as text) & return & (every item of (item 2 of w) as text) & return & return)
end repeat

set DialogAnswer to (display dialog (ReportText & return & "What do you want copied to the clipboard?") with title "Counts of Highlighted Words" buttons {"All", "Totals", "Nothing"} default button "Nothing" giving up after 5)

if button returned of DialogAnswer is "All" then
	set the clipboard to FullText
end if
if button returned of DialogAnswer is "Totals" then
	set the clipboard to ReportText
end if

OK, I see that your Word doc is in French. So that means your Word is a localized French version.
The colors are probably named differently and/or in a different language.
I noticed that the file, when opened on my US version, has most of its text in “Bright Green” highlight.
But as it is saved in your version, it is “green”.

Also, I edited my script above to use a tell block.

Hi Robert - I live in the US and my system language is set to English, and so is the keyboard input. The only thing in French is the spellchecker when I work on French documents. The Word highlight color names should be the same as yours. The document I sent you uses three: Bright Green, Turquoise and Yellow.

I tried your script, which now returns 331 for “Green” and still 0 for “Bright Green”. That is weird because the green I use is definitely the latter:

This is what Green looks like (not used in my doc)

But in the XML part of your saved file it is listed as “green”.

Weird. Does it do the same in your own test document, or does it say bright green?

Weird. So Bright Green gets saved as “green”.
and Green gets saved as “darkgreen”

Stupid Microsoft.
So the names in the actual saved XML don’t exactly match the color names in the SDEF file.

Here is a new version

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

property hColors : {"Auto", "Black", "Blue", "Turquoise", "Bright Green", "Pink", "Red", "Yellow", "White", "Dark Blue", "Teal", "Green", "Violet", "Dark Red", "Dark Yellow", "Gray 50", "Gray 25", "unknown"}
-- Values as written in the document XML; Pink is stored as "magenta" and Gray 50 as "darkGray"
property hColorConstants : {"Auto", "Black", "Blue", "Cyan", "green", "magenta", "Red", "Yellow", "White", "darkBlue", "darkCyan", "darkGreen", "darkMagenta", "darkRed", "darkYellow", "darkGray", "lightGray", "unknown"}

on run
	local DocPath, DocumentText, WordSource, WordColor, nHighlightedWords, n
	set myColor to choose from list hColors with title "Windows Highlight Colors" with prompt "Please choose a highlight color..."
	if class of myColor is boolean then return -- user chose 'Cancel'
	set WordColor to item 1 of myColor
	set n to getIndexOfItemInList(WordColor, hColors)
	set WordColor to item n of hColorConstants
	set nHighlightedWords to 0
	tell application "Microsoft Word"
		set DocPath to (full name of document 1)
	end tell
	set WordSource to (quoted form of (POSIX path of DocPath))
	-- Print only word/document.xml to stdout ("-x" would exclude members instead of selecting one)
	set DocumentText to do shell script ("unzip -p " & WordSource & " word/document.xml")
	set text item delimiters to {"<w:highlight w:val=\"" & WordColor & "\""}
	set DocumentText to (rest of text items of DocumentText)
	set text item delimiters to {"<w:t xml:space=\"preserve\">", "<w:t>", "</w:t>"}
	repeat with NextHighlight in DocumentText
		try
			set nHighlightedWords to nHighlightedWords + (count (words of text item 2 of NextHighlight))
		end try
	end repeat
	activate me
	display alert "# of words with highlight color \"" & myColor & "\" is " & nHighlightedWords giving up after 4
	return nHighlightedWords
end run

on getIndexOfItemInList(theItem, theList)
	script L
		property aList : theList
	end script
	repeat with a from 1 to count of L's aList
		if item a of L's aList is theItem then return a
	end repeat
	return 0
end getIndexOfItemInList

Just for the record, the version above that I posted yesterday should be indifferent to the actual names of the highlight colours recorded in the XML...

I would guess that if you crawled back through the depths of time, you would find that Word didn’t always offer the current palette of colours and thus, some of the colour names were already used. Then when they added more colours, they had to play around with the colour names so that old documents wouldn’t break.