Word processing, how to create a list sorted by word or by frequency

I want to count how many times each word in a document appears in that document. For example, if the word Hogmanay appears 10 times and Christmas 4 times in a text of 5000 words, I would like to see that in a list. Words like ‘and’ or ‘the’ can be included; I do not mind.
However, I want every word in the 5000-word document to be included in the list. The list might be 100 or 500 words long or more, with some words scoring high while others score maybe 1 or 2.

Using AppleScript, is there a way to do so in Word or TextEdit or OpenOffice or Pages or TextWrangler or…
I know there is free software for Windows that does this, but I have no access to Windows at all. Is there free Mac software? Or is it built in somewhere? Or, as I said, can it be done via AppleScript or Automator?

Thanks

Quick and dirty answer:

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

# storing the lists in a script object speeds up access to their items
script o
	property theWords : {}
	property theCounts : {}
	property theFullList : {}
end script

on indexOf:aValue inList:theList
	set theArray to current application's NSArray's arrayWithArray:theList
	set theIndex to theArray's indexOfObject:aValue
	if theIndex = current application's NSNotFound then
		return 0
	else
		return (theIndex + 1)
	end if
end indexOf:inList:

tell application "TextEdit" to tell document 1
	set theText to its text
end tell

set o's theFullList to words of theText
repeat with aWord in o's theFullList
	set maybe to (its indexOf:(aWord as text) inList:(o's theWords))
	if maybe = 0 then
		set end of o's theWords to (aWord as text)
		set end of o's theCounts to 1
	else
		set item maybe of o's theCounts to (item maybe of o's theCounts) + 1
	end if
end repeat
repeat with i from 1 to count o's theWords
	set maybe to item i of o's theWords as text
	set item i of o's theWords to {maybe, (item i of o's theCounts as integer)}
end repeat
o's theWords

Yvan KOENIG running Sierra 10.12.2 in French (VALLAURIS, France) Saturday 31 December 2016 16:46:45

I wish to post a work in progress.

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

script o
	property theWords : {}
	property theCounts : {}
	property byWords : {}
	property byFrequency : {}
end script

tell application "TextEdit" to tell document 1
	set theText to its text
end tell

set o's theWords to words of theText
set ArrayOfWords to current application's NSMutableArray's arrayWithArray:{}
set ArrayOfCapitalizedWords to current application's NSMutableArray's arrayWithArray:{}
--set ArrayOfCounts to current application's NSMutableArray's arrayWithArray:{}
repeat with aWord in o's theWords
	set aWord to (current application's NSString's stringWithString:aWord)
	set oneWord to (aWord's uppercaseString()) -- as text
	set maybe to (ArrayOfCapitalizedWords's indexOfObject:oneWord)
	if maybe = current application's NSNotFound then
		(ArrayOfWords's addObject:aWord)
		(ArrayOfCapitalizedWords's addObject:oneWord)
		set end of o's theCounts to 1
	else
		# grabs the old count and add 1
		set newCount to (item (maybe + 1) of o's theCounts) + 1
		set (item (maybe + 1) of o's theCounts) to newCount
	end if
end repeat
set o's theWords to ArrayOfWords as list
copy o's theWords to o's byWords
copy o's theWords to o's byFrequency

set space3 to space & space & space
set i to 0
repeat with aWord in o's theWords
	set aWord to aWord as text # Required
	set i to i + 1
	set item i of o's theWords to {aWord, (item i of o's theCounts as integer)}
	set item i of o's byWords to aWord & ", " & (item i of o's theCounts as integer)
	set item i of o's byFrequency to text -3 thru -1 of (space3 & (item i of o's theCounts as integer)) & ", " & aWord
end repeat

set theArray to current application's NSArray's arrayWithArray:(o's byWords)
set theArray to theArray's sortedArrayUsingSelector:"localizedStandardCompare:"
set o's byWords to theArray as list

set theArray to current application's NSArray's arrayWithArray:(o's byFrequency)
set theArray to theArray's sortedArrayUsingSelector:"localizedStandardCompare:"
set o's byFrequency to theArray as list

{o's theWords, o's byWords, o's byFrequency}

I wanted to use a mutable array for the counts, but I don’t know the correct syntax for changing the value of an item in a mutable array; my attempt at an array to store the occurrences of the words was wrong.
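
For reference, a minimal sketch of one way to change an item in a mutable array, assuming NSMutableArray's replaceObjectAtIndex:withObject: is acceptable here. The variable names and the sample counts are hypothetical and not taken from the script above.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample data: three words, each seen once so far.
set countsArray to current application's NSMutableArray's arrayWithArray:{1, 1, 1}

-- NSMutableArray indexes are zero-based, unlike AppleScript list indexes,
-- so the value returned by indexOfObject: can be used as it is.
set theIndex to 1
set oldCount to (countsArray's objectAtIndex:theIndex) as integer
countsArray's replaceObjectAtIndex:theIndex withObject:(oldCount + 1)

-- Coerce back to an AppleScript list to inspect the result.
countsArray as list
```

With this approach the counts could stay in step with ArrayOfWords without the (maybe + 1) adjustment needed for an AppleScript list.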

CAUTION: As indexOfObject is case sensitive, the first version returned a wrong list. In my test file, “the” and “The” are treated as different words.

Yvan KOENIG running Sierra 10.12.2 in French (VALLAURIS, France) Saturday 31 December 2016 18:06:11

Here is a relatively efficient non-ASOC way to do it.

set message to "I want to count how many times each word in a document appears in that document.  Example, if the word Hogmanay is in there 10 times and Christmas 4 times in a text with 5000 words I like to see it in a list.  Words like 'and' or 'the' can be included I do not mind.   
However I want all of the 5000 words to be included in the list that are in the document.  The list might be 100 or 500 words long or more, with some words scoring high while others score maybe 1 or 2.  

Using AppleScript, is there a way to do so in Word or Text Edit or Open Office or Pages or TextWrangler or...
I know there is free software for Windows not this but I have no access to Windows at all.  Is there free Mac software?  Or is it build in somewhere?  Or, as said via Apple Script or Automator?    

Thanks"'s words

#make records 
set freqRec to {}
repeat with focus from 1 to count message
	set counter to 0
	repeat with comparator from 1 to count message
		if my message's item comparator is my message's item focus then set counter to counter + 1
	end repeat
	set freqRec to my freqRec & (run script "{|" & my message's item focus & "|: " & counter & "}")
end repeat
#hack error to extract keys
try
	freqRec as text
on error err
end try
set AppleScript's text item delimiters to {"|", "Can't make ", " into type text.", "{", "}"}
set freqRec to {err's text items}
set AppleScript's text item delimiters to {""}
set freqRec to freqRec as text
set AppleScript's text item delimiters to {linefeed, ", "}
set freqRec to freqRec's text items
set freqRec to freqRec as text
#sort
(do shell script "echo " & (freqRec)'s quoted form & " | sort -df") --'s paragraphs

Another ASObjC offering:

use AppleScript version "2.4"
use framework "Foundation"

tell application "TextEdit" to set wordList to words of text of document 1

-- Get same-case versions of all the words.
set lowercasedWords to (current application's class "NSArray"'s arrayWithArray:(wordList))'s valueForKey:("lowercaseString")
-- Use an NSCountedSet to count the number of each.
set countedWords to current application's class "NSCountedSet"'s setWithArray:(lowercasedWords)
-- Create an array of dictionaries, each containing a word and its count.
set resultArray to current application's class "NSMutableArray"'s new()
set wordEnumerator to countedWords's objectEnumerator()
repeat (countedWords's |count|()) times
	set thisWord to wordEnumerator's nextObject()
	set thisCount to countedWords's countForObject:(thisWord)
	tell resultArray to addObject:({|word|:thisWord, |count|:thisCount})
end repeat
-- Reverse-sort the array on the counts.
set sortOnCount to current application's class "NSSortDescriptor"'s sortDescriptorWithKey:("count") ascending:(false)
tell resultArray to sortUsingDescriptors:({sortOnCount})
-- Coerce back to list and return the result
return resultArray as list

That’s definitely my last script this year. :wink:

And here are my first :slight_smile:

This is similar to Nigel’s:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theWords to words of (the clipboard)
-- make counted set of all words in lowercase
set theWords to (current application's NSArray's arrayWithArray:theWords)'s valueForKey:"lowercaseString"
set theCountedSet to current application's NSCountedSet's setWithArray:theWords
-- get array of unique words
set uniqueWords to theCountedSet's allObjects()
-- build array of dictionaries containing both the words and their counts
set theList to current application's NSMutableArray's array()
repeat with aWord in uniqueWords
	(theList's addObject:{theWord:aWord, theCount:(theCountedSet's countForObject:aWord)})
end repeat
-- sort the array first by the count and second by the word
set desc1 to current application's NSSortDescriptor's sortDescriptorWithKey:"theCount" ascending:false
set desc2 to current application's NSSortDescriptor's sortDescriptorWithKey:"theWord" ascending:true
theList's sortUsingDescriptors:{desc1, desc2}
-- convert to tab-delimited text in form <word><tab><count><linefeed>
set newList to {}
repeat with aDict in theList
	set end of newList to ((aDict's objectForKey:"theWord") as text) & tab & (aDict's objectForKey:"theCount") as text
end repeat
set saveTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to {linefeed}
set newList to newList as text
set AppleScript's text item delimiters to saveTID
return newList

This variation tries to retain the case of words that always appear in other than all-lowercase, like the OP’s examples of Hogmanay and Christmas:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theWords to words of (the clipboard)
-- make array of words and matching lowercase array
set theWords to current application's NSArray's arrayWithArray:theWords
set theWordsLower to theWords's valueForKey:"lowercaseString"
-- make counted set of words in lowercase
set theCountedSet to current application's NSCountedSet's setWithArray:theWordsLower
-- get array of unique words
set uniqueWords to theCountedSet's allObjects()
-- build array of dictionaries containing both the words and their counts
set theList to current application's NSMutableArray's array()
repeat with aWord in uniqueWords
	(theList's addObject:{theWord:aWord, theCount:(theCountedSet's countForObject:aWord)})
end repeat
-- sort the array first by the count and second by the word
set desc1 to current application's NSSortDescriptor's sortDescriptorWithKey:"theCount" ascending:false
set desc2 to current application's NSSortDescriptor's sortDescriptorWithKey:"theWord" ascending:true
theList's sortUsingDescriptors:{desc1, desc2}
-- convert to tab-delimited text in form <word><tab><count><linefeed>
set newList to {}
repeat with aDict in theList
	set oneWord to (aDict's objectForKey:"theWord")
	if (theWords's containsObject:oneWord) as boolean is false then
		-- the original list didn't contain the lowercase version, so look up original array
		set theIndex to (theWordsLower's indexOfObject:oneWord)
		set oneWord to (theWords's objectAtIndex:theIndex)
	end if
	set end of newList to (oneWord as text) & tab & (aDict's objectForKey:"theCount") as text
end repeat
set saveTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to {linefeed}
set newList to newList as text
set AppleScript's text item delimiters to saveTID
return newList

WOW! First of all happy new year folks and thank you for posting all your replies. Overwhelmed for choice! They all work and I need to test which one fits my purpose best.

Again a thousand thanks.

Exactly the same except for the word source, the style, the subsort on words, and the conversion to text at the end. :wink:

Things get difficult once you start down that road. :confused: If the text contains something like “It is important to distinguish between saLT and SALT. I’m going to be talking exclusively about the latter,” your solution only lists the former, with the combined count of both. (This can be got round by doing the case corrections before putting the words into the NSCountedSet.) And of course if the text also contains “salt”, only it will be listed. If case is important, it may be better not to convert to lower case in the first place and to leave it to the user to interpret the results. Or perhaps to write something to specific requirements.
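
As a rough illustration of the "don't convert to lower case in the first place" option, here is a minimal sketch that counts each spelling separately. The sample text and variable names are made up for the example; only NSArray and NSCountedSet calls already used in this thread are assumed.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample text based on the saLT/SALT example.
set sampleText to "It is important to distinguish between saLT and SALT. Ordinary salt is something else, and SALT is the subject here."
set wordArray to current application's NSArray's arrayWithArray:(words of sampleText)

-- No lowercasing: every distinct spelling gets its own count.
set countedWords to current application's NSCountedSet's setWithArray:wordArray

-- Build {word, count} pairs from an AppleScript list of the unique spellings.
set uniqueWords to (countedWords's allObjects()) as list
set resultList to {}
repeat with aWord in uniqueWords
	set thisWord to contents of aWord
	set end of resultList to {thisWord, (countedWords's countForObject:thisWord)}
end repeat
resultList
```

The order of the pairs is not defined, since a set is unordered; a sort like the ones in the scripts above would still be needed for presentation.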

But given your variation, you can lose the second repeat and the TIDs by doing the substitutions in the first repeat and including a formatted string in each dictionary. I’ve reverted to my own style here:

use AppleScript version "2.4"
use framework "Foundation"

tell application "TextEdit" to set wordList to words of text of document 1

set |⌘| to current application
set originalWords to |⌘|'s class "NSArray"'s arrayWithArray:(wordList)
-- Get same-case versions of all the words.
set lowercasedWords to originalWords's valueForKey:("lowercaseString")
-- Use an NSCountedSet to count the number of each.
set countedWords to |⌘|'s class "NSCountedSet"'s setWithArray:(lowercasedWords)
-- Create an array of dictionaries, each containing a word, its count, and a formatted string containing both.
set resultArray to |⌘|'s class "NSMutableArray"'s new()
set wordEnumerator to countedWords's objectEnumerator()
set presentationFormat to |⌘|'s class "NSString"'s stringWithString:("%@" & tab & "%@")
repeat (countedWords's |count|()) times
	set thisWord to wordEnumerator's nextObject()
	set thisCount to countedWords's countForObject:(thisWord)
	-- If this word never appears entirely lower-cased in the text, substitute the (first!) original version.
	if not ((originalWords's containsObject:(thisWord)) as boolean) then
		set firstOriginalIndex to lowercasedWords's indexOfObject:(thisWord)
		set thisWord to originalWords's objectAtIndex:(firstOriginalIndex)
	end if
	set thisString to |⌘|'s class "NSString"'s stringWithFormat_(presentationFormat, thisWord, thisCount)
	tell resultArray to addObject:({|word|:thisWord, |count|:thisCount, |string|:thisString})
end repeat
-- Reverse-sort the array on the counts, subsorting forwards on the words.
set reverseSortOnCount to |⌘|'s class "NSSortDescriptor"'s sortDescriptorWithKey:("count") ascending:(false)
set forwardSortOnWord to |⌘|'s class "NSSortDescriptor"'s sortDescriptorWithKey:("word") ascending:(true) selector:("localizedCaseInsensitiveCompare:")
tell resultArray to sortUsingDescriptors:({reverseSortOnCount, forwardSortOnWord})
-- Extract the strings, join them with linefeeds, and return as Applescript text.
return ((resultArray's valueForKey:("string"))'s componentsJoinedByString:(linefeed)) as text

More in theory than in practice, I suspect. I mean, the whole exercise has a certain lack of precision starting from the definition of words.

If someone really has used Christmas and christMas and in the reverse order, yes, they’ll get an odd result. But I think that’s really a case of GIGO.

Thanks! So much choice! This one surely catches the differences in writing.

This version is modified to use some of Nigel’s efficiencies, and changing the case logic slightly: mixed-case words will be added to the list in the case used only if they are cased consistently throughout, otherwise they will be added in lowercase.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theWords to words of (the clipboard)
-- make array of words and matching lowercase array
set theWords to current application's NSArray's arrayWithArray:theWords
set theWordsLower to theWords's valueForKey:"lowercaseString"
-- make counted set of words in lowercase
set theCountedSet to current application's NSCountedSet's setWithArray:theWordsLower
set rawCountedSet to current application's NSCountedSet's setWithArray:theWords
-- get array of unique words
set uniqueWords to theCountedSet's allObjects()
-- build array of dictionaries containing both the words and their counts
set theList to current application's NSMutableArray's array()
repeat with aWord in uniqueWords
	set thisCount to (theCountedSet's countForObject:aWord)
	if (theWords's containsObject:aWord) as boolean is false then
		-- the original list didn't contain the lowercase version, so look up original array
		set theIndex to (theWordsLower's indexOfObject:aWord)
		set casedWord to (theWords's objectAtIndex:theIndex)
		-- check if all instances match this case
		if (rawCountedSet's countForObject:casedWord) = thisCount then set aWord to casedWord
	end if
	set thisString to current application's NSString's stringWithFormat_("%@	%@", aWord, thisCount)
	(theList's addObject:{theWord:aWord, theCount:thisCount, theString:thisString})
end repeat
-- sort the array first by the count and second by the word
set desc1 to current application's NSSortDescriptor's sortDescriptorWithKey:"theCount" ascending:false
set desc2 to current application's NSSortDescriptor's sortDescriptorWithKey:"theWord" ascending:true selector:"localizedCaseInsensitiveCompare:"
theList's sortUsingDescriptors:{desc1, desc2}
return ((theList's valueForKey:"theString")'s componentsJoinedByString:(linefeed)) as text

Edited as per Nigel’s comments below.

Thanks Shane!

Hi Shane.

It’s been a bit difficult to follow the logic of what you’ve done as the code in this thread keeps switching back to your variable names (which I find rather cryptic) and style.

Your latest variation returns the lower-cased words sorted as expected followed by the upper- or mixed-cased words forward-sorted on the words. It turns out that this is because your theCount is being set to 0 for these words, which in turn is because they’re not in your theCountedSet.

		-- check if all instances match this case
		if (rawCountedSet's countForObject:casedWord) = thisCount then set aWord to casedWord -- aWord is now casedWord
	end if
	set thisString to current application's NSString's stringWithFormat_("%@	%@", aWord, thisCount) -- The string contains the lower-case count
	(theList's addObject:{theWord:aWord, theCount:(theCountedSet's countForObject:aWord), theString:thisString}) -- aWord isn't in theCountedSet when it's casedWord.

Edit: Since it’s still only one manifestation of each word that’s used, I think the cure is simply to use the lower-case counts as before:

		-- check if all instances match this case
		if (rawCountedSet's countForObject:casedWord) = thisCount then set aWord to casedWord
	end if
	set thisString to current application's NSString's stringWithFormat_("%@	%@", aWord, thisCount)
	(theList's addObject:{theWord:aWord, theCount:thisCount, theString:thisString}) -- Use thisCount (the total number of instances of the word in any case, from theCountedSet)
end repeat

Just a question.

What is the need for the parameter selector:"localizedCaseInsensitiveCompare:"?

I tried with
selector:"localizedStandardCompare:"
selector:()
and even with the shorter syntax:
set desc2 to current application's NSSortDescriptor's sortDescriptorWithKey:"theWord" ascending:true

and I got exactly the same results.

Yvan KOENIG running Sierra 10.12.2 in French (VALLAURIS, France) Monday 2 January 2017 16:26:06

Hi Yvan.

It’s just to ensure that mixed-case words get sorted case-insensitively, otherwise all words beginning with the lower-case form of a letter will be sorted after all words beginning with the upper-case form (and which occur the same number of times, in these scripts).

As you rightly say, selector:("localizedStandardCompare:") could be used instead, as this “Compares strings as sorted by the Finder”, although the documentation adds that: “The exact sorting behavior of this method is different under different locales and may be changed in future releases. This method uses the current locale.”

selector:() compiles as selector:{} but doesn’t seem to cause any problems. However, using it or the form without selector: results in a case-sensitive sort. So applying the scripts to the text “Anthony the aardvark”:

With selector:("localizedCaseInsensitiveCompare:") or selector:("localizedStandardCompare:"):

aardvark 1
Anthony 1
the 1

With selector:{} or without selector: :

Anthony 1
aardvark 1
the 1
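
The difference can also be seen in isolation with a small sketch. It sorts the same three words directly with sortedArrayUsingSelector: rather than through the sort descriptors used in the scripts, on the assumption that a descriptor created without a selector falls back to compare:. The variable names are hypothetical.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set wordArray to current application's NSArray's arrayWithArray:(words of "Anthony the aardvark")

-- compare: is case-sensitive, so capitalised words sort ahead of lower-case ones.
set caseSensitive to wordArray's sortedArrayUsingSelector:"compare:"
-- localizedCaseInsensitiveCompare: ignores case when ordering the words.
set caseInsensitive to wordArray's sortedArrayUsingSelector:"localizedCaseInsensitiveCompare:"

{caseSensitive as list, caseInsensitive as list}
```

The two sublists should correspond to the two orderings listed above, without the counts.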

Thanks Nigel

The selector changed nothing because Shane’s script moved the words beginning with an uppercase letter to the very end of the list.
I haven’t tested it, but it may make a difference in the script edited according to your latest proposal.

Yvan KOENIG running Sierra 10.12.2 in French (VALLAURIS, France) Monday 2 January 2017 17:43:23

Mea culpa. Now fixed.

So it checks whether an entry in the original list is identical to its equivalent in the lowercase list. If so, there’s nothing to do; if not, it checks if it appears the same number of times as in the lowercase version, which would mean it was always cased consistently.
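
That check can be pulled out into a small handler to make the logic easier to follow. This is only a sketch under stated assumptions: the handler name displayFormFor:inWords: and the sample sentence are hypothetical, and the handler assumes the lowercased word really does occur somewhere in the list.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: returns the form in which a lowercased word would be listed.
-- Assumes lowerWord occurs (in some casing) in wordList.
on displayFormFor:lowerWord inWords:wordList
	set originalWords to current application's NSArray's arrayWithArray:wordList
	set lowerWords to originalWords's valueForKey:"lowercaseString"
	-- The all-lowercase form appears in the text, so there is nothing to do.
	if (originalWords's containsObject:lowerWord) as boolean then return lowerWord
	set lowerCounts to current application's NSCountedSet's setWithArray:lowerWords
	set rawCounts to current application's NSCountedSet's setWithArray:originalWords
	-- Look up the first cased spelling found in the text.
	set firstIndex to lowerWords's indexOfObject:lowerWord
	set casedWord to originalWords's objectAtIndex:firstIndex
	-- Equal counts mean every occurrence uses this one casing.
	if (rawCounts's countForObject:casedWord) = (lowerCounts's countForObject:lowerWord) then return casedWord as text
	return lowerWord
end displayFormFor:inWords:

set sampleWords to words of "Christmas comes but once; christMas never. Hogmanay follows, and Hogmanay ends the year."
set firstForm to my displayFormFor:"christmas" inWords:sampleWords
set secondForm to my displayFormFor:"hogmanay" inWords:sampleWords
{firstForm, secondForm}
```

In the sample, "Christmas" is cased inconsistently and so falls back to lower case, while "Hogmanay" is cased the same way both times and keeps its capital.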

Thanks, Shane.