Noob wanders desert in search of word frequency script

The goal:
take PDF > extract text > count number of times each word is used > list 3 or 4 most frequently used words (excluding “and” “the” etc)

What I’ve tried:

  • Automator to extract text - success

  • Ripped off a perfectly good script to get the text into a list and sort it, credit to “Applescript, the comprehensive guide…” (see below)

  • Replaced Automator with a shaky script for Skim

Problems and results:
Using Automator:

  • gives me a list of words and the number of times they show up… though not always

  • the format doesn’t look right

Using the Skim script:

  • I get a list of seemingly random letters (see below)

Any input would be greatly appreciated :)

The script below abandons Automator. The result of the attached script, as listed in the “Events” log, is:

tell application “Skim”
open alias “Macintosh HD:Users:stepheneder:Documents:ratios guided note taking.pdf”
→ document “ratios guided note taking.pdf”
get text for current application
→ rich text of rich text format “e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9”
open rich text of rich text format “e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9”
→ error number -1708
Result:
error “Skim got an error: rich text of rich text format "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9" doesn’t understand the open message.” number -1708 from rich text of rich text format “e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9”

Script borrowed from “AppleScript: The Comprehensive Guide to Scripting and Automation on Mac OS X” by Hanaan Rosenthal:


tell application "Finder"
set targetfolder to (POSIX file "/Users/Stepheneder/documents/ratios guided note taking.pdf") as alias

tell application "Skim"
	open contents of targetfolder
	
	set pdftext to get text for
end tell

end tell

tell application "TextEdit" to open pdftext
set word_list to every word of targetfolder

set word_frequency_list to {}
repeat with the_word_ref in word_list

set the_current_word to contents of the_word_ref

set word_info to missing value


repeat with record_ref in word_frequency_list
	
	if the_word of record_ref = the_current_word then
		--assign the record to word_info, then end the search
		set word_info to contents of record_ref
		exit repeat
	end if
end repeat
-- check to see if we found an existing entry for the current word
if word_info = missing value then
	-- No matching record was found, so we create a new one
	set word_info to {the_word:the_current_word, the_count:1}
	set end of word_frequency_list to word_info
else
	
	--increment the word count
	set the_count of word_info to (the_count of word_info) + 1
end if

end repeat

return word_frequency_list

set the_report_list to {}
repeat with word_info in word_frequency_list
set end of the_report_list to quote & the_word of word_info & quote & " appears " & the_count of word_info & " times."
end repeat

set AppleScript's text item delimiters to return
set the_report to the_report_list as text

tell application "TextEdit"
make new document with properties {name:"Word Frequencies", text:the_report}
end tell

This seems to work:

on run
	open {choose file of type "pdf"}
end run

on open theItems
	repeat with thisItem in theItems
		if ((thisItem as text) ends with ".pdf") then main(thisItem)
	end repeat
end open

on main(pdfFile)
	script o
		property wrds : missing value
		property scores : {}
		
		-- Custom comparison handler for the sort.
		-- This one compares the end items of passed lists in such a way as to produce a reversed sort.
		on isGreater(a, b)
			(end of a < end of b)
		end isGreater
	end script
	
	tell application "Skim"
		open pdfFile
		set docName to name of front document
		set o's wrds to words of text of front document
	end tell
	
	-- Sort the list of words into groups of equal words.
	CustomShellSort(o's wrds, 1, -1, {})
	
	-- Go through the sorted list, counting the instances of each word. Store each word and its score in a list in the 'scores' list in the script object above.
	set currentWord to item 1 of o's wrds
	set c to 1
	repeat with i from 2 to (count o's wrds)
		set thisWord to item i of o's wrds
		if (thisWord is currentWord) then
			set c to c + 1
		else
			set end of o's scores to {currentWord, c}
			set currentWord to thisWord
			set c to 1
		end if
	end repeat
	set end of o's scores to {currentWord, c}
	
	-- Reverse-sort the list of word/score lists by the scores themselves.
	CustomShellSort(o's scores, 1, -1, {comparer:o})
	
	-- Report the 4 most frequently used words, if there are that many.
	set n to (count o's scores)
	if (n > 4) then set n to 4
	
	set theReport to "THE " & n & " MOST FREQUENTLY USED WORDS IN \"" & docName & "\":" & return & return
	repeat with i from 1 to n
		set x to item i of o's scores
		set theReport to theReport & "The word \"" & beginning of x & "\" appears " & end of x & " times." & return
	end repeat
	
	tell application "TextEdit" to make new document with properties {name:"Word Frequencies", text:theReport}
end main

on CustomShellSort(theList, l, r, customiser)
	script o
		property comparer : me
		property slave : me
		property lst : theList
		
		on shsrt(l, r)
			set inc to (r - l + 1) div 2
			repeat while (inc > 0)
				slave's setInc(inc)
				repeat with j from (l + inc) to r
					set v to item j of o's lst
					repeat with i from (j - inc) to l by -inc
						tell item i of o's lst
							if (comparer's isGreater(it, v)) then
								set item (i + inc) of o's lst to it
							else
								set i to i + inc
								exit repeat
							end if
						end tell
					end repeat
					set item i of o's lst to v
					slave's shift(i, j)
				end repeat
				set inc to (inc / 2.2) as integer
			end repeat
		end shsrt
		
		on isGreater(a, b)
			(a > b)
		end isGreater
		
		on shift(a, b)
		end shift
		
		on setInc(a)
		end setInc
	end script
	
	set listLen to (count theList)
	if (listLen > 1) then
		if (l < 0) then set l to listLen + l + 1
		if (r < 0) then set r to listLen + r + 1
		if (l > r) then set {l, r} to {r, l}
		
		if (customiser's class is record) then set {comparer:o's comparer, slave:o's slave} to (customiser & {comparer:o, slave:o})
		
		o's shsrt(l, r)
	end if
	
	return -- nothing.
end CustomShellSort

Here’s an idea. I use Objective-C to extract the text from the PDF file. It should be the most reliable way to do that, and it doesn’t require any extra applications. I included a list of words to ignore at the top of the script, so add words to it as you like. I also chose not to count words that are actually numbers.

At the end of this script I extract the top 5 hits. You can use some of Nigel’s techniques above to speed up this script, but I’ll leave that for you to do.

set wordsToIgnore to {"and", "the", "a", "for", "in", "is"}
set thePDF to choose file

-- get the text from the pdf
tell application "Automator Runner"
	set theURL to call method "fileURLWithPath:" of class "NSURL" with parameter (POSIX path of thePDF)
	set pdfDoc to call method "initWithURL:" of (call method "alloc" of class "PDFDocument") with parameter theURL
	set theText to call method "string" of pdfDoc
	call method "release" of pdfDoc
end tell

-- setup some variables
set theWords to words of theText -- the words to repeat over
set theWordsCount to count of theWords
set countedWords to wordsToIgnore -- track the words we have already counted
set resultsList to {} -- the list of records of the counted words

-- count the words
repeat with i from 1 to count of theWords
	set thisWord to item i of theWords
	
	-- we don't count numbers
	set isNotNumber to true
	try
		thisWord as number
		set isNotNumber to false
	end try
	
	if isNotNumber and thisWord is not in countedWords then
		set end of countedWords to thisWord
		
		-- get the word count
		set thisWordCount to 0
		repeat with j from 1 to theWordsCount
			if thisWord is (item j of theWords) then set thisWordCount to thisWordCount + 1
		end repeat
		
		-- add this word and count to the resultsList
		set end of resultsList to {thisWord, thisWordCount}
	end if
end repeat

-- sort the resultsList
set sortedList to sortListofLists(resultsList, 2)

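-- extract the top 5 hits (or fewer, if the list is shorter), which occur at the end of the sorted list
set topFive to {}
set hitCount to count of sortedList
if hitCount > 5 then set hitCount to 5
repeat with i from 1 to hitCount
	set end of topFive to item (i * -1) of sortedList
end repeat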

return topFive



(********************** SUBROUTINES ***********************)
on sortListofLists(array, sortItemNum) --> this is a slight modification of the bubble sort routine to make it work with a list of lists
	repeat with i from length of array to 2 by -1 -- go backwards
		repeat with j from 1 to i - 1 -- go forwards
			if (item sortItemNum of (item j of array)) > (item sortItemNum of (item (j + 1) of array)) then
				tell array to set {item j, item (j + 1)} to {item (j + 1), item j}
			end if
		end repeat
	end repeat
	return array
end sortListofLists
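
On the speed-up mentioned above: the technique in Nigel's script is to keep the list in a script object property and read its items through that property (as in "item i of o's lst"), which is usually much faster on long lists. A minimal sketch of that idea applied to the inner counting loop, using a made-up sample text and hypothetical names (wordList, targetWord, hits):

```applescript
-- Stand-in for the text extracted from the PDF (an assumption for this sketch).
set theText to "ratio ratio triangle ratio triangle similar"

-- Hold the word list in a script object property so items can be read by reference.
script o
	property wordList : missing value
end script
set o's wordList to words of theText

-- Count how often one word occurs, reading each item through the script object.
set targetWord to "ratio"
set hits to 0
repeat with j from 1 to (count o's wordList)
	if (item j of o's wordList) is targetWord then set hits to hits + 1
end repeat

return hits
```

The same substitution should carry over to the main loop of the script above: read "item j of o's wordList" wherever it currently reads "item j of theWords".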

I am more appreciative (and relieved) than one could imagine. Both scripts worked magnificently. I have one further script query and one additional question.

In lieu of excluding words (i.e. “the”, “at”, “for”, etc.), how could I instead make my own list of words that the counted words MUST BE IN? For example: “geometry”, “triangles”, “Mitosis”, “philosophy”, etc.
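
One way the "must include" idea could be sketched, assuming the text has already been extracted: keep one counter per listed word and only count words that appear in that list. The names wordsToCount and theCounts and the sample text are made up for illustration; they are not from either script above.

```applescript
-- Hypothetical whitelist; replace with your own subject words.
set wordsToCount to {"geometry", "triangles", "mitosis", "philosophy"}

-- Stand-in for the text extracted from the PDF.
set theText to "Triangles and geometry: similar triangles share ratios."
set theWords to words of theText

-- One counter per whitelist entry, in the same order as wordsToCount.
set theCounts to {}
repeat (count wordsToCount) times
	set end of theCounts to 0
end repeat

repeat with i from 1 to (count theWords)
	set thisWord to item i of theWords
	repeat with j from 1 to (count wordsToCount)
		-- Text comparison ignores case by default, so "Mitosis" would match "mitosis".
		if thisWord is (item j of wordsToCount) then
			set item j of theCounts to (item j of theCounts) + 1
			exit repeat
		end if
	end repeat
end repeat

-- Pair each whitelist word with its count, skipping words that never appear.
set resultsList to {}
repeat with j from 1 to (count wordsToCount)
	if (item j of theCounts) > 0 then set end of resultsList to {item j of wordsToCount, item j of theCounts}
end repeat

return resultsList
```

The result is a list of {word, count} pairs, the same shape as resultsList in the script above, so it could be passed to the same sort routine.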

Finally, how (if it’s not so already, I haven’t yet tried) can the extracted common words be used by other functions in Automator, such as adding metadata or renaming a file? In other words, can they be slapped onto the clipboard?
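
A rough sketch of the clipboard idea, assuming the result is a list of {word, count} pairs like the topFive list above. The sample values and the names topWords and keywordText are invented for illustration.

```applescript
-- Stand-in for the list of {word, count} pairs returned by the script above.
set topFive to {{"ratio", 12}, {"triangle", 9}, {"similar", 7}}

-- Keep just the words.
set topWords to {}
repeat with hitRef in topFive
	set end of topWords to item 1 of hitRef
end repeat

-- Join the words with ", " and restore the delimiters afterwards.
set oldDelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to ", "
set keywordText to topWords as text
set AppleScript's text item delimiters to oldDelims

-- Put the joined words on the clipboard.
set the clipboard to keywordText

-- In an Automator "Run AppleScript" action, the returned value is passed to the next action.
return keywordText
```

Inside Automator, the returned text should then be available as input to a following action, so the clipboard step may not even be needed there.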

I thank you again, both in advance and in hindsight.

Model: MBP w/ turbo-a/c
Browser: Safari 533.19.4
Operating System: Mac OS X (10.6)