The goal:
take PDF > extract text > count number of times each word is used > list 3 or 4 most frequently used words (excluding “and” “the” etc)
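To make the goal concrete, here is a minimal sketch of the counting and "top few" steps, run on a hard-coded string instead of PDF text. The names sample_text, stop_words, and top_n are placeholders made up for illustration, and the stop-word list is only a guess at what should be excluded.

```applescript
-- Sketch only: sample_text, stop_words and top_n are made-up stand-ins.
set sample_text to "the ratio of the cat and the dog is a ratio, not a rate"
set stop_words to {"and", "the", "of", "is", "a", "not"}
set top_n to 3

-- Count every word that is not in the stop-word list.
set frequency_list to {}
repeat with word_ref in (words of sample_text)
	set current_word to contents of word_ref
	if stop_words does not contain current_word then
		set word_info to missing value
		repeat with record_ref in frequency_list
			if the_word of record_ref = current_word then
				set word_info to contents of record_ref
				exit repeat
			end if
		end repeat
		if word_info = missing value then
			set end of frequency_list to {the_word:current_word, the_count:1}
		else
			set the_count of word_info to (the_count of word_info) + 1
		end if
	end if
end repeat

-- Pick the top_n entries by repeatedly taking the largest count not yet used.
set top_words to {}
set used_words to {}
repeat top_n times
	set best to missing value
	repeat with record_ref in frequency_list
		set candidate to contents of record_ref
		if used_words does not contain (the_word of candidate) then
			if best = missing value then
				set best to candidate
			else if (the_count of candidate) > (the_count of best) then
				set best to candidate
			end if
		end if
	end repeat
	if best = missing value then exit repeat
	set end of used_words to the_word of best
	set end of top_words to (the_word of best) & ": " & (the_count of best)
end repeat
return top_words
```

With this sample string, "ratio" should come out first with a count of 2. The selection loop is a plain repeated search for the maximum, which is slow on long documents but keeps the sketch short.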
What I’ve tried:
- Automator to extract text: success
- Ripped off a perfectly good script to get the text into a list and sort it, credit to “AppleScript: The Comprehensive Guide…” (see below)
- Replaced Automator with a shaky script for Skim
Problems and results:
Using Automator:
- gives me a list of words and the number of times they show up… though not always
- format doesn’t look right
Using the Skim script:
- I get a list of seemingly random letters (see below)
Any input would be greatly appreciated.
The script below abandons Automator. The result of the attached script, as listed in the “Events” log, is:
tell application “Skim”
open alias “Macintosh HD:Users:stepheneder:Documents:ratios guided note taking.pdf”
→ document “ratios guided note taking.pdf”
get text for current application
→ rich text of rich text format “e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9”
open rich text of rich text format “e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9”
→ error number -1708
Result:
error “Skim got an error: rich text of rich text format "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9" doesn’t understand the open message.” number -1708 from rich text of rich text format “e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9”
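The "random letters" in the log look like Base64. Here is a small sketch for checking that, assuming the `base64` command-line tool on the Mac accepts `--decode`. The variable encoded_text is just the string copied from the log above.

```applescript
-- The string Skim returned, copied from the event log above.
set encoded_text to "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9"

-- Assumption: the base64 tool that ships with Mac OS X understands --decode.
-- quoted form of protects the string when it is passed to the shell.
set decoded_text to do shell script "printf %s " & quoted form of encoded_text & " | base64 --decode"
return decoded_text
```

Decoded this way, the string is an RTF header (it starts with `{\rtf1\ansi`) with a font table and a color table but no body text. That would suggest Skim handed back an empty rich text rather than the contents of the PDF.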
Script borrowed from “AppleScript: The Comprehensive Guide to Scripting and Automation on Mac OS X” by Hanaan Rosenthal:
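set targetfile to (POSIX file "/Users/Stepheneder/documents/ratios guided note taking.pdf") as alias
tell application "Skim"
	open targetfile
	-- A bare "get text for" was sent to the application itself (see the log above),
	-- so ask the open document for its text instead.
	-- Assumption, not checked against Skim's dictionary: the document's text coerces to plain text.
	set pdftext to (text of document 1) as text
end tell
-- Split the extracted text into words, not the file alias
set word_list to every word of pdftext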
set word_frequency_list to {}
repeat with the_word_ref in word_list
	set the_current_word to contents of the_word_ref
	set word_info to missing value
	repeat with record_ref in word_frequency_list
		if the_word of record_ref = the_current_word then
			-- assign the record to word_info, then end the search
			set word_info to contents of record_ref
			exit repeat
		end if
	end repeat
	-- check to see if we found an existing entry for the current word
	if word_info = missing value then
		-- no matching record was found, so we create a new one
		set word_info to {the_word:the_current_word, the_count:1}
		set end of word_frequency_list to word_info
	else
		-- increment the word count
		set the_count of word_info to (the_count of word_info) + 1
	end if
end repeat
set the_report_list to {}
repeat with word_info in word_frequency_list
	set end of the_report_list to quote & the_word of word_info & quote & " appears " & the_count of word_info & " times."
end repeat
-- Join the report lines with returns, then restore the previous delimiters
set old_delimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to return
set the_report to the_report_list as text
set AppleScript's text item delimiters to old_delimiters
tell application "TextEdit"
	make new document with properties {name:"Word Frequencies", text:the_report}
end tell