#1 2011-01-02 02:37:06 pm

monster40lbs
Member
Registered: 2010-07-31
Posts: 3

Noob wanders desert in search of word frequency script

The goal:
Take a PDF > extract the text > count the number of times each word is used > list the 3 or 4 most frequently used words (excluding "and", "the", etc.)

What I've tried:

- Automator to extract text: success

- Ripped off a perfectly good script to get the text into a list and sort it, credit to "AppleScript: The Comprehensive Guide..." (see below)

- Replaced Automator with a shaky script for Skim



Problems and results:
Using Automator:
- Gives me a list of words and the number of times they show up... though not always
- The format doesn't look right

Using the Skim script:
- I get a list of seemingly random letters (see below)

Any input would be greatly appreciated :)

The script below abandons Automator. The result of the attached script, as listed under "Events", is:

tell application "Skim"
    open alias "Macintosh HD:Users:stepheneder:Documents:ratios guided note taking.pdf"
        --> document "ratios guided note taking.pdf"
    get text for current application
        --> rich text of rich text format "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9"
    open rich text of rich text format "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9"
        --> error number -1708
Result:
error "Skim got an error: rich text of rich text format \"e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9\" doesn’t understand the open message." number -1708 from rich text of rich text format "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcY29jb2FydGYxMDM4XGNvY29hc3VicnRmMzUwCntcZm9udHRibH0Ke1xjb2xvcnRibDtccmVkMjU1XGdyZWVuMjU1XGJsdWUyNTU7fQp9"



Script borrowed from "AppleScript: The Comprehensive Guide to Scripting and Automation on Mac OS X" by Hanaan Rosenthal:

Applescript:

tell application "Finder"
    set targetfolder to (POSIX file "/Users/Stepheneder/documents/ratios guided note taking.pdf") as alias
   
    tell application "Skim"
        open contents of targetfolder
       
        set pdftext to get text for
    end tell
end tell



tell application "TextEdit" to open pdftext
set word_list to every word of targetfolder

set word_frequency_list to {}
repeat with the_word_ref in word_list
   
    set the_current_word to contents of the_word_ref
   
    set word_info to missing value
   
   
    repeat with record_ref in word_frequency_list
       
        if the_word of record_ref = the_current_word then
            --assign the record to word_info, then end the search
            set word_info to contents of record_ref
            exit repeat
        end if
    end repeat
    -- check to see if we found an existing entry for the current word
    if word_info = missing value then
        -- No matching record was found, so we create a new one
        set word_info to {the_word:the_current_word, the_count:1}
        set end of word_frequency_list to word_info
    else
       
        --increment the word count
        set the_count of word_info to (the_count of word_info) + 1
    end if
end repeat

return word_frequency_list

set the_report_list to {}
repeat with word_info in word_frequency_list
    set end of the_report_list to quote & the_word of word_info & quote & " appears " & the_count of word_info & " times."
end repeat

set AppleScript's text item delimiters to return
set the_report to the_report_list as text

tell application "TextEdit"
    make new document with properties {name:"Word Frequencies", text:the_report}
end tell

Last edited by monster40lbs (2011-01-02 06:18:45 pm)


Filed under: text, Word, Counter, frequency


#2 2011-01-02 06:56:37 pm

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 5552

Re: Noob wanders desert in search of word frequency script

This seems to work:

Applescript:

on run
   open {choose file of type "pdf"}
end run

on open theItems
   repeat with thisItem in theItems
       if ((thisItem as text) ends with ".pdf") then main(thisItem)
   end repeat
end open

on main(pdfFile)
   script o
       property wrds : missing value
       property scores : {}
       
       -- Custom comparison handler for the sort.
       -- This one compares the end items of passed lists in such a way as to produce a reversed sort.
       on isGreater(a, b)
           (end of a < end of b)
       end isGreater
   end script
   
   tell application "Skim"
       open pdfFile
       set docName to name of front document
       set o's wrds to words of text of front document
   end tell
   
   -- Sort the list of words into groups of equal words.
   CustomShellSort(o's wrds, 1, -1, {})
   
   -- Go through the sorted list, counting the instances of each word. Store each word and its score in a list in the 'scores' list in the script object above.
   set currentWord to item 1 of o's wrds
   set c to 1
   repeat with i from 2 to (count o's wrds)
       set thisWord to item i of o's wrds
       if (thisWord is currentWord) then
           set c to c + 1
       else
           set end of o's scores to {currentWord, c}
           set currentWord to thisWord
           set c to 1
       end if
   end repeat
   set end of o's scores to {currentWord, c}
   
   -- Reverse-sort the list of word/score lists by the scores themselves.
   CustomShellSort(o's scores, 1, -1, {comparer:o})
   
   -- Report the 4 most frequently used words, if there are that many.
   set n to (count o's scores)
   if (n > 4) then set n to 4
   
   set theReport to "THE " & n & " MOST FREQUENTLY USED WORDS IN \"" & docName & "\":" & return & return
   repeat with i from 1 to n
       set x to item i of o's scores
       set theReport to theReport & "The word \"" & beginning of x & "\" appears " & end of x & " times." & return
   end repeat
   
   tell application "TextEdit" to make new document with properties {name:"Word Frequencies", text:theReport}
end main

on CustomShellSort(theList, l, r, customiser)
   script o
       property comparer : me
       property slave : me
       property lst : theList
       
       on shsrt(l, r)
           set inc to (r - l + 1) div 2
           repeat while (inc > 0)
               slave's setInc(inc)
               repeat with j from (l + inc) to r
                   set v to item j of o's lst
                   repeat with i from (j - inc) to l by -inc
                       tell item i of o's lst
                           if (comparer's isGreater(it, v)) then
                               set item (i + inc) of o's lst to it
                           else
                               set i to i + inc
                               exit repeat
                           end if
                       end tell
                   end repeat
                   set item i of o's lst to v
                   slave's shift(i, j)
               end repeat
               set inc to (inc / 2.2) as integer
           end repeat
       end shsrt
       
       on isGreater(a, b)
           (a > b)
       end isGreater
       
       on shift(a, b)
       end shift
       
       on setInc(a)
       end setInc
   end script
   
   set listLen to (count theList)
   if (listLen > 1) then
       if (l < 0) then set l to listLen + l + 1
       if (r < 0) then set r to listLen + r + 1
       if (l > r) then set {l, r} to {r, l}
       
       if (customiser's class is record) then set {comparer:o's comparer, slave:o's slave} to (customiser & {comparer:o, slave:o})
       
       o's shsrt(l, r)
   end if
   
   return -- nothing.
end CustomShellSort


NG


#3 2011-01-02 07:41:03 pm

regulus6633
Member
From: Taulov, Denmark
Registered: 2006-11-02
Posts: 1695

Re: Noob wanders desert in search of word frequency script

Here's an idea. I use Objective-C to extract the text from the PDF file. It should be the most reliable way to do that, and it doesn't require any extra applications. I included a list of words to ignore at the top of the script, so add words to it as you like. I also chose not to count words that are actually numbers.

At the end of this script I extract the top 5 hits. You can use some of Nigel's techniques above to speed up this script, but I'll leave that for you to do.

Applescript:

set wordsToIgnore to {"and", "the", "a", "for", "in", "is"}
set thePDF to choose file

-- get the text from the pdf
tell application "Automator Runner"
   set theURL to call method "fileURLWithPath:" of class "NSURL" with parameter (POSIX path of thePDF)
   set pdfDoc to call method "initWithURL:" of (call method "alloc" of class "PDFDocument") with parameter theURL
   set theText to call method "string" of pdfDoc
   call method "release" of pdfDoc
end tell

-- setup some variables
set theWords to words of theText -- the words to repeat over
set theWordsCount to count of theWords
set countedWords to wordsToIgnore -- track the words we have already counted
set resultsList to {} -- the list of records of the counted words

-- count the words
repeat with i from 1 to count of theWords
   set thisWord to item i of theWords
   
   -- we don't count numbers
   set isNotNumber to true
   try
       thisWord as number
       set isNotNumber to false
   end try
   
   if isNotNumber and thisWord is not in countedWords then
       set end of countedWords to thisWord
       
       -- get the word count
       set thisWordCount to 0
       repeat with j from 1 to theWordsCount
           if thisWord is (item j of theWords) then set thisWordCount to thisWordCount + 1
       end repeat
       
       -- add this word and count to the resultsList
       set end of resultsList to {thisWord, thisWordCount}
   end if
end repeat

-- sort the resultsList
set sortedList to sortListofLists(resultsList, 2)

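-- extract the top 5 hits (or fewer if there aren't that many), which occur at the end of the sorted list
set topFive to {}
set hitCount to count of sortedList
if hitCount > 5 then set hitCount to 5
repeat with i from 1 to hitCount
   set end of topFive to item (i * -1) of sortedList
end repeat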

return topFive



(********************** SUBROUTINES ***********************)
on sortListofLists(array, sortItemNum) --> this is a slight modification of the bubble sort routine to make it work with a list of lists
   repeat with i from length of array to 2 by -1 -- go backwards
       repeat with j from 1 to i - 1 -- go forwards
           if (item sortItemNum of (item j of array)) > (item sortItemNum of (item (j + 1) of array)) then
               tell array to set {item j, item (j + 1)} to {item (j + 1), item j}
           end if
       end repeat
   end repeat
   return array
end sortListofLists
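
As a rough illustration of the speed-up mentioned before the script, here is a minimal sketch of the script-object trick Nigel's handlers use: the list is held in a script object's property and its items are read through that reference, which is usually much faster on long lists. The handler name countOccurrences and the sample sentence are hypothetical, not part of either script in this thread.

```applescript
-- Sample words standing in for the words extracted from the PDF.
set theWords to words of "the cat and the hat"
return countOccurrences("the", theWords)

-- Hypothetical helper: counts how often theWord occurs in theWords.
on countOccurrences(theWord, theWords)
	-- Hold the list in a script object property so items are accessed by reference.
	script o
		property lst : theWords
	end script
	
	set n to 0
	repeat with j from 1 to (count o's lst)
		-- String comparison ignores case by default in AppleScript.
		if (item j of o's lst) is theWord then set n to n + 1
	end repeat
	return n
end countOccurrences
```

The inner counting loop of the script above could call a handler like this instead of indexing theWords directly; how much time it saves will depend on the length of the document.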

Last edited by regulus6633 (2011-01-02 08:09:32 pm)


#4 2011-01-03 11:01:29 am

monster40lbs
Member
Registered: 2010-07-31
Posts: 3

Re: Noob wanders desert in search of word frequency script

I am more appreciative (and relieved) than one could imagine. Both scripts worked magnificently. I have one further script query and one additional question.

In lieu of excluding words (i.e. "the", "at", "for", etc.), how could I instead make my own list of words which the words from the text MUST MATCH, for example "geometry", "triangles", "Mitosis" or "philosophy"?
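
One possible approach, sketched under assumptions rather than taken from either reply: loop over the include list itself and count how often each of its words occurs in the extracted text. The names wordsToInclude and countIncludedWords are hypothetical, and the sample sentence stands in for the text pulled from the PDF.

```applescript
-- Hypothetical include list: only these words are counted.
set wordsToInclude to {"geometry", "triangles", "mitosis", "philosophy"}

-- Sample text standing in for the text extracted from the PDF.
set theText to "Geometry studies triangles. Triangles have three sides. Geometry is old."

return countIncludedWords(words of theText, wordsToInclude)

on countIncludedWords(theWords, includeList)
	set resultsList to {}
	repeat with includedWord in includeList
		set thisWord to contents of includedWord
		set thisWordCount to 0
		repeat with j from 1 to (count theWords)
			-- String comparison ignores case by default in AppleScript.
			if (item j of theWords) is thisWord then set thisWordCount to thisWordCount + 1
		end repeat
		-- Record only the words that actually occur in the text.
		if thisWordCount > 0 then set end of resultsList to {thisWord, thisWordCount}
	end repeat
	return resultsList
end countIncludedWords
```

The result is a list of {word, count} lists, the same shape regulus6633's script builds, so it could be passed to sortListofLists(resultsList, 2) from that script to rank the matches.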

Finally, how (if it's not so already, I haven't yet tried) are the extracted common words usable for other functions in Automator, such as adding metadata or renaming a file? In other words, can they be slapped onto the clipboard?
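
A minimal sketch of the clipboard part, assuming the results are in the {word, count} shape that regulus6633's script returns. The sample values and the handler name joinWords are made up for illustration; "set the clipboard to" is the Standard Additions command.

```applescript
-- Made-up sample in the {word, count} shape returned by the script earlier in the thread.
set topWords to {{"ratio", 12}, {"fraction", 9}, {"percent", 7}}

set keywordText to joinWords(topWords, ", ")
set the clipboard to keywordText
return keywordText

-- Hypothetical helper: joins the first item of each {word, count} pair into one string.
on joinWords(wordCountList, theDelimiter)
	set justWords to {}
	repeat with thisPair in wordCountList
		set end of justWords to item 1 of thisPair
	end repeat
	-- Temporarily change the text item delimiters, then restore them.
	set oldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set joinedText to justWords as text
	set AppleScript's text item delimiters to oldDelimiters
	return joinedText
end joinWords
```

If the script runs inside an Automator "Run AppleScript" action, the value it returns should also be handed to the next action in the workflow, which may be a cleaner route than the clipboard for renaming files or adding metadata.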

I thank you again, both in advance and in hindsight.

Model: MBP w/ turbo-a/c
Browser: Safari 533.19.4
Operating System: Mac OS X (10.6)


Filed under: text, thanks, Words, frequency
