Thursday, September 2, 2010

#1 2008-04-30 09:32:17 am

Martin Michel
Administrator
From: Berlin, Germany
Registered: 2008-03-03
Posts: 655

Batch convert PDF documents to plain text

Inspired by a recent post on the AppleScript forums, I wrote a little AppleScript droplet that converts PDF documents to plain text using the free and versatile PDF viewer Skim.

This PDF2TXT conversion can be quite useful at times when you just want to extract the text information from a PDF file and don't need the fancy images and stylish layout. For example, I often have to extract text from PDF catalogs and brochures sent by our suppliers to populate our various internal databases with the specific corresponding product data (CAS number, chemical synonyms, functions, etc.).

Please note that you need to install Skim on your Mac to successfully run the AppleScript, which I named Skimiks (yes, a palindrome).

The code was tested on Mac OS X 10.5.2 with Skim 1.1.2.

To use Skimiks on your fine Mac, click 'Open this Scriplet in your Editor' below and then save the script as an application bundle.

After dropping a bunch of PDF documents onto the Skimiks droplet, you can choose whether you want to create one text file per page or one per PDF document.

The text files (featuring UTF-8 text encoding) are always created in the parent folder of the PDF document and feature the following naming scheme:

One text file per page: «pdffilename_pageno.txt»
One text file per PDF document: «pdffilename.txt»

Existing text files are not replaced.
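
To make the naming scheme concrete, here is a minimal sketch of how the two file names can be derived from a PDF file name. The handler name txtfilenamefor and the use of missing value to mean "whole document" are assumptions made for this illustration only; they are not part of Skimiks, which builds the names inline.

```applescript
-- Hypothetical helper for illustration (not part of the Skimiks script).
-- pdfname: a file name ending in ".pdf", e.g. "catalog.pdf"
-- pageno:  a page number, or missing value for the one-file-per-PDF mode
on txtfilenamefor(pdfname, pageno)
	-- drop the last four characters (".pdf") to get the base name
	set basename to text 1 thru -5 of pdfname
	if pageno is missing value then
		return basename & ".txt"
	else
		return basename & "_" & pageno & ".txt"
	end if
end txtfilenamefor

txtfilenamefor("catalog.pdf", 3) --> "catalog_3.txt"
txtfilenamefor("catalog.pdf", missing value) --> "catalog.txt"
```

Like the droplet itself, the sketch assumes the name ends with a four-character ".pdf" extension.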

AppleScript:


-- created: 04.04.2008
-- tested on:
-- • Mac OS X 10.5.2
-- • Skim 1.1.2

-- This script converts dropped PDF documents to plain text files with UTF-8
-- text encoding using the free PDF viewer Skim, available at <http://skim-app.sourceforge.net>.
--
-- You can choose whether you want to create one text file per page
-- or per PDF document.
--
-- The text files are always created in the
-- parent folder of the PDF document and feature the following naming
-- scheme: pdffilename_pageno.txt (one text file per page) or
-- pdffilename.txt (one text file per PDF document)
--
-- Existing text files are not replaced.

property mytitle : "Skimiks"

-- I am called when the user opens the script with a double-click
on run
   set infomsg to "I am a hungry AppleScript droplet!" & return & return & "Drop a bunch of PDF documents onto my icon to convert them to plain text files using Skim. The PDF files are not modified."
   my dspinfomsg(infomsg)
end run

-- I am called when the user drops Finder items onto the script icon
on open droppeditems
   set pdffilepaths to {}
   repeat with droppeditem in droppeditems
       if (droppeditem as Unicode text) ends with ".pdf" then
           set pdffilepaths to pdffilepaths & (droppeditem as Unicode text)
       end if
   end repeat
   if pdffilepaths is {} then
       set errmsg to "You did not drop any PDF documents onto me."
       my dsperrmsg(errmsg, "--")
   else
       set mode to my askformode()
       set closeskim to false
       if not my appisrunning("Skim") then
           set closeskim to true
       end if
       repeat with pdffilepath in pdffilepaths
           try
               my pdf2txt(pdffilepath, mode)
           on error errmsg number errnum
               my dsperrmsg(errmsg, errnum)
           end try
       end repeat
       if closeskim is true then
           tell application "Skim"
               quit
           end tell
       end if
   end if
end open

-- I am converting the PDF documents to plain text files using Skim
-- <http://skim-app.sourceforge.net>
on pdf2txt(pdffilepath, mode)
   set pdffileinfo to info for (pdffilepath as alias)
   -- strip the ".pdf" extension; a text range does not depend on the current text item delimiters
   set pdffilename to text 1 thru -5 of (name of pdffileinfo)
   set parentfolderpath to my getparentfolderpath(pdffilepath)
   set txtfilecreated to false
   tell application "Skim"
       open (pdffilepath as alias)
       set pdfpages to get pages for document 1
       set countpdfpages to length of pdfpages
       repeat with i from 1 to countpdfpages
           set pdfpage to item i of pdfpages
           set pagetext to get text for pdfpage
           if mode is "oneperpage" then
               set txtfilepath to parentfolderpath & pdffilename & "_" & i & ".txt"
               if not my macitempathexists(txtfilepath) then
                   my writetofile(pagetext, txtfilepath, "write")
               end if
           else if mode is "oneperpdf" then
               set txtfilepath to parentfolderpath & pdffilename & ".txt"
               if i is equal to 1 then
                   if not my macitempathexists(txtfilepath) then
                       my writetofile(pagetext, txtfilepath, "write")
                       set txtfilecreated to true
                   end if
               else
                   if txtfilecreated is true then
                       my writetofile(pagetext, txtfilepath, "append")
                   end if
               end if
           end if
       end repeat
       close document 1
   end tell
end pdf2txt

-- I am asking the user to choose a mode for the text file creation:
-- One text file per page or PDF document
on askformode()
   tell me
       activate
       display dialog "Do you want to create one text file per page or one per PDF document?" buttons {"Cancel", "One per page", "One per PDF"} default button 3 with icon (POSIX file "/Applications/TextEdit.app/Contents/Resources/txt.icns") with title mytitle
       set dlgresult to result
   end tell
   if button returned of dlgresult is "One per PDF" then
       return "oneperpdf"
   else if button returned of dlgresult is "One per page" then
       return "oneperpage"
   end if
end askformode

-- I am returning the parent folder path of a given item path (Mac)
-- (thanks to Peter Fischer from scriptmymac.de for this function!)
on getparentfolderpath(itempath)
   set itempath to itempath as Unicode text
   set olddelims to AppleScript's text item delimiters
   set AppleScript's text item delimiters to ":"
   set countitems to (count text items of itempath)
   set lastitem to the last text item of itempath
   if lastitem = "" then
       set countitems to countitems - 2
   else
       set countitems to countitems - 1
   end if
   set parentfolderpath to text 1 thru text item countitems of itempath & ":"
   set AppleScript's text item delimiters to olddelims
   return parentfolderpath
end getparentfolderpath

-- I am indicating if a given item path exists
on macitempathexists(macitempath)
   try
       set macitemalias to macitempath as alias
       return true
   on error
       return false
   end try
end macitempathexists

-- I am indicating if a given application is currently running or not
-- only the (full) application name must be given, e.g. "Address Book"
on appisrunning(appname)
   tell application "System Events"
       set processnames to name of every process
   end tell
   if appname is in processnames then
       return true
   else
       return false
   end if
end appisrunning

-- I am writing given content to a given file using UTF-8 text encoding
on writetofile(cont, filepath, mode)
   try
       set openfile to open for access file filepath with write permission
       if mode is "write" then
           set eof of openfile to 0
           set BOM_UTF8 to ((ASCII character 239) & (ASCII character 187) & (ASCII character 191))
           write BOM_UTF8 to openfile
       else if mode is "append" then
           set eof of openfile to (get eof of openfile)
       end if
       write cont to openfile as «class utf8» starting at eof
       close access openfile
       return true
   on error
       try
           close access openfile
       end try
       return false
   end try
end writetofile

-- I am displaying info messages
on dspinfomsg(infomsg)
   tell me
       activate
       display dialog infomsg buttons {"OK"} default button 1 with icon note with title mytitle
   end tell
end dspinfomsg

-- I am displaying error messages
on dsperrmsg(errmsg, errnum)
   set errmsg to "Sorry, an error occurred:" & return & return & errmsg & " (" & errnum & ")"
   tell me
       activate
       display dialog errmsg buttons {"OK"} default button 1 with icon stop with title mytitle
   end tell
end dsperrmsg
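
As a side note on the appisrunning handler: AppleScript 2.0, which ships with Mac OS X 10.5, gives application objects a running property, so the check can also be done without asking System Events. The lines below are only a sketch of that alternative for the closeskim flag and are not taken from the script above.

```applescript
-- Sketch of an alternative check (assumes AppleScript 2.0 or later).
-- Reading the running property does not launch Skim.
set closeskim to not (application "Skim" is running)
```

The System Events version in the script remains the safer choice if the droplet also has to run on systems older than Mac OS X 10.5.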

Last edited by Martin Michel (2008-04-30 06:39:48 am)


Sal Soghoian is my role model.

Filed under: PDF, text, conversion, Skim, System
