Inspired by a recent post on the AppleScript forums, I wrote a little AppleScript droplet that converts PDF documents to plain text using Skim, a free and versatile PDF viewer.
This PDF2TXT conversion is useful when you only want the text of a PDF file and don’t need the fancy images and stylish layout. For example, I often have to extract text from PDF catalogs and brochures sent by our suppliers to populate our internal databases with the corresponding product data (CAS number, chemical synonyms, functions, etc.).
Please note that you need to install Skim on your Mac to successfully run the AppleScript, which I named Skimiks (yes, a palindrome).
The code was tested on Mac OS X 10.5.2 with Skim 1.1.2.
To use Skimiks on your fine Mac, choose ‘Open this Scriplet in your Editor’ below and then save the script as an Application bundle.
After dropping a bunch of PDF documents on the Skimiks droplet you can choose whether you want to create one text file per page or per PDF document.
The text files use UTF-8 text encoding, are always created in the parent folder of the PDF document, and follow this naming scheme:
One text file per page: «pdffilename_pageno.txt»
One text file per PDF document: «pdffilename.txt»
Existing text files are not replaced.
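The naming scheme above can be illustrated with a small standalone sketch. The handler name txtfilenamefor and the sample file name are hypothetical and not part of Skimiks; the sketch also assumes the PDF name ends with a four-character ".pdf" extension, as the droplet itself does.

```applescript
-- Hypothetical helper, for illustration only (not part of Skimiks):
-- builds the text file name for a given PDF file name.
-- Pass a page number for "one per page" mode,
-- or missing value for "one per PDF" mode.
on txtfilenamefor(pdfname, pageno)
    -- strip the ".pdf" extension (the last four characters)
    set basename to text 1 thru -5 of pdfname
    if pageno is missing value then
        return basename & ".txt"
    else
        return basename & "_" & pageno & ".txt"
    end if
end txtfilenamefor

txtfilenamefor("catalog.pdf", 3) -- "catalog_3.txt"
txtfilenamefor("catalog.pdf", missing value) -- "catalog.txt"
```

Pasted into Script Editor, the last line determines the result shown, so swap the two calls to see the other form.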
-- created: 04.04.2008
-- tested on:
-- • Mac OS X 10.5.2
-- • Skim 1.1.2
-- This script converts dropped PDF documents to plain text files with UTF-8
-- text encoding using the free PDF viewer Skim, available at <http://skim-app.sourceforge.net>.
--
-- You can choose whether you want to create one text file per page
-- or per PDF document.
--
-- The text files are always created in the
-- parent folder of the PDF document and feature the following naming
-- scheme: pdffilename_pageno.txt (one text file per page) or
-- pdffilename.txt (one text file per PDF document)
--
-- Existing text files are not replaced.
property mytitle : "Skimiks"
-- I am called when the user opens the script with a double-click
on run
    set infomsg to "I am a hungry AppleScript droplet!" & return & return & "Drop a bunch of PDF documents onto my icon to convert them to plain text files using Skim. The PDF files are not modified."
    my dspinfomsg(infomsg)
end run
-- I am called when the user drops Finder items onto the script icon
on open droppeditems
    set pdffilepaths to {}
    repeat with droppeditem in droppeditems
        if (droppeditem as Unicode text) ends with ".pdf" then
            set pdffilepaths to pdffilepaths & (droppeditem as Unicode text)
        end if
    end repeat
    if pdffilepaths is {} then
        set errmsg to "You did not drop any PDF documents onto me."
        my dsperrmsg(errmsg, "--")
    else
        set mode to my askformode()
        set closeskim to false
        if not my appisrunning("Skim") then
            set closeskim to true
        end if
        repeat with pdffilepath in pdffilepaths
            try
                my pdf2txt(pdffilepath, mode)
            on error errmsg number errnum
                my dsperrmsg(errmsg, errnum)
            end try
        end repeat
        if closeskim is true then
            tell application "Skim"
                quit
            end tell
        end if
    end if
end open
-- I am converting the PDF documents to plain text files using Skim
-- <http://skim-app.sourceforge.net>
on pdf2txt(pdffilepath, mode)
    set pdffileinfo to info for (pdffilepath as alias)
    -- strip the ".pdf" extension; a text range is used instead of a character list
    -- so the result does not depend on the current text item delimiters
    set pdffilename to text 1 thru -5 of (name of pdffileinfo)
    set parentfolderpath to my getparentfolderpath(pdffilepath)
    set txtfilecreated to false
    tell application "Skim"
        open (pdffilepath as alias)
        set pdfpages to get pages for document 1
        set countpdfpages to length of pdfpages
        repeat with i from 1 to countpdfpages
            set pdfpage to item i of pdfpages
            set pagetext to get text for pdfpage
            if mode is "oneperpage" then
                set txtfilepath to parentfolderpath & pdffilename & "_" & i & ".txt"
                if not my macitempathexists(txtfilepath) then
                    my writetofile(pagetext, txtfilepath, "write")
                end if
            else if mode is "oneperpdf" then
                set txtfilepath to parentfolderpath & pdffilename & ".txt"
                if i is equal to 1 then
                    if not my macitempathexists(txtfilepath) then
                        my writetofile(pagetext, txtfilepath, "write")
                        set txtfilecreated to true
                    end if
                else
                    -- append only if this run created the file on page 1
                    if txtfilecreated is true then
                        my writetofile(pagetext, txtfilepath, "append")
                    end if
                end if
            end if
        end repeat
        close document 1
    end tell
end pdf2txt
-- I am asking the user to choose a mode for the text file creation:
-- One text file per page or PDF document
on askformode()
    tell me
        activate
        display dialog "Do you want to create one text file per page or one per PDF document?" buttons {"Cancel", "One per page", "One per PDF"} default button 3 with icon (POSIX file "/Applications/TextEdit.app/Contents/Resources/txt.icns") with title mytitle
        set dlgresult to result
    end tell
    if button returned of dlgresult is "One per PDF" then
        return "oneperpdf"
    else if button returned of dlgresult is "One per page" then
        return "oneperpage"
    end if
end askformode
-- I am returning the parent folder path of a given item path (Mac)
-- (thanks to Peter Fischer from scriptmymac.de for this function!)
on getparentfolderpath(itempath)
    set itempath to itempath as Unicode text
    set olddelims to AppleScript's text item delimiters
    set AppleScript's text item delimiters to ":"
    set countitems to (count text items of itempath)
    set lastitem to the last text item of itempath
    if lastitem = "" then
        set countitems to countitems - 2
    else
        set countitems to countitems - 1
    end if
    set parentfolderpath to text 1 thru text item countitems of itempath & ":"
    set AppleScript's text item delimiters to olddelims
    return parentfolderpath
end getparentfolderpath
-- I am indicating if a given item path exists
on macitempathexists(macitempath)
    try
        set macitemalias to macitempath as alias
        return true
    on error
        return false
    end try
end macitempathexists
-- I am indicating if a given application is currently running or not
-- only the (full) application name must be given, e.g. "Address Book"
on appisrunning(appname)
    tell application "System Events"
        set processnames to name of every process
    end tell
    if appname is in processnames then
        return true
    else
        return false
    end if
end appisrunning
-- I am writing given content to a given file using UTF-8 text encoding
on writetofile(cont, filepath, mode)
    try
        set openfile to open for access filepath with write permission
        if mode is "write" then
            set eof of openfile to 0
            set BOM_UTF8 to ((ASCII character 239) & (ASCII character 187) & (ASCII character 191))
            write BOM_UTF8 to openfile
        else if mode is "append" then
            set eof of openfile to (get eof of openfile)
        end if
        write cont to openfile as «class utf8» starting at eof
        close access openfile
        return true
    on error
        try
            close access openfile
        end try
        return false
    end try
end writetofile
-- I am displaying info messages
on dspinfomsg(infomsg)
    tell me
        activate
        display dialog infomsg buttons {"OK"} default button 1 with icon note with title mytitle
    end tell
end dspinfomsg
-- I am displaying error messages
on dsperrmsg(errmsg, errnum)
    set errmsg to "Sorry, an error occurred:" & return & return & errmsg & " (" & errnum & ")"
    tell me
        activate
        display dialog errmsg buttons {"OK"} default button 1 with icon stop with title mytitle
    end tell
end dsperrmsg
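
The getparentfolderpath handler above relies on AppleScript's text item delimiters. As a minimal standalone sketch of that idiom, the snippet below splits a colon-separated path and rejoins everything except the last item. The sample path is hypothetical, and unlike the handler in Skimiks this simplified variant assumes a file path, not a folder path with a trailing colon.

```applescript
-- Hypothetical example path (HFS style, colon-separated); not taken from Skimiks
set itempath to "Macintosh HD:Users:me:Documents:catalog.pdf"

-- Remember the current delimiters so they can be restored afterwards
set olddelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to ":"

-- Drop the last text item (the file name) and rejoin the rest with ":"
set parentfolderpath to ((text items 1 thru -2 of itempath) as text) & ":"

-- Restore the delimiters so other code is not affected
set AppleScript's text item delimiters to olddelims

parentfolderpath -- "Macintosh HD:Users:me:Documents:"
```

Restoring the saved delimiters matters because they are a global setting: any later list-to-text coercion in the same script would otherwise be joined with colons.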