Batch convert PDF documents to plain text

Inspired by a recent post on the AppleScript forums, I wrote a little AppleScript droplet that converts PDF documents to plain text using the free and versatile PDF viewer Skim.

This PDF2TXT conversion is useful when you just want to extract the text from a PDF file and don’t need the fancy images and stylish layout. For example, I often have to extract text from PDF catalogs and brochures sent by our suppliers to populate our internal databases with the corresponding product data (CAS number, chemical synonyms, functions, etc.).

Please note that you need to install Skim on your Mac to successfully run the AppleScript, which I named Skimiks (yes, a palindrome).

The code was tested on Mac OS X 10.5.2 with Skim 1.1.2.

To use Skimiks on your fine Mac, choose ‘Open this Scriplet in your Editor’ below and then save the script as an application bundle.

After dropping a bunch of PDF documents on the Skimiks droplet you can choose whether you want to create one text file per page or per PDF document.

The text files (UTF-8 encoded) are always created in the parent folder of the PDF document and are named as follows:

One text file per page: «pdffilename_pageno.txt»
One text file per PDF document: «pdffilename.txt»

Existing text files are not replaced.
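
The naming scheme above can be sketched as a small stand-alone handler. This is only an illustration: the handler name txtpathfor and the example folder path are made up for this sketch and are not part of Skimiks, which builds the paths inline. It assumes the parent folder is given as a colon-delimited Mac path ending in ":".

```applescript
-- Hypothetical helper (not part of Skimiks): builds the text file path
-- for a PDF name without its ".pdf" suffix.
-- Pass a page number for "one file per page", or missing value for
-- "one file per PDF document".
on txtpathfor(parentfolderpath, pdffilename, pageno)
	if pageno is missing value then
		return parentfolderpath & pdffilename & ".txt"
	else
		return parentfolderpath & pdffilename & "_" & pageno & ".txt"
	end if
end txtpathfor

-- Example folder path is an assumption for illustration only
txtpathfor("Macintosh HD:Docs:", "catalog", 3)
```

Passing missing value instead of 3 yields the per-document name. Because a file that already exists is skipped rather than overwritten, running the droplet twice on the same PDF leaves the first set of text files untouched.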


-- created: 04.04.2008
-- tested on:
-- • Mac OS X 10.5.2
-- • Skim 1.1.2

-- This script converts dropped PDF documents to plain text files with UTF-8
-- text encoding using the free PDF viewer Skim, available at <http://skim-app.sourceforge.net>.
-- 
-- You can choose whether you want to create one text file per page
-- or per PDF document.
--
-- The text files are always created in the
-- parent folder of the PDF document and feature the following naming
-- scheme: pdffilename_pageno.txt (one text file per page) or
-- pdffilename.txt (one text file per PDF document)
--
-- Existing text files are not replaced.

property mytitle : "Skimiks"

-- I am called when the user opens the script with a double-click
on run
	set infomsg to "I am a hungry AppleScript droplet!" & return & return & "Drop a bunch of PDF documents onto my icon to convert them to plain text files using Skim. The PDF files are not modified."
	my dspinfomsg(infomsg)
end run

-- I am called when the user drops Finder items onto the script icon
on open droppeditems
	set pdffilepaths to {}
	repeat with droppeditem in droppeditems
		if (droppeditem as Unicode text) ends with ".pdf" then
			set pdffilepaths to pdffilepaths & (droppeditem as Unicode text)
		end if
	end repeat
	if pdffilepaths is {} then
		set errmsg to "You did not drop any PDF documents onto me."
		my dsperrmsg(errmsg, "--")
	else
		set mode to my askformode()
		set closeskim to false
		if not my appisrunning("Skim") then
			set closeskim to true
		end if
		repeat with pdffilepath in pdffilepaths
			try
				my pdf2txt(pdffilepath, mode)
			on error errmsg number errnum
				my dsperrmsg(errmsg, errnum)
			end try
		end repeat
		if closeskim is true then
			tell application "Skim"
				quit
			end tell
		end if
	end if
end open

-- I am converting the PDF documents to plain text files using Skim
-- <http://skim-app.sourceforge.net>
on pdf2txt(pdffilepath, mode)
	set pdffileinfo to info for (pdffilepath as alias)
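	-- strip the ".pdf" suffix; a text range (unlike a character list coerced
	-- to text) does not depend on the current text item delimiters
	set pdffilename to text 1 thru -5 of (name of pdffileinfo)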
	set parentfolderpath to my getparentfolderpath(pdffilepath)
	set txtfilecreated to false
	tell application "Skim"
		open (pdffilepath as alias)
		set pdfpages to get pages for document 1
		set countpdfpages to length of pdfpages
		repeat with i from 1 to countpdfpages
			set pdfpage to item i of pdfpages
			set pagetext to get text for pdfpage
			if mode is "oneperpage" then
				set txtfilepath to parentfolderpath & pdffilename & "_" & i & ".txt"
				if not my macitempathexists(txtfilepath) then
					my writetofile(pagetext, txtfilepath, "write")
				end if
			else if mode is "oneperpdf" then
				set txtfilepath to parentfolderpath & pdffilename & ".txt"
				if i is equal to 1 then
					if not my macitempathexists(txtfilepath) then
						my writetofile(pagetext, txtfilepath, "write")
						set txtfilecreated to true
					end if
				else
					if txtfilecreated is true then
						my writetofile(pagetext, txtfilepath, "append")
					end if
				end if
			end if
		end repeat
		close document 1
	end tell
end pdf2txt

-- I am asking the user to choose a mode for the text file creation:
-- One text file per page or PDF document
on askformode()
	tell me
		activate
		display dialog "Do you want to create one text file per page or one per PDF document?" buttons {"Cancel", "One per page", "One per PDF"} default button 3 with icon (POSIX file "/Applications/TextEdit.app/Contents/Resources/txt.icns") with title mytitle
		set dlgresult to result
	end tell
	if button returned of dlgresult is "One per PDF" then
		return "oneperpdf"
	else if button returned of dlgresult is "One per page" then
		return "oneperpage"
	end if
end askformode

-- I am returning the parent folder path of a given item path (Mac)
-- (thanks to Peter Fischer from scriptmymac.de for this function!)
on getparentfolderpath(itempath)
	set itempath to itempath as Unicode text
	set olddelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ":"
	set countitems to (count text items of itempath)
	set lastitem to the last text item of itempath
	if lastitem = "" then
		set countitems to countitems - 2
	else
		set countitems to countitems - 1
	end if
	set parentfolderpath to text 1 thru text item countitems of itempath & ":"
	set AppleScript's text item delimiters to olddelims
	return parentfolderpath
end getparentfolderpath

-- I am indicating if a given item path exists
on macitempathexists(macitempath)
	try
		set macitemalias to macitempath as alias
		return true
	on error
		return false
	end try
end macitempathexists

-- I am indicating if a given application is currently running or not
-- only the (full) application name must be given, e.g. "Address Book"
on appisrunning(appname)
	tell application "System Events"
		set processnames to name of every process
	end tell
	if appname is in processnames then
		return true
	else
		return false
	end if
end appisrunning

-- I am writing given content to a given file using UTF-8 text encoding
on writetofile(cont, filepath, mode)
	try
		set openfile to open for access filepath with write permission
		if mode is "write" then
			set eof of openfile to 0
			set BOM_UTF8 to ((ASCII character 239) & (ASCII character 187) & (ASCII character 191))
			write BOM_UTF8 to openfile
		else if mode is "append" then
			set eof of openfile to (get eof of openfile)
		end if
		write cont to openfile as «class utf8» starting at eof
		close access openfile
		return true
	on error
		try
			close access openfile
		end try
		return false
	end try
end writetofile

-- I am displaying info messages
on dspinfomsg(infomsg)
	tell me
		activate
		display dialog infomsg buttons {"OK"} default button 1 with icon note with title mytitle
	end tell
end dspinfomsg

-- I am displaying error messages
on dsperrmsg(errmsg, errnum)
	set errmsg to "Sorry, an error occurred:" & return & return & errmsg & " (" & errnum & ")"
	tell me
		activate
		display dialog errmsg buttons {"OK"} default button 1 with icon stop with title mytitle
	end tell
end dsperrmsg