I installed the Unix command-line tool pdftotext on my OS X Yosemite MacBook. It works fine, converting PDF files to text.
My problem is how to use the converted TEXT file in the next step of my script.
Here is what I have for the conversion part:
set myfile to choose file with prompt "Choose PDF to convert to TEXT"
set myfile to myfile as alias
do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile
The above creates the text file and puts it in the directory from where the pdf file was chosen, but I cannot “grab” the text file to use it in the next step, which is a grep search.
I tried to set an object for the result:
set mytextfile to do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile
This did not work; nothing was returned. Any suggestions?
set myfile to choose file with prompt "Choose PDF to convert to TEXT"
set myfile to myfile as alias # Useless, myfile is already an alias
set newFile to myfile as text
set newFile to text 1 thru -4 of newFile & "txt"
do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile
set theText to read file newFile
Here it works.
But an alternative approach using ASObjC (handler borrowed from Shane STANLEY) may be a better answer.
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
on getTextFromPDF:posixPath
set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
return (thePDF's |string|()) as text
end getTextFromPDF:
set theText to its getTextFromPDF:(POSIX path of (choose file of type {"pdf"}))
ASObjC gives us the ability to use regular expressions too.
Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Wednesday 2 September 2015 10:59:37
It’s a little more complicated – you have to loop through the pages, getting |string|() of each page. Something like:
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
on getTextFromPDF:posixPath
set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
set noOfPages to thePDF's pageCount()
set theString to ""
repeat with i from 1 to noOfPages
set onePage to (thePDF's pageAtIndex:(i - 1)) -- zero-based indexes
set theString to theString & ((onePage's |string|()) as text) -- coerce the NSString to text before concatenating
end repeat
return theString
end getTextFromPDF:
set theText to its getTextFromPDF:(POSIX path of (choose file of type {"pdf"}))
Edit: Actually, Yvan’s code works fine. I’ll leave this here as an example of how to get the contents of particular pages.
A puzzling detail: when I select all in Preview, I get different statistics:
Characters: 432901
Spaces: 91342
Total: 524243
Words: 82354
Lines: 12946
I found the explanation for the difference. Some characters existing in the PDF, like backslashes or double quotes, are extracted by the script with a preceding backslash.
Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Wednesday 2 September 2015 15:55:55
If the original path is “path:to:the:pdf:file.pdf”, the instruction replaces the extension “pdf” with “txt” to build “path:to:the:pdf:file.txt”, which is the path to the text file created by the shell command.
So the script is able to read this newly created text file.
In fact, the script does exactly what this longer version does:
set myfile to choose file with prompt "Choose PDF to convert to TEXT"
set myfile to myfile as alias # Useless, myfile is already an alias
set newFile to myfile as text
set newFile to text 1 thru -4 of newFile & "txt"
do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile & " " & quoted form of POSIX path of newFile
set theText to read file newFile
I’m quite sure that we could append a cat command to the given shell instruction to read the text file, but I’m not a shell expert, so I don’t know the correct syntax.
In the script that follows, I also use the shell to read the text file but, as I wrote, I don’t know how to combine the two shell instructions into a single one.
set myfile to choose file with prompt "Choose PDF to convert to TEXT"
set myfile to myfile as alias # Useless, myfile is already an alias
set newFile to myfile as text
set newFile to text 1 thru -4 of newFile & "txt"
set qPosixNewFile to quoted form of POSIX path of newFile
do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile & " " & qPosixNewFile
set theText to do shell script "cat " & qPosixNewFile
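For what it is worth, the two shell instructions can probably be chained with &&, which runs the second command only if the first one succeeded. Here is a minimal sketch, assuming pdftotext is on the PATH; the function name pdf_to_text_and_cat is made up for illustration.

```shell
# Hypothetical helper: convert a PDF to a .txt file beside it, then print that text.
# Assumes pdftotext is reachable on PATH (e.g. /usr/local/bin is included).
pdf_to_text_and_cat() {
    pdf="$1"
    txt="${pdf%.*}.txt"    # swap the extension: file.pdf -> file.txt
    # && runs cat only when pdftotext exits successfully
    pdftotext "$pdf" "$txt" && cat "$txt"
}
```

Inside AppleScript, the same chaining would sit in a single do shell script string (the pdftotext command, then " && cat " and the quoted text path), so the value returned by do shell script is whatever cat prints.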
Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Wednesday 2 September 2015 18:34:18
Anyway, it got me off checking some documentation, and I realised that the contents can also be exported as .rtf:
use framework "Foundation"
use framework "Quartz"
use scripting additions
on exportRTFFromPDF:posixPath
-- make source and destination URLs
set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
set destURL to theURL's URLByDeletingPathExtension()'s URLByAppendingPathExtension:"rtf"
-- make PDF document
set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
-- set the entire contents as a styled string
set theAttributedString to thePDF's selectionForEntireDocument()'s attributedString()
-- save it to .rtf
set theLength to theAttributedString's |length|()
set docAttributes to current application's NSDictionary's dictionaryWithObject:(current application's NSRTFTextDocumentType) forKey:(current application's NSDocumentTypeDocumentAttribute)
set rtfData to theAttributedString's RTFFromRange:(current application's NSMakeRange(0, theLength)) documentAttributes:docAttributes
rtfData's writeToURL:destURL atomically:true
end exportRTFFromPDF:
The best thing to do is extract the text to STDOUT.
pdftotext -layout <path> -
See the hyphen on the end? That’s what does it.
Otherwise pdftotext defaults to creating a new text file next to the existing PDF file.
-----------------------------------------------------------------------
set testFile to "~/test_directory/pdf_test_files/test.pdf"
-----------------------------------------------------------------------
tell application "System Events" to set testFile to POSIX path of disk item testFile
set shCMD to text 2 thru -1 of "
export PATH=/opt/local/bin:/opt/local/sbin:/usr/local/bin:$PATH;
pdftotext -layout " & (quoted form of testFile) & " - ;
"
set convertedText to do shell script shCMD
-----------------------------------------------------------------------
This lets you do whatever you want with the extracted text, and it preserves the layout far better than ASObjC.
It also marks the pages, so you can process them individually if needed (there’s a switch described in the man page to turn that off if desired).
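The page marks are form feed characters, and the switch that turns them off is -nopgbrk. As a rough sketch of per-page processing, assuming pdftotext is on the PATH, the output can be split on form feeds with awk; the function name pdf_page is invented for this example.

```shell
# Hypothetical helper: print the text of page N of a PDF.
# Relies on pdftotext writing a form feed (\f) after each page.
pdf_page() {
    pdf="$1"
    page="$2"
    # Treat each form-feed-terminated chunk as one awk record, print record N
    pdftotext -layout "$pdf" - | awk -v n="$page" 'BEGIN { RS = "\f" } NR == n { printf "%s", $0 }'
}
```

Called as pdf_page test.pdf 2, this would print only the second page. pdftotext also accepts -f and -l to limit the page range at conversion time, which may be simpler when only one page is needed.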
Thanks for your reply, Chris.
I do not fully understand how to use the result of STDOUT; it is a subject I will have to read up on.
I saw that option in the pdftotext man page, but I still have trouble understanding where the actual output sent to STDOUT ends up and how to grab and use it in the next step.
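On where STDOUT ends up: it is not stored in a file. It is a stream that goes to whatever started the command. In AppleScript, do shell script returns that stream as its result, which is why convertedText in the script above holds the text. In the shell, a pipe hands it to the next command. Here is a small sketch of the grep step under those assumptions; pdf_grep is a hypothetical name and pdftotext is assumed to be on the PATH.

```shell
# Hypothetical helper: search the text of a PDF without creating a .txt file.
pdf_grep() {
    pattern="$1"
    pdf="$2"
    # "-" as the output name makes pdftotext write to STDOUT;
    # the pipe feeds that stream straight into grep, which prints
    # matching lines prefixed with their line numbers (-n).
    pdftotext -layout "$pdf" - | grep -n -- "$pattern"
}
```

One thing to watch for: grep exits with a non-zero status when nothing matches, and do shell script treats a non-zero status as an error, so a search with no hits would need a try block on the AppleScript side.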