Using result of PDFTOTEXT shell script

I installed the Unix command-line tool pdftotext on my OS X Yosemite MacBook. It works fine, converting PDF files to text.

My problem is how to use the converted TEXT file in the next step of my script.
Here is what I have for the conversion part:

set myfile to choose file with prompt "Choose PDF to convert to TEXT"

set myfile to myfile as alias

do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile

The above creates the text file and puts it in the directory from where the pdf file was chosen, but I cannot “grab” the text file to use it in the next step, which is a grep search.

I tried to assign the result to a variable:

set mytextfile to do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile

This did not work; nothing was returned. Any suggestions?

Thanks, Bob R.

Are you sure that myfile in your second step is not the same as myfile in the first part?

I don’t have pdftotext installed, but would this be better?

First attempt:

set myfile to choose file with prompt "Choose PDF to convert to TEXT"

set myfile to myfile as alias # Useless, myfile is already an alias
set newFile to myfile as text
set newFile to text 1 thru -4 of newFile & "txt"

do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile
set theText to read file newFile

Here it works.

But an alternative scheme using ASObjC (handler borrowed from Shane STANLEY) may be a better answer.

use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

on getTextFromPDF:posixPath
	set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	return (thePDF's |string|()) as text
end getTextFromPDF:
set theText to its getTextFromPDF:(POSIX path of (choose file of type {"pdf"}))

ASObjC gives us the ability to use regular expressions too.

Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Wednesday 2 September 2015 10:59:37

It’s a little more complicated – you have to loop through the pages, getting |string|() of each page. Something like:

use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

on getTextFromPDF:posixPath
	set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	set noOfPages to thePDF's pageCount()
	set theString to ""
	repeat with i from 1 to noOfPages
		set onePage to (thePDF's pageAtIndex:(i - 1)) -- zero-based indexes
		set theString to theString & (onePage's |string|()) as text
	end repeat
	return theString
end getTextFromPDF:
set theText to its getTextFromPDF:(POSIX path of (choose file of type {"pdf"}))

Edit: Actually, Yvan’s code works fine. I’ll leave this here as an example of how to get the contents of particular pages.

Hello Shane.

I double-checked: the code I posted correctly extracted the entire text from the AppleScript Language Guide PDF.

tell application "Script Editor"
	choose file of type {"pdf"}
end tell
Result:
"AppleScript Language Guide

Contents
Introduction to AppleScript Language Guide 12 What Is AppleScript? 12
Who Should Read This Document? 13
Organization of This Document 13
Conventions Used in This Guide 14 See Also 15
AppleScript Lexical Conventions 16 Character Set 16
Identifiers 17
Keywords 17
Comments 19
The Continuation Character 19 Literals and Constants 20
Boolean 20 Constant 20 List 20 Number 20 Record 21 Text 21
Operators 21 Variables 22 Expressions 22 Statements 23 Commands 23 Results 24 Raw Codes 24
AppleScript Fundamentals 25 Script Editor Application 25 AppleScript and Objects 27
What Is in a Script Object 27 Properties 29
Elements 29
2013-10-22
| Copyright © 2013 Apple Inc. All Rights Reserved. 2
.
Index
of command say 196
web page (unsupported) 310 weekdayproperty 106
where reserved word 214, 215 while reserved word 256 white space attribute 245 white space constants 126 whose reserved word 215 whose
synonyms for 214 withclause 280
with icon parameter
of command display dialog 160 with parameters parameter
of command run script 194 with password parameter
of command mount volume 176 with prompt parameter
of command choose application 140 of command choose file 142
of command choose file name 144 of command choose folder 145
of command choose from list 147
of command choose remote application 149 with seed parameter
of command random number 188 with timeout control statement 272 with timeout statements 271, 273 with title parameter
of command choose application 140
of command choose from list 147
of command choose remote application 149 of command display dialog 160
of command display notification 162
with transaction control statement 273 withoutclause 280
wordelement 125
working with errors 301
writecommand 209
write permission parameter
of command open for access 178
Y
yearproperty 107
2013-10-22
| Copyright © 2013 Apple Inc. All Rights Reserved. 332

Characters: 434925
Spaces: 91341
Total: 526266

Words: 82354
Lines: 12946

A puzzling detail: when I select all in Preview, I get different statistics:

Characters: 432901
Spaces: 91342
Total: 524243

Words: 82354
Lines: 12946

I found the explanation for the difference: some characters in the PDF, such as backslashes and double quotes, are extracted by the script with a preceding backslash.

Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Wednesday 2 September 2015 15:55:55

set myfile to choose file with prompt "Choose PDF to convert to TEXT"

set myfile to myfile as alias # Useless, myfile is already an alias
set newFile to myfile as text
set newFile to text 1 thru -4 of newFile & "txt"

do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile
set theText to read file newFile

Thank you, Yvan.
You are correct about the redundant alias command…my bad.

What is the function of this command in your script?

set newFile to text 1 thru -4 of newFile & "txt"

Bob Rutledge

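If the original path is “path:to:the:pdf:file.pdf”, the instruction drops the last three characters (text 1 thru -4 keeps everything up to and including the dot) and appends “txt”. This replaces the extension “pdf” with “txt” and builds “path:to:the:pdf:file.txt”, which is the path to the text file created by the shell command.
So, the script is able to read this newly created text file.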
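
For anyone who prefers to do this step on the shell side, here is a minimal sketch of the same idea. The helper name txt_path_for is made up for illustration. It swaps whatever extension the path has for "txt" instead of counting characters, so it is not limited to three-letter extensions.

```shell
# Hypothetical helper (not part of pdftotext): print the path of the text
# file that pdftotext writes by default for a given PDF path.
txt_path_for() {
    # ${1%.*} removes the shortest trailing ".something" from the argument
    printf '%s\n' "${1%.*}.txt"
}

# Usage: txt_path_for "/path/to/the/pdf/file.pdf"
```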

In fact, the script does exactly what this longer version would do:

set myfile to choose file with prompt "Choose PDF to convert to TEXT"

set myfile to myfile as alias # Useless, myfile is already an alias
set newFile to myfile as text
set newFile to text 1 thru -4 of newFile & "txt"

do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile & " " & quoted form of POSIX path of newFile
set theText to read file newFile

I’m quite sure that we could append the cat command to the given shell instruction to read the text file, but I’m not a shell expert, so I don’t know the correct syntax.

Below, I also use the shell to read the text file but, as I wrote, I don’t know how to combine the two shell instructions into a single one.

set myfile to choose file with prompt "Choose PDF to convert to TEXT"

set myfile to myfile as alias # Useless, myfile is already an alias
set newFile to myfile as text
set newFile to text 1 thru -4 of newFile & "txt"
set qPosixNewFile to quoted form of POSIX path of newFile
do shell script "/usr/local/bin/pdftotext " & quoted form of POSIX path of myfile & " " & qPosixNewFile
set theText to do shell script "cat " & qPosixNewFile
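
The two shell instructions can be chained with &&, which runs the second command only if the first one succeeded. What follows is only a rough sketch: the function name convert_and_print is invented here, and it assumes pdftotext can be found on the PATH (inside do shell script you would normally give the full path, /usr/local/bin/pdftotext, as in the scripts above).

```shell
# Hypothetical wrapper: convert a PDF next to itself, then print the text.
# Assumes pdftotext can be found on the PATH.
convert_and_print() {
    pdf=$1
    txt=${pdf%.*}.txt                      # same folder, extension swapped to .txt
    pdftotext "$pdf" "$txt" && cat "$txt"  # cat runs only if the conversion succeeded
}

# Usage: convert_and_print "/path/to/file.pdf"
```

From AppleScript, the equivalent single instruction would be built as pdftotext <in> <out> && cat <out>, and do shell script returns whatever cat prints.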

Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Wednesday 2 September 2015 18:34:18

I should have known you would have tested the code before posting. You are correct; your original code works fine.

It’s not my code.
As I wrote, it was borrowed from you. To be precise, it’s on page 124 of Everyday AppleScriptObjC :wink:

Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Thursday 3 September 2015 12:09:18

My memory is getting worse :frowning:

Anyway, it got me checking some documentation, and I realised that the contents can also be exported as .rtf:

use framework "Foundation"
use framework "Quartz"
use scripting additions

on exportRTFFromPDF:posixPath
	-- make source and destination URLs
	set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
	set destURL to theURL's URLByDeletingPathExtension()'s URLByAppendingPathExtension:"rtf"
	-- make PDF document
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	-- set the entire contents as a styled string
	set theAttributedString to thePDF's selectionForEntireDocument()'s attributedString()
	-- save it to .rtf
	set theLength to theAttributedString's |length|()
	set docAttributes to current application's NSDictionary's dictionaryWithObject:(current application's NSRTFTextDocumentType) forKey:(current application's NSDocumentTypeDocumentAttribute)
	set rtfData to theAttributedString's RTFFromRange:(current application's NSMakeRange(0, theLength)) documentAttributes:docAttributes
	rtfData's writeToURL:destURL atomically:true
end exportRTFFromPDF:

Thanks for this interesting new handler.

Yvan KOENIG running Yosemite 10.10.5 in French (VALLAURIS, France) Thursday 3 September 2015 17:00:45

Hey Bob,

The best thing to do is extract the text to STDOUT.

pdftotext -layout <path> -

See the hyphen on the end? That’s what does it.

Otherwise pdftotext defaults to creating a new text file next to the existing PDF file.


-----------------------------------------------------------------------
set testFile to "~/test_directory/pdf_test_files/test.pdf"
-----------------------------------------------------------------------
tell application "System Events" to set testFile to POSIX path of disk item testFile
set shCMD to text 2 thru -1 of "
export PATH=/opt/local/bin:/opt/local/sbin:/usr/local/bin:$PATH;
pdftotext -layout " & (quoted form of testFile) & " - ;
"
set convertedText to do shell script shCMD
-----------------------------------------------------------------------

This lets you do whatever you want with the extracted text, and it preserves the layout far better than ASObjC.

It also marks the page breaks (with form feed characters), so you can process the pages individually if needed (the -nopgbrk switch described in the man page turns that off if desired).
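
Assuming the page marks are the form feed characters (ASCII 12) that pdftotext puts between pages, a small sketch for pulling one page out of the converted text could look like this. The name page_of_text is made up for the example, and the text is read from standard input.

```shell
# Hypothetical helper: print page N (1-based) of pdftotext output read
# from stdin. Assumes pages are separated by form feed characters.
page_of_text() {
    awk -v n="$1" 'BEGIN { RS = "\f" } NR == n { printf "%s", $0 }'
}

# Usage: pdftotext -layout "/path/to/file.pdf" - | page_of_text 2
```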


Chris


{ MacBookPro6,1 · 2.66 GHz Intel Core i7 · 8GB RAM · OSX 10.11.1 }
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Thanks for your reply, Chris.
I do not fully understand how to use the result of STDOUT; it is a subject I will have to read up on.

I saw that as an option in the pdftotext man page, but I still have trouble understanding where the actual product of STDOUT is and how to grab and use it in the next step.

Bob

Hey Bob,

When the command is run from the Terminal, whatever it writes to STDOUT is printed in the Terminal window.

When it is run from an AppleScript do shell script command, that same output is returned to AppleScript as the command’s result.

set convertedText to do shell script shCMD

Here I’m showing you EXACTLY how to get the output into a variable for further processing.
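
Since the original goal was a grep search, the same STDOUT idea also works entirely on the shell side: the text can be piped straight into grep without creating a file. This is a minimal sketch, assuming pdftotext is on the PATH; the name grep_pdf is invented for this example.

```shell
# Hypothetical helper: search the text of a PDF without an intermediate file.
# $1 = grep pattern, $2 = path to the PDF. Assumes pdftotext is on the PATH.
grep_pdf() {
    # "-" sends the converted text to STDOUT; grep -n prefixes line numbers
    pdftotext -layout "$2" - | grep -n -- "$1"
}

# Usage: grep_pdf "with timeout" "/path/to/file.pdf"
```

Wrapped in do shell script, whatever grep prints comes back as the AppleScript result. Note that grep exits with a non-zero status when nothing matches, which do shell script reports as an error, so a try block may be needed around it.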

-Chris

That works, Chris. Many thanks!

Bob R.