How to check if a PDF file has OCR?

membranatus · April 4, 2009, 10:59pm

Hi all!
I have a huge collection of PDF files (200+) and it may get bigger with time.
Some of these files have been OCRized and are great to read and very lightweight, since they have been optimized. Some of them are pure images of scanned pages and are huge and very slow to handle. Still others have been OCRized but not optimized, so they are as slow and heavy as the non-OCR PDFs.
So, my question is: how can I make a script that checks these files to see whether or not they have embedded type (meaning OCR, right?) and then labels them somehow? I was thinking about OS X color labels: something like Red for non-OCR files, Orange for OCR but heavy files, Green for OCR and optimized files.

I think I can handle everything in this script except one little thing: recognizing which files have been OCRized.

I’ve looked in the dictionaries of Image Events, Preview, Adobe Acrobat Pro and Adobe Reader, but I found nothing.
Any ideas on this?

Thanks to all!

Mark67 · April 6, 2009, 9:26am

I’m not even sure I’m close to being on the right track here, so you would need to take a good look at your files to check.
I think that if you have used the OCR in Acrobat, then the file’s encoder becomes “Paper Capture Plug-In”. I only did a very small test here.
For the rest, you would just need to look at the file size.

set This_PDF to choose file without invisibles
--
if PDF_Encoder(This_PDF) then
	tell application "Adobe Acrobat 7.0 Professional"
		activate
		open This_PDF
		tell active doc
			(* with timeout of 180 seconds
				tell application "System Events"
					tell process "Acrobat"
						tell menu bar 1
							tell menu bar item "Document"
								tell menu 1
									tell menu item "Recognize Text Using OCR"
										tell menu 1
											click menu item "Start..."
											delay 1
											keystroke return
										end tell
									end tell
								end tell
							end tell
						end tell
					end tell
				end tell
			end timeout *)
			close saving yes
		end tell
	end tell
end if
--
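on PDF_Encoder(This_PDF)
	-- Returns true when the file should be opened for OCR, i.e. when Spotlight
	-- does NOT report Acrobat's "Paper Capture Plug-In" as an encoder.
	try
		do shell script "/usr/bin/mdls -name kMDItemEncodingApplications" & space & quoted form of POSIX path of This_PDF
		if the result contains "Paper Capture Plug-In" then
			-- This file has OCR text???
			return false
		else
			-- This file does NOT have OCR text??? Open it.
			return true
		end if
	on error
		-- Write Log
		return false -- mdls failed; skip the file rather than leave the result undefined
	end try
end PDF_Encoder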

membranatus · April 11, 2009, 4:54pm

Hi,
I did several tests with your script (I added a Finder labeling part), but it just labeled every file (both OCR and non-OCR).
The thing is, I didn’t scan the files myself. They are a collection of files downloaded from different places, so I don’t know what software was used to create them.
That’s why I thought that checking whether a PDF file has embedded type or not would be the standard way to sort the files.

I still haven’t found a way to do this…

Any more ideas guys?

Thanks!!!
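
One possible way to test for embedded type, given here only as a sketch under assumptions: a PDF that is purely scanned images normally lists no fonts, while a PDF with an OCR text layer lists at least one. The pdffonts command-line tool (part of Xpdf and Poppler, not included with OS X) prints two header lines followed by one line per font, so counting the lines of its output gives a rough answer. The handler name PDF_Has_Fonts and the install path /usr/local/bin/pdffonts are assumptions, not something from the posts above; change the path to wherever pdffonts lives on your machine.

```applescript
-- Sketch only. Assumes pdffonts (Xpdf or Poppler) is installed at this path;
-- it does not ship with OS X.
property pdffontsPath : "/usr/local/bin/pdffonts"

set This_PDF to choose file without invisibles
set hasFonts to PDF_Has_Fonts(This_PDF)
if hasFonts is missing value then
	display dialog "Could not run pdffonts on this file."
else if hasFonts then
	display dialog "Fonts found: the file probably has a text layer."
else
	display dialog "No fonts found: the file is probably image-only."
end if

-- Hypothetical helper: true if pdffonts lists at least one font,
-- false if it lists none, missing value if the command fails.
on PDF_Has_Fonts(This_PDF)
	try
		set fontReport to do shell script quoted form of pdffontsPath & space & quoted form of POSIX path of This_PDF
	on error
		return missing value
	end try
	-- The first two lines of the report are the column header and a dashed rule.
	return (count of paragraphs of fontReport) > 2
end PDF_Has_Fonts
```

This is a heuristic rather than a true OCR check: a scanned file with a typed cover page or a text stamp would also report fonts, so it may be worth spot-checking a few files before labeling the whole collection.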