Hi all!
I have a huge collection of PDF files (200+), and it may grow over time.
Some of these files have been OCRized and are great to read and very lightweight, since they have been optimized. Some of them are pure images of scanned pages and are huge and very slow to handle. Still others have been OCRized but not optimized, so they are as slow and heavy as the non-OCR PDFs.
So, my question is: how can I make a script that checks these files to see whether or not they have embedded text (meaning OCR, right?) and then labels them somehow? (I was thinking about OS X color labels; something like Red for non-OCR files, Orange for OCR but heavy files, Green for OCR and optimized files.)
I think I can handle everything in this script except one little thing: recognizing the OCRized files.
I’ve looked in the Image Events, Preview, Adobe Acrobat Pro and Adobe Reader dictionaries, but I found nothing.
Any ideas on this?
I’m not even sure if I’m close to being on the right track here, so you would need to take a good look at your files to check.
I think that if you have used the OCR in Acrobat, then the file’s encoding application (the kMDItemEncodingApplications metadata) becomes “Paper Capture Plug-In”. I only did a very small test here.
For the rest, you would just need to look at the file size.
set This_PDF to choose file without invisibles
--
if PDF_Encoder(This_PDF) then
	tell application "Adobe Acrobat 7.0 Professional"
		activate
		open This_PDF
		tell active doc
			(* with timeout of 180 seconds
				tell application "System Events"
					tell process "Acrobat"
						tell menu bar 1
							tell menu bar item "Document"
								tell menu 1
									tell menu item "Recognize Text Using OCR"
										tell menu 1
											click menu item "Start..."
											delay 1
											keystroke return
										end tell
									end tell
								end tell
							end tell
						end tell
					end tell
				end tell
			end timeout *)
			close saving yes
		end tell
	end tell
end if
--
-- Returns true when the file should be opened in Acrobat for OCR, false otherwise.
on PDF_Encoder(This_PDF)
	try
		do shell script "/usr/bin/mdls -name kMDItemEncodingApplications" & space & quoted form of POSIX path of This_PDF
		if the result contains "Paper Capture Plug-In" then
			-- This file has OCR text???
			return false
		else
			-- This file does NOT have OCR text??? Open it.
			return true
		end if
	on error
		-- Write Log
		-- Return false so that a failed mdls call does not open the file in Acrobat.
		return false
	end try
end PDF_Encoder
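The file-size check and the Red/Orange/Green labels from the question could be sketched roughly as below. This is only a sketch under assumptions: the handler names `Label_For` and `Label_PDF` and the `heavy_bytes` threshold are made up for illustration, the `has_text` flag has to come from whatever detection you end up trusting, and the label numbers follow the Finder's usual order as far as I know (1 = Orange, 2 = Red, 6 = Green).

```applescript
-- Hypothetical threshold: anything above roughly 1 MB counts as "heavy". Tune it for your files.
property heavy_bytes : 1000000

-- Pick a Finder label index from the two facts we know about a file.
on Label_For(has_text, file_size)
	if not has_text then
		return 2 -- Red: no OCR text
	else if file_size > heavy_bytes then
		return 1 -- Orange: OCR text, but heavy
	else
		return 6 -- Green: OCR text and lightweight
	end if
end Label_For

-- Read the file size and apply the matching label in the Finder.
on Label_PDF(This_PDF, has_text)
	set file_size to size of (info for This_PDF)
	set label_number to Label_For(has_text, file_size)
	tell application "Finder" to set label index of (This_PDF as alias) to label_number
end Label_PDF
```

A call would look like `Label_PDF(This_PDF, not PDF_Encoder(This_PDF))`, since `PDF_Encoder` returns true for files that still need OCR. A raw byte count ignores the number of pages, so a per-page figure may separate "heavy" from "optimized" better.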
Hi,
I did several tests with your script (I added a Finder labeling part), but it just labeled every file (both OCR and non-OCR).
The thing is, I didn’t scan the files myself; they are a collection of files downloaded from different places, so I don’t know what software was used to create them.
That’s why I thought that checking whether a given PDF file has embedded text or not would be the standard way to sort the files.
I still haven’t found a way to do this…
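One direction that might be worth trying, as an untested heuristic rather than a standard method: text in a PDF is drawn with fonts, and pages that use fonts list them under a `/Font` key in their resources, while image-only scans usually have none. The sketch below simply counts lines containing `/Font` in the raw file. The handler name `PDF_Has_Fonts` is made up for illustration, and the check will miss files whose dictionaries are stored in compressed object streams (possible from PDF 1.5 on), so treat a "no" as "probably not" at best.

```applescript
-- Heuristic: returns true when the raw PDF bytes mention a /Font resource.
on PDF_Has_Fonts(This_PDF)
	-- grep exits with status 1 when nothing matches, which do shell script
	-- would report as an error, so "|| true" lets a count of 0 come back normally.
	set shell_command to "LC_ALL=C /usr/bin/grep -a -c '/Font' " & quoted form of POSIX path of This_PDF & " || true"
	set match_count to (do shell script shell_command) as integer
	return match_count > 0
end PDF_Has_Fonts
```

It can be called as `PDF_Has_Fonts(This_PDF)` on the alias returned by `choose file`. Because it does not depend on which program produced the file, it may cope better with a mixed collection than the encoder check, but it is worth verifying against a few files whose status you already know.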