How to OCR a scanned PDF using PDFKit on macOS Sonoma or Sequoia

On macOS Sonoma (14.7.4) or Sequoia (currently 15.3.2), you can open a scanned PDF in Preview, choose Export…, and select Embed text in the export panel. Saving the PDF then performs OCR on its content, which makes the text selectable and searchable.

The same option is available through PDFKit. Apple's Swift documentation lists it as saveTextFromOCROption under PDFDocumentWriteOption in PDFDocument's Write Operations, but the Objective-C counterpart docs omit it. It still works in ASObjC (ASOC), though.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "PDFKit"
use scripting additions

property ca : current application
property SUFFIX : "_withOCR"

-- select a PDF that you know has scanned text that has not been OCR'd
set thisPDF to POSIX path of (choose file of type "PDF") as text
-- build the output path: same folder and name, with SUFFIX inserted before the extension
set ext to (ca's NSString's stringWithString:thisPDF)'s pathExtension()
set outPDF to (ca's NSString's stringWithString:thisPDF)'s stringByDeletingPathExtension()
set outPDF to outPDF's stringByAppendingString:SUFFIX
set outPDF to outPDF's stringByAppendingPathExtension:ext

set optDict to ca's NSDictionary's dictionaryWithObject:(ca's PDFDocumentWriteOption's saveTextFromOCROption) forKey:(ca's PDFDocumentWriteOption)

set pdf to ca's PDFDocument's alloc()'s initWithURL:(ca's NSURL's fileURLWithPath:thisPDF)
-- write outPDF as an OCR'd PDF
pdf's writeToFile:outPDF withOptions:optDict
return
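
The script above does not check whether the PDF actually opened or whether the write succeeded. A minimal sketch of a more defensive variant follows. The handler name writeOCRCopy and its parameters are hypothetical, and the option key is written in the Objective-C style (PDFDocumentSaveTextFromOCROption with a boolean value) that a reply further down this thread reports as working; treat it as one possible arrangement rather than the required one.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "PDFKit"
use scripting additions

-- Hypothetical helper: writes an OCR'd copy of inPath to outPath.
-- Raises an error if the input cannot be opened or the output cannot be written.
on writeOCRCopy(inPath, outPath)
	set ca to current application
	set inURL to ca's NSURL's fileURLWithPath:inPath
	set pdf to ca's PDFDocument's alloc()'s initWithURL:inURL
	-- initWithURL: returns nil (missing value) when the file is not a readable PDF
	if pdf is missing value then error "Could not open PDF: " & inPath
	set opts to ca's NSDictionary's dictionaryWithObject:true forKey:(ca's PDFDocumentSaveTextFromOCROption)
	-- writeToFile:withOptions: returns a BOOL indicating success
	set didWrite to (pdf's writeToFile:outPath withOptions:opts) as boolean
	if not didWrite then error "Could not write PDF: " & outPath
	return didWrite
end writeOCRCopy
```

Calling writeOCRCopy(thisPDF, outPDF as text) in place of the last three lines of the original script would surface a bad input file as an error instead of failing silently.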

This is great work @VikingOSX!

I had a failure on the first run due to multiple file extension delimiters in the random PDF I chose for testing (filename.compressed.pdf in this case). Not a common issue, I assume. Otherwise fast and accurate! Thanks for posting this.
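
The report does not say where the failure occurred, so the following is only a way to rule out the path handling. It is a sketch that isolates the output-path derivation in a handler (ocrOutputPath is a hypothetical name) and works on an NSURL, so a name with more than one dot can be checked on its own before any PDF is touched.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: returns "<folder>/<name><suffix>.<ext>" for the given POSIX path.
-- Only the final extension is treated as the extension; earlier dots stay in the name.
on ocrOutputPath(inPath, suffix)
	set inURL to current application's NSURL's fileURLWithPath:inPath
	set ext to inURL's pathExtension()
	set baseName to inURL's URLByDeletingPathExtension()'s lastPathComponent()
	set newName to (baseName's stringByAppendingString:suffix)'s stringByAppendingPathExtension:ext
	set outURL to inURL's URLByDeletingLastPathComponent()'s URLByAppendingPathComponent:newName
	return outURL's |path|() as text
end ocrOutputPath

ocrOutputPath("/tmp/filename.compressed.pdf", "_withOCR")
```

If the derived path looks right for the problem file, the failure is more likely in opening or writing the document than in the file name.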

Awesome!
The OCR text was generated so easily—I’m honestly amazed!
I’m gonna start using it right away. This is just perfect!
And the fact that this syntax works too makes it super handy.

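-- ocidActivDoc is an already opened PDFDocument and ocidSaveFilePathURL is the destination NSURL (both defined elsewhere)
set ocidOptionDict to current application's NSMutableDictionary's alloc()'s init()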
ocidOptionDict's setValue:(true) forKey:(current application's PDFDocumentOptimizeImagesForScreenOption)
ocidOptionDict's setValue:(true) forKey:(current application's PDFDocumentSaveTextFromOCROption)
set boolDone to ocidActivDoc's writeToURL:(ocidSaveFilePathURL) withOptions:(ocidOptionDict)

And it even works with screenshots!
Spotlight search is gonna love this! :laughing:
Huge thanks! :raised_hands:
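
The post does not show how the screenshots were processed. One possible route, assuming the screenshot is an image file such as a PNG, is to wrap it in a single-page PDFDocument and save that with the OCR option. The handler name imageToSearchablePDF and its parameters are hypothetical; this is a sketch, not necessarily what was done above.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "AppKit"
use framework "PDFKit"
use scripting additions

-- Hypothetical helper: turns one image file into a one-page PDF saved with the OCR option.
on imageToSearchablePDF(imagePath, outPath)
	set ca to current application
	set img to ca's NSImage's alloc()'s initWithContentsOfFile:imagePath
	-- initWithContentsOfFile: returns nil (missing value) if the image cannot be read
	if img is missing value then error "Could not read image: " & imagePath
	-- wrap the image in a PDF page and put that page into an empty document
	set imagePage to ca's PDFPage's alloc()'s initWithImage:img
	set doc to ca's PDFDocument's alloc()'s init()
	doc's insertPage:imagePage atIndex:0
	-- same option key as in the snippet above
	set opts to ca's NSMutableDictionary's alloc()'s init()
	opts's setValue:true forKey:(ca's PDFDocumentSaveTextFromOCROption)
	return (doc's writeToFile:outPath withOptions:opts) as boolean
end imageToSearchablePDF
```

The handler returns true when the file was written; whether the recognized text is good enough for Spotlight depends on the image.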

Hi @VikingOSX.

Since this topic’s an offer of working code that others might find useful and/or interesting (and two people have already found it so! :sunglasses:), I’ve moved it to our Code Exchange forum.

Nigel,

Thanks for moving this to the Code Exchange forum.