Optical Character Recognition (OCR) Script

The following is my first attempt to get text from a PDF with OCR. As presently written, it works with one page only and operates by converting the PDF page into image data, which is then converted to text with the Vision framework.

This works with a PDF that contains easily recognized text but returns poor results with anything else. I’ll work on that.

use framework "AppKit" -- for NSImage
use framework "Foundation"
use framework "Quartz"
use framework "Vision"
use scripting additions

on main()
	set thePage to 1
	set theFile to POSIX path of (choose file of type {"pdf"})
	set imageData to getImageData(theFile, thePage)
	set theText to getText(imageData)
end main

on getImageData(theFile, thePage)
	set theFile to current application's |NSURL|'s fileURLWithPath:(theFile)
	set theDocument to current application's PDFDocument's alloc()'s initWithURL:theFile
	set thePage to (theDocument's pageAtIndex:(thePage - 1))
	set theData to (current application's NSImage's alloc()'s initWithData:(thePage's dataRepresentation()))
	return theData's TIFFRepresentation()
end getImageData

on getText(imageData)
	set requestHandler to current application's VNImageRequestHandler's alloc()'s initWithData:imageData options:(missing value)
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	theRequest's setUsesLanguageCorrection:false
	requestHandler's performRequests:(current application's NSArray's arrayWithObject:(theRequest)) |error|:(missing value)
	set theResults to theRequest's results()
	set theArray to current application's NSMutableArray's new()
	repeat with anObservation in theResults
		set theResult to ((anObservation's topCandidates:1)'s objectAtIndex:0)'s |string|()
		(theArray's addObject:theResult)
	end repeat
	return (theArray's componentsJoinedByString:linefeed) as text
end getText

main()

Hi Peavine
This script works great in my workflow. I’m just wondering whether it can OCR a table (PDF or image) and output the result as CSV?
Cheers

I’d suggest giving it a try. It might happen that the OCR works by columns, though. And the process will probably not be very robust, depending on the content of the table cells.

In which order – first cell in first row, 2nd cell in first row … or first cell in first row, first cell in 2nd row?
I’m asking because in many PDFs I receive, it’s the second one.

Edit: Not really important, though: one could use the rectangle associated with each text fragment to sort the text whichever way is needed.

@One208. My PDF script (and my script in post 4) can be modified to return CSV text from a table in a PDF by replacing line 1 below with line 2:

-- PDF Script
return (theArray's componentsJoinedByString:linefeed) as text
return (theArray's componentsJoinedByString:",") as text

-- Script in Post 4
return (theArray's componentsJoinedByString:linefeed)
return (theArray's componentsJoinedByString:",")

I tested this with a Numbers spreadsheet that I saved as a PDF, and the order of the returned text was cells A1 to A10, B1 to B10, and so on. This would not seem to be very useful, but I don’t know how to change that.

Right now I’m working to get more accurate results with my PDF OCR script (which is very substandard). Afterwards, I’ll look at the manner and order in which text is returned from an image or PDF.

By looking at the bounding box of each text fragment and sorting them by y, then x coordinates. It’s feasible, but it might be a bit convoluted in AppleScript.
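
To make the y-then-x idea concrete, here is a minimal sketch in plain JavaScript rather than AppleScript. It assumes each fragment has already been reduced to a string plus the x and y of its lower left corner; the field names and sample values are made up for illustration. Vision uses normalized coordinates with the origin at the lower left, so the sort puts larger y values first.

```javascript
// Sketch only: `sample` stands in for OCR results; the field names are assumptions.
// A larger y means the fragment sits higher on the page (origin is at the lower left).
function sortByReadingOrder(fragments) {
  // Copy the array so the caller's data is left untouched.
  return [...fragments].sort((a, b) => {
    if (a.y !== b.y) return b.y - a.y; // top of the page first
    return a.x - b.x; // then left to right
  });
}

const sample = [
  { string: "B1", x: 0.5, y: 0.9 },
  { string: "A2", x: 0.1, y: 0.8 },
  { string: "A1", x: 0.1, y: 0.9 },
  { string: "B2", x: 0.5, y: 0.8 }
];
console.log(sortByReadingOrder(sample).map(f => f.string).join(" ")); // A1 B1 A2 B2
```

Real OCR output rarely gives two fragments exactly the same y, so in practice the comparison would need some tolerance before falling back to x.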


The following is my revised script that performs OCR on one page of a PDF. This generally works well, but there were instances in my testing where the script found no text. One workaround is to use the screencapture script in post 4; another is to save the PDF as a PNG image and to use the script in post 1 above.

use framework "AppKit"
use framework "Foundation"
use framework "Quartz"
use framework "Vision"
use scripting additions

set pageNumber to 1 -- user set as desired
set imageResolution to 300 -- user test different values
set theFile to POSIX path of (choose file of type {"com.adobe.pdf"})
set imageData to getImageData(theFile, pageNumber, imageResolution)
set theText to getText(imageData)

on getImageData(theFile, pageNumber, thePPI) -- based on a handler by Shane Stanley
	set theFile to current application's |NSURL|'s fileURLWithPath:theFile
	set theDocument to current application's PDFDocument's alloc()'s initWithURL:theFile
	set thePage to (theDocument's pageAtIndex:(pageNumber - 1))
	set pageSize to (thePage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	set pageWidth to current application's NSWidth(pageSize)
	set pageHeight to current application's NSHeight(pageSize)
	set pixelWidth to (pageWidth * thePPI / 72) div 1
	set pixelHeight to (pageHeight * thePPI / 72) div 1
	set pdfImageRep to (current application's NSPDFImageRep's imageRepWithData:(thePage's dataRepresentation()))
	set newRep to (current application's NSBitmapImageRep's alloc()'s initWithBitmapDataPlanes:(missing value) pixelsWide:pixelWidth pixelsHigh:pixelHeight bitsPerSample:8 samplesPerPixel:4 hasAlpha:yes isPlanar:false colorSpaceName:(current application's NSDeviceRGBColorSpace) bytesPerRow:0 bitsPerPixel:32)
	current application's NSGraphicsContext's saveGraphicsState()
	current application's NSGraphicsContext's setCurrentContext:(current application's NSGraphicsContext's graphicsContextWithBitmapImageRep:newRep)
	pdfImageRep's drawInRect:{origin:{x:0, y:0}, |size|:{width:pixelWidth, height:pixelHeight}} fromRect:(current application's NSZeroRect) operation:(current application's NSCompositeSourceOver) fraction:1.0 respectFlipped:false hints:(missing value)
	current application's NSGraphicsContext's restoreGraphicsState()
	return newRep's TIFFRepresentation()
end getImageData

on getText(imageData)
	set requestHandler to current application's VNImageRequestHandler's alloc()'s initWithData:imageData options:(missing value)
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	requestHandler's performRequests:(current application's NSArray's arrayWithObject:(theRequest)) |error|:(missing value)
	set theResults to theRequest's results()
	set theArray to current application's NSMutableArray's new()
	repeat with aResult in theResults
		(theArray's addObject:(((aResult's topCandidates:1)'s objectAtIndex:0)'s |string|()))
	end repeat
	return (theArray's componentsJoinedByString:linefeed) as text
end getText

Thanks for responding. I look forward to the update. Have a good day.

I spent some time on this and couldn’t get it to work. A sticking point for me was getting the coordinates of each corner of each bounding box. However, even if I had accomplished this, I think the approach I envisioned would be so long and involved and unreliable as not to be worth the effort.

I’d use only the lower left corner, not all of them. Assuming, of course, that there’s no skew in the document, or only a small one.


I haven’t quite given up on getting the bounding box (coordinates) of a returned text fragment and thought I would raise this issue just in case anyone knows or can easily divine the answer.

Line 1 below returns the coordinates of the lower left corner of each text fragment. However, this is of little use because there are often multiple observations of each text fragment, and you might end up with eight differing coordinates for one text fragment.

Line 2 below returns the actual text of the text fragment and deals with the multiple-observation issue by using the topCandidates method and setting it to 1. I thought this might also be used to return the coordinates of the lower left corner but I got an unrecognized-selector error.

set bottomCoordinatesOne to anObservation's bottomLeft() --> {x:0.091569775733, y:0.898666666667}
set theResult to ((anObservation's topCandidates:1)'s objectAtIndex:0)'s |string|() --> (NSString) "Line one"
set bottomCoordinatesTwo to ((anObservation's topCandidates:1)'s objectAtIndex:0)'s bottomLeft() --> -[VNRecognizedText bottomLeft]: unrecognized selector sent to instance 0x600000de79e0

I’m sorry, I should’ve been more detailed in my suggestion to use the bounding box.

You’re trying to get the bottomLeft of a VNRecognizedText. That’s not possible, because a VNRecognizedText doesn’t have that property, according to its documentation.

I think that you’d have to use the VNRecognizedText method boundingBoxForRange:error: on the first topCandidate. It expects a range argument, which you could perhaps simply set to (0:0) to get only the bounding box of the first character of this observation. That assumes, of course, that the text recognition is good enough to give you one observation for each table cell.

chrillek. Thanks for the suggestion.

I think the issue is that the topCandidates method does not return the bounding box information I want. If there is only one observation for each text fragment then my issue is solved, but that’s not normally the case. Using boundingBoxForRange doesn’t seem to return correct information.

Seems I was wrong again.
Here’s what I do in JavaScript:

const OCRresult = schnipsel.map(s => { // This loops over all observations
  const bestHit = s.topCandidates(1).js[0]; // get the best candidate
  return { // return a new object containing the string and the lower left corner
    "string": bestHit.string.js,
    "origin": {
      x: s.bottomLeft.x * imageSize.width,
      y: s.bottomLeft.y * imageSize.height
    }
  };
});

So, I use the topCandidates:1 objectAtIndex:0 to get the string. And I use the bottomLeft of the observation (and there’s only one observation in this case, namely the one of which I’m asking for the topCandidates). I also multiply with the image width/height to get the lower left in pixel coordinates.

This code worked about a year ago. So, it may well be that Apple changed something behind the scenes, and it has stopped working. Why the boundingBoxForRange approach doesn’t give you meaningful values, I have no idea.
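
For the table question, objects shaped like the ones in OCRresult could be grouped into rows and written out as CSV. What follows is only a sketch under assumptions: fragmentsToCSV, the tolerance of 10 pixels, and the sample data are all hypothetical, and it presumes there’s little or no skew in the document.

```javascript
// Hypothetical helper, not part of the script above: groups fragments into rows
// and joins each row as one CSV line.
// `tolerance` (in pixels) is an assumed value for how far two origins may differ
// in y and still count as the same row.
function fragmentsToCSV(fragments, tolerance = 10) {
  // The origin is at the lower left, so sort by descending y to start at the top.
  const sorted = [...fragments].sort((a, b) => b.origin.y - a.origin.y);
  const rows = [];
  for (const fragment of sorted) {
    const lastRow = rows[rows.length - 1];
    // Compare against the first fragment of the current row.
    if (lastRow && Math.abs(lastRow[0].origin.y - fragment.origin.y) <= tolerance) {
      lastRow.push(fragment);
    } else {
      rows.push([fragment]);
    }
  }
  // Quote a cell if it contains a comma, a double quote, or a line break.
  const quote = text =>
    /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
  return rows
    .map(row =>
      row
        .sort((a, b) => a.origin.x - b.origin.x) // left to right within a row
        .map(f => quote(f.string))
        .join(",")
    )
    .join("\n");
}

const exampleFragments = [
  { string: "Name", origin: { x: 40, y: 702 } },
  { string: "2", origin: { x: 300, y: 648 } },
  { string: "Qty", origin: { x: 300, y: 700 } },
  { string: "Bolts, M6", origin: { x: 41, y: 650 } }
];
console.log(fragmentsToCSV(exampleFragments));
```

With the sample data this gives two lines, Name,Qty and "Bolts, M6",2. The right tolerance depends on the image resolution and the row height, so it would need testing with real documents.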


It appears that I misunderstood what is returned by the results() property, and, because of that, everything further on didn’t work. The following returns a list of lists with each sublist containing three items: a text fragment’s string; x coordinate of the lower left corner of the bounding box; and y coordinate of the lower left corner of the bounding box.

use framework "Foundation"
use framework "Vision"
use scripting additions

set theFile to POSIX path of (choose file of type {"public.image"})
set ocrData to getImageText(theFile)

on getImageText(theFile)
	set theFile to current application's |NSURL|'s fileURLWithPath:theFile
	set requestHandler to current application's VNImageRequestHandler's alloc()'s initWithURL:theFile options:(missing value)
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	requestHandler's performRequests:(current application's NSArray's arrayWithObject:(theRequest)) |error|:(missing value)
	set theResults to theRequest's results()
	set theArray to current application's NSMutableArray's new()
	repeat with aResult in theResults
		set aResult to contents of aResult
		set theSubArray to current application's NSMutableArray's new()
		set theString to ((aResult's topCandidates:1)'s objectAtIndex:0)'s |string|()
		set theBottomLeft to aResult's bottomLeft() -- normalized (0 to 1) coordinates with the origin at the lower left of the image
		(theSubArray's addObjectsFromArray:{theString, theBottomLeft's x, theBottomLeft's y})
		(theArray's addObject:theSubArray)
	end repeat
	return theArray as list
end getImageText

I’m pretty much finished with my study of the Vision framework but wanted to include a handler that can be used to write the string returned by any of the scripts to a file on the desktop.

-- put this after the line that begins with "set theText"
writeFile(theText)

-- put this at the end of the script
on writeFile(theText)
	set theText to current application's NSString's stringWithString:theText
	set theFolder to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")
	set theFile to theFolder's stringByAppendingPathComponent:"OCR Results.txt"
	theText's writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile

@peavine - this is great!

I cheated when I had a project like this last fall: I saw that Automator had a “get text of image” block. I had 900 sequential images from which I wanted the Mac Vision text (it was way more accurate than the other OCR I was using).

I had a hunch that this could be done with ObjC but wasn’t that motivated. Now you’ve got to it! Thank you!

It was fast in Automator but required me to grab the results from that process. This scripting you did here would let me run it as part of a more detailed process. Love it!


My script in post 21 performs OCR on one page of a PDF document. It works as expected in most instances but returns limited or no text with a few PDFs. In preliminary testing, the following script appears to fix this issue and seems a bit more reliable overall.

use framework "AppKit"
use framework "Foundation"
use framework "PDFKit"
use framework "Vision"
use scripting additions

set pageNumber to 1 -- user set as desired
set imageResolution to 300 -- user test different values
set theFile to POSIX path of (choose file of type {"com.adobe.pdf"})
set imageData to getImageData(theFile, pageNumber, imageResolution)
set theText to getText(imageData)

on getImageData(theFile, pageNumber, thePPI) -- based on a handler by Shane Stanley
	set theFile to current application's |NSURL|'s fileURLWithPath:theFile
	set theDocument to current application's PDFDocument's alloc()'s initWithURL:theFile
	set thePage to (theDocument's pageAtIndex:(pageNumber - 1))
	set pageSize to (thePage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	set pageWidth to current application's NSWidth(pageSize)
	set pageHeight to current application's NSHeight(pageSize)
	set pixelWidth to (pageWidth * thePPI / 72) div 1
	set pixelHeight to (pageHeight * thePPI / 72) div 1
	set pdfImageRep to (current application's NSPDFImageRep's imageRepWithData:(thePage's dataRepresentation()))
	set newRep to (current application's NSBitmapImageRep's alloc()'s initWithBitmapDataPlanes:(missing value) pixelsWide:pixelWidth pixelsHigh:pixelHeight bitsPerSample:8 samplesPerPixel:4 hasAlpha:yes isPlanar:false colorSpaceName:(current application's NSDeviceRGBColorSpace) bytesPerRow:0 bitsPerPixel:32)
	current application's NSGraphicsContext's saveGraphicsState()
	current application's NSGraphicsContext's setCurrentContext:(current application's NSGraphicsContext's graphicsContextWithBitmapImageRep:newRep)
	pdfImageRep's drawInRect:{origin:{x:0, y:0}, |size|:{width:pixelWidth, height:pixelHeight}} fromRect:(current application's NSZeroRect) operation:(current application's NSCompositeSourceOver) fraction:1.0 respectFlipped:false hints:(missing value)
	current application's NSGraphicsContext's restoreGraphicsState()
	return (newRep's representationUsingType:(current application's NSJPEGFileType) |properties|:{NSImageCompressionFactor:1.0})
end getImageData

on getText(imageData)
	set requestHandler to current application's VNImageRequestHandler's alloc()'s initWithData:imageData options:(missing value)
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	requestHandler's performRequests:(current application's NSArray's arrayWithObject:(theRequest)) |error|:(missing value)
	set theResults to theRequest's results()
	set theArray to current application's NSMutableArray's new()
	repeat with aResult in theResults
		(theArray's addObject:(((aResult's topCandidates:1)'s objectAtIndex:0)'s |string|()))
	end repeat
	return (theArray's componentsJoinedByString:linefeed) as text
end getText

The following script prompts the user to select the language that will be used to perform OCR on a selected file. One option is automatic language detection, but this will yield less accurate results in some circumstances. The script can be edited to use additional languages.

-- revised 2023.12.16

use framework "Foundation"
use framework "Vision"
use scripting additions

set theFile to (choose file of type {"public.image"} with prompt "Select a file for OCR processing")
set languageCodes to {Chinese:{"zh-Hant", "zh-Hans"}, English:{"en-US"}, French:{"fr-FR"}, Japanese:{"ja-JP"}, Spanish:{"es-ES"}} -- edit as desired
set languageCodes to (current application's NSDictionary's dictionaryWithDictionary:languageCodes)
set dialogList to ((languageCodes's allKeys())'s sortedArrayUsingSelector:"localizedStandardCompare:") as list
set theLanguage to (choose from list ({"Automatic"} & dialogList) with prompt "Select an option for use in text recognition" default items "Automatic" with title "Optical Character Recognition")
if theLanguage is false then error number -128
set theText to getText(theFile, theLanguage, languageCodes)

on getText(theFile, theLanguage, languageCodes)
	set theFile to current application's |NSURL|'s fileURLWithPath:(POSIX path of theFile)
	set requestHandler to current application's VNImageRequestHandler's alloc()'s initWithURL:theFile options:(missing value)
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	if theLanguage is {"Automatic"} then
		theRequest's setAutomaticallyDetectsLanguage:true
	else
		set languageCode to languageCodes's valueForKey:(item 1 of theLanguage)
		theRequest's setRecognitionLanguages:languageCode
	end if
	requestHandler's performRequests:(current application's NSArray's arrayWithObject:(theRequest)) |error|:(missing value)
	set theResults to theRequest's results()
	set theArray to current application's NSMutableArray's new()
	repeat with aResult in theResults
		(theArray's addObject:(((aResult's topCandidates:1)'s objectAtIndex:0)'s |string|()))
	end repeat
	return (theArray's componentsJoinedByString:linefeed) as text
end getText

The following script returns supported language codes, and it should probably be run and the results checked before using the above script. It could be incorporated in the above script as error checking, if that’s desired.

use framework "Foundation"
use framework "Vision"

set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
set supportedLanguageCodes to theRequest's supportedRecognitionLanguagesAndReturnError:(missing value)
return supportedLanguageCodes as list

(* On my Sonoma computer this script returns {"en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant", "yue-Hans", "yue-Hant", "ko-KR", "ja-JP", "ru-RU", "uk-UA", "th-TH", "vi-VT"} *)

Fredrik71. Thanks for looking at my script. I agree that prompting the user twice is a bit clumsy.

I suspect most users will want to perform OCR with one or two languages only, and if that’s the case my script can be greatly simplified. I’ve included a handler that checks whether the specified language codes are supported; it takes less than a millisecond to run.

use framework "Foundation"
use framework "Vision"
use scripting additions

set theFile to (choose file of type {"public.image"})
set languageCodes to {"zh-Hant", "zh-Hans"} -- user set as desired
checkLanguageSupport(languageCodes) -- disable if desired
set theText to getText(theFile, languageCodes)

on checkLanguageSupport(languageCodes)
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	set supportedLanguageCodes to theRequest's supportedRecognitionLanguagesAndReturnError:(missing value)
	set setOne to current application's NSSet's setWithArray:languageCodes
	set setTwo to current application's NSSet's setWithArray:supportedLanguageCodes
	if (setOne's isSubsetOfSet:setTwo) is false then display dialog "A language code is not supported" buttons {"OK"} cancel button 1 default button 1
end checkLanguageSupport

on getText(theFile, languageCodes)
	set theFile to current application's |NSURL|'s fileURLWithPath:(POSIX path of theFile)
	set requestHandler to current application's VNImageRequestHandler's alloc()'s initWithURL:theFile options:(missing value)
	set theRequest to current application's VNRecognizeTextRequest's alloc()'s init()
	theRequest's setRecognitionLanguages:languageCodes
	-- theRequest's setUsesLanguageCorrection:true -- enable if desired but not for Chinese language
	requestHandler's performRequests:(current application's NSArray's arrayWithObject:(theRequest)) |error|:(missing value)
	set theResults to theRequest's results()
	set textFragments to current application's NSMutableArray's new()
	repeat with aResult in theResults
		(textFragments's addObject:(((aResult's topCandidates:1)'s objectAtIndex:0)'s |string|()))
	end repeat
	return (textFragments's componentsJoinedByString:linefeed) as text
end getText