Rasterize pdf

I was interested in this article.

As I see it, scripting the “Preview” interface is a good way to convert a PDF to a multipage TIFF, and there is an AsObjC script for the reverse conversion. But I would like to understand the point of converting there and back. What is the advantage?

Another point. As far as I can see, sips and Image Events can export single TIFFs. I think AsObjC can also convert each page of a PDF to a set of single TIFFs, and as far as I know there is an AsObjC script for merging TIFFs. So “Preview” GUI scripting could be avoided. But first I want to know what the purpose is.

NOTE: on Catalina, use the “Export…” menu item in place of “Save As…”.

If you convert a PDF to a TIFF and back again, then it is no longer searchable and can’t be indexed. Possibly that is the reason, but certainly ayden27 can say more about it.

I wrote an AsObjC solution here, for myself and other users, that avoids Preview GUI scripting. I have also improved the script’s speed by creating and using a RAM disk instead of the user domain’s Temporary Items folder:


use scripting additions
use framework "Foundation"
use framework "AppKit"
use framework "Quartz"
use framework "QuartzCore"

property NSString : a reference to NSString of current application
property |NSURL| : a reference to |NSURL| of current application
property PDFDocument : a reference to PDFDocument of current application
property NSImage : a reference to NSImage of current application
property NSImageView : a reference to NSImageView of current application
property NSBitmapImageRep : a reference to NSBitmapImageRep of current application
property NSTIFFFileType : a reference to NSTIFFFileType of current application
property desktopFolder : path to desktop folder

-- Create RAM disk
my makeRAMdisk()
-- Choose some pdf file
set aPDF to choose file of type "pdf"
-- Make temporary job folder on the RAM disk
tell application "System Events" to try
	make new folder at folder "RAM Disk:" with properties {name:"TIFFs"}
end try
-- Save PDF as multiple TIFFs in the temporary job folder
set TIFFs to my splitPDFasTIFFs(aPDF)
-- Merge TIFFs back into a single PDF file, saved on the desktop
my combineFiles:TIFFs savingToPDF:(POSIX path of desktopFolder & "Combined.pdf")
-- Now we can delete the unneeded temporary folder
tell application "System Events" to delete folder "RAM Disk:TIFFs:"
-- Eject the RAM disk (if needed)
tell application "Finder" to eject "RAM Disk:"


--=================================== HANDLERS =======================================

on makeRAMdisk()
	set dName to "RAM Disk"
	set dCapacity to 512 * 2 * 2000 --1GB
	set aCmd to "diskutil erasevolume HFS+ '" & dName & "' `hdiutil attach -nomount ram://" & (dCapacity as string) & "`"
	do shell script aCmd
end makeRAMdisk


on splitPDFasTIFFs(aPDF)
	set aURL to (|NSURL|'s fileURLWithPath:(POSIX path of aPDF))
	set aPDFdoc to PDFDocument's alloc()'s initWithURL:aURL
	set pCount to aPDFdoc's pageCount()
	set TIFFs to {}
	-- Split the PDF into pages exported as Tiff files
	repeat with i from 0 to (pCount - 1)
		set thisPage to (aPDFdoc's pageAtIndex:i)
		set thisDoc to (NSImage's alloc()'s initWithData:(thisPage's dataRepresentation()))
		if thisDoc = missing value then error "Error in getting imagerep from PDF in page:" & (i as string)
		set theData to thisDoc's TIFFRepresentation()
		set newRep to (NSBitmapImageRep's imageRepWithData:theData)
		set targData to (newRep's representationUsingType:NSTIFFFileType |properties|:{NSImageCompressionMethod:(current application's NSTIFFCompressionNone)})
		set nextPath to "/Volumes/RAM Disk/TIFFs/" & i & ".tiff"
		set end of TIFFs to nextPath
		set outPath to (NSString's stringWithString:nextPath)
		(targData's writeToFile:outPath atomically:true) -- Export
	end repeat
	return TIFFs
end splitPDFasTIFFs


on combineFiles:TIFFs savingToPDF:destPosixPath
	-- make new empty PDF document
	set theDoc to PDFDocument's alloc()'s init()
	repeat with i from 0 to (count TIFFs) - 1
		-- make URL of the next TIFF
		set inNSURL to (|NSURL|'s fileURLWithPath:(item (i + 1) of TIFFs))
		-- make PDF document from the URL
		set newDoc to (my pdfDocFromImageURL:inNSURL)
		-- get page of PDF
		set thePDFPage to (newDoc's pageAtIndex:0) -- zero-based indexes
		-- insert the page into main PDF
		(theDoc's insertPage:thePDFPage atIndex:i)
	end repeat
	set outNSURL to |NSURL|'s fileURLWithPath:destPosixPath
	-- save the main PDF
	(theDoc's writeToURL:outNSURL)
end combineFiles:savingToPDF:


on pdfDocFromImageURL:inNSURL
	set theImage to NSImage's alloc()'s initWithContentsOfURL:inNSURL
	set theSize to theImage's |size|()
	set theRect to {{0, 0}, theSize}
	set theImageView to NSImageView's alloc()'s initWithFrame:theRect
	theImageView's setImage:theImage
	set theData to theImageView's dataWithPDFInsideRect:theRect
	return PDFDocument's alloc()'s initWithData:theData
end pdfDocFromImageURL:
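
A note on the RAM disk size used in makeRAMdisk: the number after ram:// is a count of 512-byte sectors, so 512 * 2 * 2000 asks for 2,048,000 sectors, which is 1000 MiB, or roughly the 1 GB the comment mentions. A minimal sketch of that arithmetic follows; the handler name sectorsForMegabytes is hypothetical and not part of the script above.

```applescript
-- Hypothetical helper: number of 512-byte sectors for a RAM disk of the given size in MiB
on sectorsForMegabytes(sizeInMB)
	-- 1 MiB = 1024 * 1024 bytes and one sector = 512 bytes, so 2048 sectors per MiB
	return sizeInMB * 2048
end sectorsForMegabytes

sectorsForMegabytes(1000) -- 2048000, the same value as 512 * 2 * 2000
```

With a helper like this, the capacity could be chosen from the expected size of the temporary TIFFs rather than hard-coded.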

I wouldn’t call it a fatal flaw.

sips is a tool for working with raster images and colour profiles. Image Events provides access to its functionality.

PDF is a vector format, and thus sips is not built for such a purpose.

You can probably skip the intermediate files and RAM disk altogether, like this:

on rasterPDF:aPDF savingTo:destPosixPath
	set aURL to (|NSURL|'s fileURLWithPath:(POSIX path of aPDF))
	set aPDFdoc to PDFDocument's alloc()'s initWithURL:aURL
	set pCount to aPDFdoc's pageCount()
	repeat with i from 0 to (pCount - 1)
		set thisPage to (aPDFdoc's pageAtIndex:i)
		set thisDoc to (NSImage's alloc()'s initWithData:(thisPage's dataRepresentation()))
		if thisDoc = missing value then error "Error in getting image from PDF in page:" & (i as string)
		set theData to thisDoc's TIFFRepresentation()
		set theImage to (NSImage's alloc()'s initWithData:theData)
		set newPage to (current application's PDFPage's alloc()'s initWithImage:theImage)
		(aPDFdoc's removePageAtIndex:i)
		(aPDFdoc's insertPage:newPage atIndex:i)
	end repeat
	set outNSURL to |NSURL|'s fileURLWithPath:destPosixPath
	aPDFdoc's writeToURL:outNSURL
end rasterPDF:savingTo:

There are no words. Great. I will keep both scripts for myself, as a keepsake. Your script, Shane, is what I call optimal. By the way, I haven’t found the slightest information on PDF rasterization using AsObjC before.

I tested the two scripts with a 168-page PDF. The speed is almost the same (mine is slightly faster), but the PDF produced by Shane’s script is 42 MB and the one produced by my script is 9 MB. I don’t understand why there is such a big difference between them.

Is it possible to increase the resolution within the script? Say… to 150 dpi? The output seems like it’s about 72 dpi.

@Shane. I agree with KniazidisR–your script is outstanding. Very useful and beautifully compact.

BTW, the script did not work until I inserted “current application’s” in several spots. Is there some reason these are not needed?

@Mockman. I meant the words, fatal flaw, to refer to my script and its inability to fulfill the OP’s needs. Perhaps I should have been more clear.

FWIW, I tested Shane’s and KniazidisR’s scripts, using Shane’s ASObjC book (a PDF) as the test document. I also tested with Preview (save as TIFF at 72 dpi and then as PDF). The file sizes were:

Original - 2.4 MB

With Shane’s script - 104.5 MB

With KniazidisR’s script - 21.5 MB

With Preview - 18.1 MB

I looked at the new PDFs and Shane’s was as expected, but the pages of the PDF created by KniazidisR’s script were out of order. This appears to be fixed by zero-padding the counter used in naming the TIFF files.

set j to text -3 thru -1 of ("000" & i as text)
set outPath to (NSString's stringWithString:("/Volumes/RAM Disk/TIFFs/" & j & ".tiff"))
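
To illustrate why padding helps, here is a small sketch in plain AppleScript. It assumes the TIFF files end up being ordered by name somewhere along the way; the handler name zeroPad is hypothetical and does the same job as the text -3 thru -1 expression above, but for any width.

```applescript
-- Hypothetical helper: left-pad a non-negative integer with zeros to a fixed width
on zeroPad(n, width)
	set s to n as text
	repeat while (count s) < width
		set s to "0" & s
	end repeat
	return s
end zeroPad

zeroPad(7, 3) -- "007"
zeroPad(10, 3) -- "010"
-- Compared as plain text, "10" comes before "7", but "007" comes before "010",
-- so padded names sort in page order
```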

The PDFs created by KniazidisR’s and Shane’s scripts were both 72 dpi.

This version lets you specify the resolution:

on rasterPDF:aPDF savingTo:destPosixPath resolution:theDpi
	set aURL to (|NSURL|'s fileURLWithPath:(POSIX path of aPDF))
	set aPDFdoc to PDFDocument's alloc()'s initWithURL:aURL
	set pCount to aPDFdoc's pageCount()
	repeat with i from 0 to (pCount - 1)
		set thisPage to (aPDFdoc's pageAtIndex:i)
		-- do size calculations
		set pageSize to (thisPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
		set pageWidth to current application's NSWidth(pageSize)
		set pageHeight to current application's NSHeight(pageSize)
		set pixelWidth to (pageWidth * theDpi / 72) div 1
		set pixelHeight to (pageHeight * theDpi / 72) div 1
		-- make bitmaps
		set theImageRep to (current application's NSPDFImageRep's imageRepWithData:(thisPage's dataRepresentation()))
		set newRep to (current application's NSBitmapImageRep's alloc()'s initWithBitmapDataPlanes:(missing value) pixelsWide:pixelWidth pixelsHigh:pixelHeight bitsPerSample:8 samplesPerPixel:4 hasAlpha:yes isPlanar:false colorSpaceName:(current application's NSDeviceRGBColorSpace) bytesPerRow:0 bitsPerPixel:32)
		-- store the existing graphics context
		current application's NSGraphicsContext's saveGraphicsState()
		-- set graphics context to new context based on the new bitmapImageRep
		(current application's NSGraphicsContext's setCurrentContext:(current application's NSGraphicsContext's graphicsContextWithBitmapImageRep:newRep))
		(theImageRep's drawInRect:{origin:{x:0, y:0}, |size|:{width:pixelWidth, height:pixelHeight}} fromRect:(current application's NSZeroRect) operation:(current application's NSCompositeSourceOver) fraction:1.0 respectFlipped:false hints:(missing value))
		-- restore state
		current application's NSGraphicsContext's restoreGraphicsState()
		-- make new image and page
		(newRep's setSize:{pageWidth, pageHeight})
		set theData to newRep's TIFFRepresentation()
		set theImage to (NSImage's alloc()'s initWithData:theData)
		set newPage to (current application's PDFPage's alloc()'s initWithImage:theImage)
		(aPDFdoc's removePageAtIndex:i)
		(aPDFdoc's insertPage:newPage atIndex:i)
	end repeat
	set outNSURL to |NSURL|'s fileURLWithPath:destPosixPath
	aPDFdoc's writeToURL:outNSURL
end rasterPDF:savingTo:resolution:
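
The size calculation in the handler above can be tried on its own. PDF page sizes are measured in points, 72 to the inch, so the pixel count is points × dpi / 72. A minimal sketch in plain AppleScript; the handler name pixelsForPoints and the US Letter page size are assumptions used only for illustration.

```applescript
-- Hypothetical helper: pixels needed to render a length given in points at a target dpi
on pixelsForPoints(lengthInPoints, theDpi)
	-- 72 points per inch; div 1 truncates to a whole number of pixels
	return (lengthInPoints * theDpi / 72) div 1
end pixelsForPoints

-- US Letter page (612 x 792 points) at 150 dpi
pixelsForPoints(612, 150) -- 1275
pixelsForPoints(792, 150) -- 1650
```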

Thank you, Peavine, for your consideration. My script did not keep the pages in order. I made the fix in post #5, in a way that is slightly more efficient than padding the filename.

Also, I removed the unnecessary repeat loop (in the combineFiles handler), and now the script is 1.5 times faster. (It also creates a PDF close in size to the one created by the Preview method.)

Shane, thanks for your last script. I ran 2 tests.

Your script worked successfully with a 5-page PDF at 600 dpi. With a 168-page PDF at 600 dpi, the script hangs, and I get a message that Script Debugger is not responding and is using 62 GB of memory!

It looks like a memory leak somewhere in the script. What do you say?

I tried your handler as follows. Maybe the alloc() statements need some additional parentheses?


use scripting additions
use framework "Foundation"
use framework "AppKit"
use framework "Quartz"
use framework "QuartzCore"

set aPDF to choose file of type "pdf"
set destPosixPath to (POSIX path of (path to desktop folder)) & "Rasterized.pdf"
my rasterPDF:aPDF savingTo:destPosixPath resolution:600

on rasterPDF:aPDF savingTo:destPosixPath resolution:theDpi
	set aURL to (current application's |NSURL|'s fileURLWithPath:(POSIX path of aPDF))
	set aPDFdoc to current application's PDFDocument's alloc()'s initWithURL:aURL
	set pCount to aPDFdoc's pageCount()
	-- do size calculations
	set thisPage to (aPDFdoc's pageAtIndex:0)
	set pageSize to (thisPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	set pageWidth to current application's NSWidth(pageSize)
	set pageHeight to current application's NSHeight(pageSize)
	set pixelWidth to (pageWidth * theDpi / 72) div 1
	set pixelHeight to (pageHeight * theDpi / 72) div 1
	repeat with i from 0 to (pCount - 1)
		set thisPage to (aPDFdoc's pageAtIndex:i)
		-- make bitmaps
		set theImageRep to (current application's NSPDFImageRep's imageRepWithData:(thisPage's dataRepresentation()))
		set newRep to (current application's NSBitmapImageRep's alloc()'s initWithBitmapDataPlanes:(missing value) pixelsWide:pixelWidth pixelsHigh:pixelHeight bitsPerSample:8 samplesPerPixel:4 hasAlpha:yes isPlanar:false colorSpaceName:(current application's NSDeviceRGBColorSpace) bytesPerRow:0 bitsPerPixel:32)
		-- store the existing graphics context
		current application's NSGraphicsContext's saveGraphicsState()
		-- set graphics context to new context based on the new bitmapImageRep
		(current application's NSGraphicsContext's setCurrentContext:(current application's NSGraphicsContext's graphicsContextWithBitmapImageRep:newRep))
		(theImageRep's drawInRect:{origin:{x:0, y:0}, |size|:{width:pixelWidth, height:pixelHeight}} fromRect:(current application's NSZeroRect) operation:(current application's NSCompositeSourceOver) fraction:1.0 respectFlipped:false hints:(missing value))
		-- restore state
		current application's NSGraphicsContext's restoreGraphicsState()
		-- make new image and page
		(newRep's setSize:{pageWidth, pageHeight})
		set theData to newRep's TIFFRepresentation()
		set theImage to (current application's NSImage's alloc()'s initWithData:theData)
		set newPage to (current application's PDFPage's alloc()'s initWithImage:theImage)
		(aPDFdoc's removePageAtIndex:i)
		(aPDFdoc's insertPage:newPage atIndex:i)
	end repeat
	set outNSURL to current application's |NSURL|'s fileURLWithPath:destPosixPath
	aPDFdoc's writeToURL:outNSURL
end rasterPDF:savingTo:resolution:

The issue, sadly, is that ASObjC leaks memory badly, period. Initially it relied on automatic garbage collection, but when that was abandoned memory management was presumably just tacked on to AppleScript’s own, periodic, garbage collection. But even that doesn’t seem to clear everything out.

In most cases, the OS’s efficient overall memory management means it doesn’t matter much. But when you push it hard – which you’re doing in that test – it tends to bog down more or less completely.

(The leaking is such that if you run a batch of tests, the memory use is cumulative. I suspect that’s a straight bug.)
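
For a rough sense of scale, the bitmap created for each page holds 4 bytes per pixel (8 bits × 4 samples). The sketch below estimates how much that adds up to if nothing is released between pages. The handler name bitmapBytes is hypothetical, and the US Letter page size is an assumption, since the page size of the 168-page test PDF was not stated.

```applescript
-- Hypothetical helper: bytes in one uncompressed 32-bit bitmap for a page given in points
on bitmapBytes(pageWidth, pageHeight, theDpi)
	set pixelWidth to (pageWidth * theDpi / 72) div 1
	set pixelHeight to (pageHeight * theDpi / 72) div 1
	return pixelWidth * pixelHeight * 4 -- 4 bytes per pixel
end bitmapBytes

-- Assumption: US Letter pages (612 x 792 points)
set perPage to bitmapBytes(612, 792, 600) -- 134640000 bytes, about 128 MiB
set allPages to perPage * 168 -- about 2.26E+10 bytes, roughly 21 GiB
```

That is before counting the TIFF data and the NSImage that each pass through the loop also creates, so tens of gigabytes at 600 dpi is plausible if the memory is never reclaimed.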

Further to that: the poor memory management is one of the reasons I withdrew my book on how to write ASObjC-based apps in Xcode. It’s just too easy to write apps that then crash intermittently because of memory problems (sometimes because a clean-up appears to have happened).

But it’s generally fine in applets, which mostly just run and quit.

Yes, this is a very unpleasant incident. Thanks for the clarification. I was encoding a movie (at about 97% CPU) when I tested your script.

I’m working on a project in which I need to get image data from a PDF and to increase the resolution of the image data. This is being done to perform optical character recognition on the image data. The thread is here and the script is in post 21.

I adapted code from Shane’s script in post 12 above and it appears to work. However, the purpose of Shane’s code is to rasterize a PDF, while I simply want to get the image data and to increase its resolution. I don’t completely understand Shane’s code and wanted to ask about a few issues.

My working handler is:

on getImageData(theFile, thePageNumber)
	set theFile to (current application's |NSURL|'s fileURLWithPath:theFile)
	set theDocument to current application's PDFDocument's alloc()'s initWithURL:theFile
	set thePage to (theDocument's pageAtIndex:(thePageNumber - 1))
	set pageSize to (thePage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	set pageWidth to current application's NSWidth(pageSize)
	set pageHeight to current application's NSHeight(pageSize)
	set pixelWidth to (pageWidth * 300 / 72) div 1
	set pixelHeight to (pageHeight * 300 / 72) div 1
	set pdfImageRep to (current application's NSPDFImageRep's imageRepWithData:(thePage's dataRepresentation()))
	set newRep to (current application's NSBitmapImageRep's alloc()'s initWithBitmapDataPlanes:(missing value) pixelsWide:pixelWidth pixelsHigh:pixelHeight bitsPerSample:8 samplesPerPixel:4 hasAlpha:yes isPlanar:false colorSpaceName:(current application's NSDeviceRGBColorSpace) bytesPerRow:0 bitsPerPixel:32)
	current application's NSGraphicsContext's saveGraphicsState()
	current application's NSGraphicsContext's setCurrentContext:(current application's NSGraphicsContext's graphicsContextWithBitmapImageRep:newRep)
	pdfImageRep's drawInRect:{origin:{x:0, y:0}, |size|:{width:pixelWidth, height:pixelHeight}} fromRect:(current application's NSZeroRect) operation:(current application's NSCompositeSourceOver) fraction:1.0 respectFlipped:false hints:(missing value)
	current application's NSGraphicsContext's restoreGraphicsState()
	return newRep's TIFFRepresentation() -- this works but seems wrong
	-- return pdfImageRep's TIFFRepresentation() -- shouldn't above line be this
end getImageData

First, why is newRep returned at the end of the handler instead of pdfImageRep? It seems that pdfImageRep contains the data and has been set to the higher resolution (i.e. pixelWidth and pixelHeight). Second, is there a simpler approach to get the image data given that I don’t want to place the image data back in the PDF?

On a related matter, I selected a resolution of 300 dpi based on trial and error in performing OCR, and I wonder if there’s a better way to get an optimal figure. The PDFs will vary greatly, and perhaps there is no best resolution.

Thanks for the help.

Mockman. Thanks for the response and my apologies for not explaining better. The image will be used in a script that performs Optical Character Recognition and returns text found in images in the PDF. I’ve edited my post to better explain this.

Ah, silly me. I’ll delete my prior post momentarily.

But on that note, I use an app called FineReader to perform OCR. They provide the OCR engine in devonthink but I bought their app because I was interested in other language scripts (other than Latin-based). Anyway, when processing a text, by default they ask to set the resolution to 300 dpi but I can’t offer any analysis on whether that makes a difference in accuracy.

Because the content of pdfImageRep has been drawn into newRep – that’s how you alter the resolution.

You could just use NSImageRep’s imageRepWithContentsOfURL:.

Optimal for what?

Thanks Shane. That answers my questions. In retrospect, my question about resolution was not relevant.