Script to get metadata of a PDF

stefcyr · February 26, 2009, 6:51pm

Is there a way to access a PDF metadata such as author, PDF producer and so on using AppleScript or shell scripts?

StefanK · February 26, 2009, 8:45pm

Hi,

I wrote a little Foundation Tool CLI, PDFMetadata, which prints out the metadata you requested.
Save it somewhere and use this syntax; you only have to adjust the paths.


set theMetadata to do shell script "'/path/to/PDFMetadata' '/path/to/document.pdf'"

The result looks like
Producer: Acrobat Distiller 8.0.0 (Macintosh)
CreationDate: 2007-04-24 11:09:29 +0200
ModDate: 2009-02-26 21:27:39 +0100
Keywords: (
“keyword 1; keyword 2; keyword 3”
)
Author: MeMyselfAndI
Title: mydocument.pdf
Creator: Photoshop: pictwpstops filter 1.0
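
If you only need one field from that output, a small shell filter along these lines might help. This is only a sketch: metadata_value is a made-up helper name, not part of PDFMetadata, and it assumes the one-line "Key: value" layout shown above. Multi-line entries such as Keywords would only return their first line.

```shell
# Hypothetical helper (not part of PDFMetadata): read "Key: value" lines on
# stdin and print the value that belongs to the requested key.
metadata_value() {
    key=$1
    # Strip only the leading "Key: " so values that contain ": " stay intact
    sed -n "s/^${key}: //p"
}

# Possible usage, with the same placeholder paths as above:
#   '/path/to/PDFMetadata' '/path/to/document.pdf' | metadata_value Author
```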

CalvinFold · February 26, 2009, 9:39pm

Without a separate utility or script around, I discovered that Metadata is human-readable in a hexdump of a PDF. So that means GREP could be used to search inside and parse for metadata.

Looks like Adobe uses <pdf:metadatafieldname></pdf:metadatafieldname> to enclose keywords as well as untagged which might be more difficult to parse.

So for example, if you had Keyword = “one, two, three” you’d be looking for:

<pdf:Keyword>one, two, three</pdf:Keyword>

This assumes you specifically look for a certain metadata field. If you need them all, that might be different, you’d have to experiment. You can use this to see what I mean (save as a droplet):

--
-- Get Hexdump Info v4
-- by Kevin Quosig, 3/28/07
--
-- Used to drag-n-drop files to examine their contents/headers.
--
-- Most code segments courtesy of James Nierodzik of MacScripter
-- http://bbs.applescript.net/profile.php?id=8727
--


--
-- UTILITY HANDLER
--

-- Search and Replace routine using AppleScript Text Item Delimiters "trick"
--
on searchNreplace(parse_me, find_me, replace_with_me)
	
	--save incoming TID state, set new TIDs
	set {ATID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, find_me}
	
	--using the specified character as a break point to strip the delimiter out and break the string into items
	set being_parsed to text items of parse_me
	
	--switch the TIDs again (replace string)
	set AppleScript's text item delimiters to {replace_with_me}
	
	--coerce it back to a string with new delimiters
	set parse_me to being_parsed as string
	
	--restore incoming TID state
	set AppleScript's text item delimiters to ATID
	
	--return results
	return parse_me
	
end searchNreplace


--
-- MAIN HANDLER
--
on open fileList
	
	-- parse through files dropped onto droplet
	repeat with i from 1 to number of items in fileList
		
		set AppleScript's text item delimiters to {""} --reset delimiters
		set this_item to item i of fileList as string ---pick item to work with
		set this_item_posix to quoted form of POSIX path of this_item --need POSIX path for shell scripts
		set doc_name to name of (info for alias this_item) --used for renaming the TextEdit window
		
		--Improved hexdump script line by TheMouthofSauron at MacScripter
		--http://bbs.applescript.net/viewtopic.php?pid=77811#p77811
		--
		--hexdump with the -C parameter formats the hexdump as columns of hex pairs
		--and then a column with a human-readable "ASCII translation" delimited by a pipe
		--character at the beginning and end of the ASCII column
		--
		--"awk" takes the entire -C formatted hexdump line ($0 = the whole input line)
		--and filters out the hex pairs and the delimiting pipe characters
		--(returns only the 16 characters starting at position 62)
		--
		set hex_dump to (do shell script "hexdump -C " & this_item_posix & " | awk '{print(substr($0,62,16))}'")
		
		--remove carriage returns so output is one giant paragraph
		--(allows for TextEdit searching for strings and manual scanning)
		set hex_dump to searchNreplace(hex_dump, return, "")
		
		--write to TextEdit window and rename window to file name to keep things straight
		tell application "TextEdit"
			make new document
			set text of front document to hex_dump
			set name of front window to doc_name
		end tell
	end repeat
end open
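
To actually pull the text between a pair of tags rather than just eyeball it in the dump, something like the following shell sketch could be adapted. xmp_field is a made-up name, and the sketch assumes the metadata packet is stored uncompressed and that the value sits directly between the opening and closing tag on one line. Nested or multi-line values would need more work.

```shell
# Hypothetical helper: print the text between <pdf:FIELD> and </pdf:FIELD>.
xmp_field() {
    field=$1
    file=$2
    # -a treats the binary PDF as text, -o prints only the matching part;
    # LC_ALL=C keeps grep from stumbling over non-text bytes
    LC_ALL=C grep -a -o "<pdf:${field}>[^<]*</pdf:${field}>" "$file" |
        sed -e "s|^<pdf:${field}>||" -e "s|</pdf:${field}>\$||"
}

# Possible usage from AppleScript (the function would have to live in a
# script file or be pasted into the do shell script string):
#   xmp_field Keyword '/path/to/document.pdf'
```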

stefcyr · February 27, 2009, 2:22pm

Thank you Stefan and Calvin, I will try both solutions.

porkozone · February 27, 2009, 4:20pm

If you have any success, I would love to see the code that pulled the actual metadata. Thanks.

CalvinFold · February 27, 2009, 6:23pm

I was in a hurry earlier, but a GREP search routine that you could adapt is below.

It takes two inputs: the path to the file you want to GREP the innards of, and a list of strings to look for. It was specifically designed to be given a list of strings and, if it found one of them, to stop and return which one it found. You’d have to adapt it to actually pull data between two strings (PDF tags) or to generically look for all metadata, but it gives you some idea how to access GREP.

You’d use the routine I posted earlier to do the research for WHAT to look for…i.e. what GREP is “seeing” during its searches.

Sounds messier than it ends up being. I’ve found myself needing to parse file innards like this a lot, oddly.

StefanK’s is probably easier, but means anyplace you used the script you’d have to be sure his add-on was handy. I prefer to make all my apps stand-alone since I can’t count on such things being handy and have to deploy things to dozens of machines. (No slight against StefanK, just different methodologies. StefanK is my hero! :D)

-- revised GREP routine courtesy of
-- Bruce Phillips of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=83871#p83871
--
on grepForString(path_to_grep, search_list)
	repeat with current_grep_item in search_list
		try --grep exits with status 1 when it finds nothing, and do shell script reports any non-zero exit status as an error
			do shell script "/usr/bin/grep --count " & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep
			set grep_result to result
			exit repeat
		on error error_message number error_number
			if error_message is "0" then -- grep didn't find anything
				set grep_result to 0
			else
				-- pass on the error
				error error_message number error_number
			end if
		end try
	end repeat
	
	return {grep_result, contents of current_grep_item}
end grepForString
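
For comparison, the same "stop at the first string found" idea can be sketched directly in shell, where the not-found case is just an exit status instead of an error to trap. first_match is a made-up name, and unlike the routine above this sketch uses -F, so the strings are matched literally rather than as regular expressions.

```shell
# Hypothetical helper: print the first string from the list that occurs
# anywhere in the file; return 1 if none of them is found.
first_match() {
    file=$1
    shift
    for needle in "$@"; do
        # -a: treat binary as text, -F: literal string, -q: no output.
        # grep exits 0 when the string is found and 1 when it is not.
        if LC_ALL=C grep -a -F -q -e "$needle" "$file"; then
            printf '%s\n' "$needle"
            return 0
        fi
    done
    return 1
}

# Possible usage:
#   first_match '/path/to/document.pdf' 'Distiller' 'Quartz' 'PDF Library'
```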

mark_hunte · February 27, 2009, 9:09pm

Using mdls (Spotlight)

Select the PDF files in Finder and run.

set biglist to {}
set theListCommand to {"kMDItemFSName ", "kMDItemAuthors ", "kMDItemCreator ", "kMDItemTitle ", "kMDItemDescription ", "kMDItemContentCreationDate  "}
set theList to {"File Name = ", "Author = ", "Creator = ", "Title = ", "Description = ", "Content Creation Date = "}
tell application "Finder"
	set SiTem to selection
	repeat with item_a from 1 to number of items in SiTem
		set this_item to item item_a of SiTem as string
		set this_item to POSIX path of this_item
		repeat with item_b from 1 to number of items in theListCommand
			set this_kMDItem to item item_b of theListCommand as string
			set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & "-raw -nullMarker None " & quoted form of this_item)
			set this_kMDItemResult to ""
			repeat with item_c from 1 to number of items in theResult
				set this_kMDItemResult to this_kMDItemResult & item item_c of theResult & space as string
			end repeat
			copy item item_b of theList & this_kMDItemResult & return to end of biglist
		end repeat
		set last item of biglist to return as string
	end repeat
end tell
biglist as string

Most likely can be trimmed…
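
As one possible trimmed version, here is a shell-only sketch of the same loop. It uses the same mdls options as the script above; pdf_report and the MDLS variable are made-up names, and MDLS exists only so the command can be swapped out when testing away from a Mac.

```shell
# Path to mdls; kept in a variable so it can be overridden (for example in tests)
MDLS=${MDLS:-/usr/bin/mdls}

# Hypothetical helper: print one "Label = value" line per attribute.
pdf_report() {
    file=$1
    for pair in \
        "File Name:kMDItemFSName" \
        "Author:kMDItemAuthors" \
        "Creator:kMDItemCreator" \
        "Title:kMDItemTitle" \
        "Description:kMDItemDescription" \
        "Content Creation Date:kMDItemContentCreationDate"
    do
        label=${pair%%:*}   # text before the first colon
        attr=${pair#*:}     # text after the first colon
        value=$("$MDLS" -name "$attr" -raw -nullMarker None "$file")
        # Multi-valued attributes span several lines; squeeze them onto one
        value=$(printf '%s' "$value" | tr '\n' ' ' | tr -s ' ')
        printf '%s = %s\n' "$label" "$value"
    done
}

# Possible usage:
#   pdf_report '/path/to/document.pdf'
```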

porkozone · February 28, 2009, 4:18am

Can Spotlight get the PDF Producer (ie distiller, PDF Library, Quartz, etc), or is that metadata not seen by spotlight?

mark_hunte · February 28, 2009, 8:35am

Yes, I called it "Creator = " in the script. The item name for the mdls is kMDItemCreator

Use the script below on any file to get its metadata. This will show you what you can get.
(The script was originally posted on macosxhints forums in 2005)


(* This script takes the file(s) selected in Finder, runs the shell command "mdls" on each to get its metadata, and displays the result in a TextEdit document for each file
©Mark Hunte 2005 - feel free to use as you wish*)
tell application "Finder"
	set SiTem to selection
	repeat with i from 1 to number of items in SiTem
		set this_item to item i of SiTem as string
		set this_item to POSIX path of this_item
		do shell script "mdls " & quoted form of this_item & " | open -f"
		
	end repeat
end tell

stefcyr · February 28, 2009, 1:10pm

Hi Mark, your 1st script does exactly what I need, the only thing I’m missing is what I get from your second script “kMDItemEncodingApplications”. I’m getting a result in that field in TextEdit that I would like to get in the 1st script. That’s all I need to get it to be perfect, any idea?

porkozone · February 28, 2009, 2:22pm

No, not that. If you open a PDF in Acrobat, and choose file>properties, there is a metadata field called “Application” which shows the application that created the document (InDesign, etc), and another metadata field called “PDF Producer”, which is the particular “engine” that created the PDF (Distiller, PDF Producer (when using export in Adobe apps), Quartz (when using OSX’s built-in PDF creator via the print dialog), etc). All the different engines create their own unique “issues” when creating PDFs.

The kMDItemCreator item is more like the “Application” field built into the actual PDF - and the more I’ve looked at the various output from all the scripts shown here, it looks like the kMDItemCreator doesn’t use the PDF metatags built into the PDF (like is shown in the hexdump)…it’s more like the OSX file-based creator code or something (which is more volatile, and can be stripped if the file goes through another OS at some point).

It appears the Spotlight method knows nothing about the PDF Producer, and the kMDItemCreator is not the same as the .

Thanks everyone for sharing these various ways of skinning the same cat, though - I’m learning lots.

StefanK · February 28, 2009, 3:07pm

Just add the metadata parameter kMDItemEncodingApplications to the list theListCommand
and an appropriate keyword to the list theList

mark_hunte · March 1, 2009, 3:45pm

Thanks StefanK,
And stefcyr, it looks like you did find the bit you were looking for: "kMDItemEncodingApplications " would seem to be the PDF Producer. And the "kMDItemCreator " is actually the Content Creator, as porkozone pointed out.

Updated Script

set biglist to {}
set theListCommand to {"kMDItemFSName ", "kMDItemAuthors ", "kMDItemEncodingApplications ", "kMDItemCreator ", "kMDItemTitle ", "kMDItemDescription ", "kMDItemContentCreationDate  "}
set theList to {"File Name = ", "Author = ", "PDF Producer = ", "Content Creator = ", "Title = ", "Description = ", "Content Creation Date = "}
tell application "Finder"
	set SiTem to selection
	repeat with item_a from 1 to number of items in SiTem
		set this_item to item item_a of SiTem as string
		set this_item to POSIX path of this_item
		repeat with item_b from 1 to number of items in theListCommand
			set this_kMDItem to item item_b of theListCommand as string
			set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & "-raw -nullMarker None " & quoted form of this_item)
			set this_kMDItemResult to ""
			repeat with item_c from 1 to number of items in theResult
				set this_kMDItemResult to this_kMDItemResult & item item_c of theResult & space as string
			end repeat
			copy item item_b of theList & this_kMDItemResult & return to end of biglist
		end repeat
		set last item of biglist to return as string
	end repeat
end tell
biglist as string

stefcyr · March 1, 2009, 11:32pm

Works great now! You guys are geniuses!

porkozone · March 5, 2009, 6:03pm

Thanks to all for your input! One thing: the last item in the list does not get pulled for some reason. In the examples above, the kMDItemContentCreationDate is missing from the result. If I put something else as the last item, that one is also missing.

porkozone · March 5, 2009, 6:21pm

It looks like this line

set last item of biglist to return as string

is what’s causing it. If I understand correctly, this line is taking whatever the last item is, and replacing it with a return, wiping out the last item in the process. I seem to have removed it successfully without causing other issues, but am curious if there is some reason for this line that I am not seeing?

StefanK · March 5, 2009, 7:05pm

The purpose is obviously to separate the entries with an additional return.
This should kill both birds with one stone

set end of biglist to return

PS: Actually you can omit all as string coercions

mark_hunte · March 5, 2009, 7:49pm

Oops…:rolleyes:

Jeffkr · April 23, 2015, 11:09pm

StefanK,
Your little Foundation Tool CLI IS FANTASTIC!!!

Question.
Not that I actually need this functionality at this point in time, BUT do you have any plans to update this tool so it also includes the “PDF Version”? I only ask because it’s often useful to know if the PDF’s transparency is flattened via a version of 1.3.

In any event, I am using your tool to test for the presence of a particular keyword we enter after we flightcheck our PDFs. If the keyword is detected my script continues with the Save function; otherwise it alerts the operator.

I can’t thank you enough!

-Jeff

StefanK · April 24, 2015, 5:00am

No problem, I added the document version and number of pages.
Same Link
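
For a quick look at the version without the tool, the header at the very start of the file can be read from the shell. This is only a sketch under the assumption that the file begins with the usual %PDF-x.y header; pdf_header_version is a made-up name, and a /Version entry in the document catalog can override the header value, which this sketch ignores.

```shell
# Hypothetical helper: print the version from the %PDF-x.y file header.
pdf_header_version() {
    # Read the first 8 bytes (for example %PDF-1.3) and strip the prefix;
    # nothing is printed if the header does not have that shape
    LC_ALL=C head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Possible usage:
#   pdf_header_version '/path/to/document.pdf'
```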