Script to get metadata of a PDF

stefcyr · March 1, 2009, 11:32pm

Works great now! You guys are geniuses!

porkozone · March 5, 2009, 6:03pm

mark hunte:

StefanK:

stefcyr:

Hi Mark, your 1st script does exactly what I need, the only thing I’m missing is what I get from your second script “kMDItemEncodingApplications”. I’m getting a result in that field in TextEdit that I would like to get in the 1st script. That’s all I need to get it to be perfect, any idea?

Just add the metadata parameter kMDItemEncodingApplications to the list theListCommand
and an appropriate keyword to the list theList

Thanks StefanK,
And stefcyr, it looks like you did find the bit you were looking for: "kMDItemEncodingApplications " would seem to be the PDF Producer, and "kMDItemCreator " is actually the Content Creator, as porkozone pointed out.

Updated Script
set biglist to {}
set theListCommand to {"kMDItemFSName ", "kMDItemAuthors ", "kMDItemEncodingApplications ", "kMDItemCreator ", "kMDItemTitle ", "kMDItemDescription ", "kMDItemContentCreationDate  "}
set theList to {"File Name = ", "Author = ", "PDF Producer = ", "Content Creator = ", "Title = ", "Description = ", "Content Creation Date = "}
tell application "Finder"
	set SiTem to selection
	repeat with item_a from 1 to number of items in SiTem
		set this_item to item item_a of SiTem as string
		set this_item to POSIX path of this_item
		repeat with item_b from 1 to number of items in theListCommand
			set this_kMDItem to item item_b of theListCommand as string
			set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & "-raw -nullMarker None " & quoted form of this_item)
			set this_kMDItemResult to ""
			repeat with item_c from 1 to number of items in theResult
				set this_kMDItemResult to this_kMDItemResult & item item_c of theResult & space as string
			end repeat
			copy item item_b of theList & this_kMDItemResult & return to end of biglist
		end repeat
		set last item of biglist to return as string
	end repeat
end tell
biglist as string

Thanks to all for your input! One thing: the last item in the list does not get pulled for some reason. In the examples above, the kMDItemContentCreationDate is missing from the result. If I put something else as the last item, that one is also missing.

porkozone · March 5, 2009, 6:21pm

It looks like this line

set last item of biglist to return as string

is what’s causing it. If I understand correctly, this line is taking whatever the last item is, and replacing it with a return, wiping out the last item in the process. I seem to have removed it successfully without causing other issues, but am curious if there is some reason for this line that I am not seeing?

StefanK · March 5, 2009, 7:05pm

porkozone:

set last item of biglist to return as string
is what’s causing it. If I understand correctly, this line is taking whatever the last item is, and replacing it with a return, wiping out the last item in the process. I seem to have removed it successfully without causing other issues, but am curious if there is some reason for this line that I am not seeing?

The purpose is obviously to separate the entries with an additional return.
This should kill two birds with one stone

set end of biglist to return

PS: Actually you can omit all as string coercions

mark_hunte · March 5, 2009, 7:49pm

Oops…

Jeffkr · April 23, 2015, 11:09pm

StefanK,
Your little Foundation Tool CLI IS FANTASTIC!!!

Question.
Not that I actually need this functionality at this point in time, but do you have any plans to update this tool so it also includes the “PDF Version”? I only ask because it’s often useful to know if the PDF’s transparency is flattened via a version of 1.3.

In any event, I am using your tool to test for the presence of a particular keyword we enter after we flightcheck our PDFs. If the keyword is detected, my script continues with the Save function; otherwise it alerts the operator.

I can’t thank you enough!

-Jeff

StefanK · April 24, 2015, 5:00am

No problem, I added the document version and number of pages.
Same Link

Jeffkr · April 24, 2015, 2:42pm

Hi StefanK,
After downloading the new app and overwriting the old with the new in the same location, the script event log is not returning any of the document’s properties?

Replacing the old PDFMetadate file does yield correct results where the script reads the data.

On a side note, Acrobat is interesting. Let’s assume I have a document that has the text “Document has been flightchecked” as its Keyword in its Document Properties. If I then completely delete this Keyword from the Properties field, the document still retains the deleted string somehow, somewhere, even if I save the file under a new name. It’s not until I add a new character in the Keywords field that the old string gets replaced. This is a non-issue, but I still think it’s odd that Acrobat Pro 9 behaves this way.

StefanK · April 24, 2015, 2:53pm

As the project is quite old, Xcode updated some project settings.
I recompiled the CLI with a deployment target of 10.5 and 32/64 bit universal architecture.
Always same link.

nellbern · January 25, 2018, 12:30am

I have been struggling with this script for years. The whole idea is to get a file count of the PDFs in the selected folder, a count of the PDFs that start with the letter R, and finally a count of the PDFs that contain the keyword correction. The script creates a text file with this info. The part I’m having problems with is getting a file count of the PDFs that contain the keyword correction. I’m getting a = (null) result

set target_folder to choose folder with prompt "Choose target folders containing only PDFs to count files" with multiple selections allowed without invisibles
set results to ""

repeat with i from 1 to (count target_folder)
	set thisFolder to (POSIX path of item i of target_folder)
	
	--Find & count all PDFs in the folders selected
	--('*.pdf' is quoted so the shell passes the pattern to find unexpanded)
	set fileCount to do shell script "find " & quoted form of thisFolder & " -type f -iname '*.pdf' | wc -l"
	set results to (results & "" & thisFolder & "=" & tab & fileCount & tab)
	
	--Find & count all PDFs in the folders selected that PDF file name starts with letter R
	set fileCount to do shell script "find " & quoted form of thisFolder & " -type f -iname 'R[0-9-_]*.pdf' | wc -l"
	set results to (results & "" & tab & "RESENDS=" & tab & fileCount & tab)
	
--THIS IS THE PART I'M HAVING PROBLEMS
	--Find & count all PDFs in the folders selected that keyword is correction
	set fileCount to do shell script "mdls -name " & "kMDItemKeywords" & "-raw -nullMarker None " & quoted form of thisFolder --& " -type f  -iname *.pdf | wc -l"
	set results to (results & "" & tab & "CORRECTION=" & tab & fileCount & return)
	
end repeat


--write results to a txt file
set theFilePath to (path to desktop folder as string) & "PDF File Count.txt"
set theFile to open for access file theFilePath with write permission
try
	set eof of theFile to 0
	--write results to file theFilePath
	write results to theFile
	close access theFile
on error
	close access theFile
end try
display dialog "done" giving up after 1

Model: iMac (Retina 5K, 27-inch, Late 2015)
AppleScript: 2.8.1 (183.1)
Browser: Safari 537.36
Operating System: Mac OS X (10.11.6)

Nigel_Garvey · January 25, 2018, 11:36am

I think what you want may look something like this:

--Find & count all PDFs in the folders selected that keyword is correction
set fileCount to do shell script "find " & quoted form of thisFolder & " -type f -iname '*.pdf' -print0 | xargs -0 mdls -name kMDItemKeywords | grep -i '\\bcorrection\\b' | wc -l"

find’s -print0 primary outputs character code 0 after each path instead of a linefeed. xargs’s -0 option makes it expect character code 0 as the separator instead of linefeeds and spaces. xargs itself runs the mdls command with the paths as its arguments. The grep command case-insensitively matches the complete word “correction”.
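The effect of -print0 and -0 can be seen without mdls. What follows is only a minimal sketch: the temporary directory and file names are made up, and printf stands in for mdls so that each argument xargs hands over shows up as one output line.

```shell
# Made-up test folder with one PDF whose name contains spaces.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.pdf" "$demo_dir/name with spaces.PDF" "$demo_dir/notes.txt"

# NUL-separated: each path reaches the command as exactly one argument.
# Quoting '*.pdf' keeps the shell from expanding the pattern before find sees it.
arg_count=$(find "$demo_dir" -type f -iname '*.pdf' -print0 | xargs -0 printf '%s\n' | wc -l | tr -d ' ')

# Default separators: the path with spaces is split into several arguments.
split_count=$(find "$demo_dir" -type f -iname '*.pdf' | xargs printf '%s\n' | wc -l | tr -d ' ')

echo "with -print0/-0: $arg_count arguments"
echo "without:         $split_count arguments"
rm -rf "$demo_dir"
```

With the NUL separator the argument count equals the number of PDF files; without it, the count is inflated by the spaces in the second file name.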

nellbern · January 26, 2018, 1:00am

I made the modifications to the script & it didn’t work. I use Adobe Bridge to add metadata. I can clearly see the word correction in the keywords. However, when I open the PDF in Acrobat & check Document Properties, I see in the Keywords section that Acrobat is adding a literal semicolon followed by a space & then the word correction. Example: ; correction

So if I delete the semicolon ; & the space & leave just the word correction, your code works, but that defeats the purpose of batching metadata in Bridge. I tried to modify the grep search to find ; followed by a space then correction, but it still fails. My point is that Acrobat is adding additional characters when metadata is applied using Bridge. This is totally out of my league & very advanced for me, but can you help me once again & tell me if what I added is correct? grep -i '\b\;\^correction\b'

set target_folder to choose folder with prompt "Choose target folders containing only PDFs to count files" with multiple selections allowed without invisibles
set results to ""

repeat with i from 1 to (count target_folder)
	set thisFolder to (POSIX path of item i of target_folder)
	--Find & count all PDFs in the folders selected that keyword is correction
	set fileCount to do shell script "find " & quoted form of thisFolder & " -type f  -iname *.pdf -print0 | xargs -0 mdls -name kMDItemKeywords | grep -i '\\b\\;\\^correction\\b' | wc -l"
	set results to (results & "" & tab & "CORRECTION=" & tab & fileCount & return)
	
end repeat

--write results to a txt file
set theFilePath to (path to desktop folder as string) & "PDF File Count.txt"
set theFile to open for access file theFilePath with write permission
try
	set eof of theFile to 0
	--write results to file theFilePath
	write results to theFile
	close access theFile
on error
	close access theFile
end try

Shane_Stanley · January 26, 2018, 8:39am

This uses my MetadataLib script library:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use script "Metadata Lib" version "2.0.0"

set theFolders to choose folder with prompt "Choose target folders containing only PDFs to count files" with multiple selections allowed without invisibles

set theFiles to perform search in folders {theFolders} predicate string "kMDItemContentType == %@ AND kMDItemKeywords CONTAINS[cd] %@" search arguments {"com.adobe.pdf", "; correction"}

You can get the library here:

https://www.macosxautomation.com/applescript/apps/Script_Libs.html

Nigel_Garvey · January 26, 2018, 11:05am

I’m guessing this means that 0 was returned for the number of files having “correction” amongst their kMDItemKeywords.

That’s a bit of a mystery. The grep code returns any line containing “correction” as a complete word (not bounded by other “word” characters such as letters, digits, or underscores). So the presence of a semicolon and a space should make no difference. My only guesses are that either there are invisible characters in the word “correction” you see on screen (maybe it’s wrongly encoded) or the space is actually some other character which grep considers to be a word character. It may be possible to find out by using this script:

set f to (choose file of type "com.adobe.pdf" with prompt "Choose a PDF file whose kMDItemKeywords you know contains the word \"correction\" …")
do shell script "mdls -name kMDItemKeywords " & quoted form of POSIX path of f
return id of result

Look out for any unusually high or unusually low numbers in the result.

Otherwise it’s worth giving Shane’s library a try. His script returns a list of the paths to the matching files, which would then have to be counted. (But ‘search string’ should be ‘predicate string’ with version 2.0.0.)
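The claim that a leading semicolon and space make no difference to the word-boundary match can be checked in isolation. This is a small sketch only: the sample lines are invented and merely shaped like mdls output, not taken from a real file.

```shell
# Invented sample lines, one keyword per line as mdls would list them.
sample='    "; correction",
    "corrections",
    "Correction",
    "uncorrection"'

# \b matches between a word character (letter, digit, underscore) and
# anything else, so "; correction" and "Correction" match, while
# "corrections" and "uncorrection" do not.
matches=$(printf '%s\n' "$sample" | grep -ci '\bcorrection\b')
echo "$matches"
```

If the real mdls output for a problem file does not match in the same way, that points to an unexpected character next to the word rather than to the semicolon.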

Shane_Stanley · January 26, 2018, 11:13am

Indeed it should. Thanks.

Nigel_Garvey · January 27, 2018, 12:42pm

I’ve gone through all the PDFs I can find on my computer to see what mdls returns for their kMDItemKeywords. Only two (not created by me) have keywords beginning with "; ". In both cases, the space is a normal space and my grep code matches them when “correction” is replaced with the relevant text.

The PDF for Shane’s book “Everyday AppleScriptObjC” has a keyword which contains the copyright symbol “©”. This character is outside the normal “ASCII” range and is returned by mdls as “\U00a9”, so grep only recognises it if it’s searching for this. Shane’s script, on the other hand, only recognises it if it’s searching for the copyright symbol itself.

So if, by the merest chance, the space in "; " happens to be a no-break space (character id 160), it’s likely that mdls will render it as “\U00a0”, in which case my grep code won’t work. It’s likely too that Shane’s script won’t work either unless a no-break space is used in the search argument.

On the off-chance that no-break spaces are the problem, here’s a revised line. The egrep code matches both “correction” as a complete word and “\U00a0correction”.

--Find & count all PDFs in the folders selected that keyword is correction
set fileCount to do shell script "find " & quoted form of thisFolder & " -type f -iname '*.pdf' -print0 | xargs -0 mdls -name kMDItemKeywords | egrep -i '(\\\\U00a0|\\b)correction\\b' | wc -l"
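Why the plain \b pattern would miss the no-break space case can be sketched at the shell level. The two sample lines below are invented; the first assumes, as suggested above, that mdls would print the no-break space as the six literal characters \U00a0. Note that the pattern is written here as the shell sees it, without the extra backslashes that an AppleScript string literal needs.

```shell
# Invented sample lines: an escaped no-break space and a normal space.
nbsp_line='    ";\U00a0correction"'
space_line='    "; correction"'

# In "\U00a0correction" the "0" and the "c" are both word characters,
# so there is no word boundary before "correction" and \b alone fails.
plain=$(printf '%s\n%s\n' "$nbsp_line" "$space_line" | grep -ci '\bcorrection\b')

# The alternation accepts either a literal \U00a0 or a word boundary.
both=$(printf '%s\n%s\n' "$nbsp_line" "$space_line" | grep -Eci '(\\U00a0|\b)correction\b')

echo "plain pattern: $plain line(s), revised pattern: $both line(s)"
```

Here grep -E is used in place of egrep; the two are equivalent for this pattern.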

Marc_Anthony · January 27, 2018, 4:55pm

Hmm. This is interesting. Searching files to which I added the keyword “Correction,” I can acceptably locate that metadata for file classes such as JPG and TIFF. My code below should also work for PDFs, but doesn’t—at least not for PDFs that were edited with CS3’s Bridge app. Metadata returned (by the mdls command for one PDF) lists the usual suspects, however, the keyword “Correction” fails to appear at all. Perhaps the XML is somehow mangled?

count (do shell script "mdfind -onlyin " & my (choose folder)'s POSIX path's quoted form & " kMDItemKeywords == 'Correction' ")'s paragraphs

edit: Nigel’s code in post #29 also returns the incorrect entry for my test PDFs—" 0"—while correctly returning " 2" for the JPEGs.

Nigel_Garvey · January 27, 2018, 6:17pm

Hi Marc.

Thanks. It’s good to get the input of someone who actually has Bridge!

When you say that mdls fails to register “Correction” at all with PDFs, is that just under kMDItemKeywords or under any heading? Your post has prompted me to wonder if it might actually be under some other heading. Just a thought. Clutching at straws.

set f to (choose file of type "com.adobe.pdf" with prompt "Choose a PDF file whose kMDItemKeywords you know should contain the word \"correction\" …")
do shell script "mdls " & quoted form of POSIX path of f
--> All the metadata visible to mdls.

Marc_Anthony · January 27, 2018, 6:42pm

I can see that my entry was appended in the plain contents as viewed in TextEdit—but it doesn’t register under any heading that mdls reports as being metadata.

Shane_Stanley · January 27, 2018, 8:03pm

I suspect you’re dealing with Adobe’s XMP metadata, which isn’t searchable via Spotlight.
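If the keyword really does live only in the XMP packet, one possible workaround is to search the raw file contents instead of asking Spotlight. The sketch below rests on two assumptions that are not confirmed in this thread: that the XMP packet is stored uncompressed in the PDF (which would fit the keyword being visible in TextEdit), and that each keyword sits in its own <rdf:li> element. The files are stand-ins created for the demonstration, not real PDFs.

```shell
# Stand-in files; real PDFs would come from the chosen folder.
scan_dir=$(mktemp -d)
printf '%%PDF-1.4\n<rdf:li>correction</rdf:li>\n' > "$scan_dir/tagged file.pdf"
printf '%%PDF-1.4\n<rdf:li>approved</rdf:li>\n' > "$scan_dir/other.pdf"

# -a treats binary files as text, -l prints only the names of matching
# files, -i ignores case. The <rdf:li> wrapper is an assumption about how
# the keyword is stored in the XMP packet.
xmp_count=$(find "$scan_dir" -type f -iname '*.pdf' -print0 | xargs -0 grep -lai '<rdf:li>correction</rdf:li>' | wc -l | tr -d ' ')

echo "$xmp_count"
rm -rf "$scan_dir"
```

Because grep -l prints each matching file once, the result is a count of files rather than of matching lines. Used from AppleScript, the pipeline would go inside do shell script with the folder path passed through quoted form of.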