I’ve been reading a bit about Text Item Delimiters on the forum, as I’m really new to them. I thought they might help me with a small issue I have, but I can’t figure out how to use them in this scenario. A lot of what I’ve read assumes the text to amend is already inside the AppleScript.
I don’t know if this is possible or if I’m looking at the wrong type of AppleScript, but I want my script to read through a PDF file, find the paths of linked images, and set them as variables so I can automatically collect them later…
I will be using different PDFs with different linked images; the only thing in common is that all the images are held on the same server.
So…
Inside the PDF, the linked-file line starts with: ‘%%DocumentFiles:/Volumes/MyServer/etc/etc’. I would like to take this line from the PDF and then have my AppleScript collect the linked image (which is the easy part :D). I’ve just got no idea how to get this variable line from my PDF.
I’m just pretty confused and don’t know if this is the right way to go about it. A point in the right direction would be a great help…
This is easy to do if and only if a file is readable as plain or Unicode text. A PDF file is not, and Adobe Reader is not scriptable, so you have no way of applying TIDs. I might have a workaround and I’ll post back.
-- Let the user pick a PDF, starting in the Documents folder
set F to choose file default location (path to documents folder) without invisibles
tell application "Preview" to open F
delay 2 -- give Preview time to open the document
activate application "Preview"
tell application "System Events" to tell process "Preview"
    keystroke "a" using {command down} -- Select All
    delay 3
    keystroke "c" using {command down} -- Copy
end tell
set R to the clipboard
-- and so on with the TID extractions
You may have to fiddle with the delays. Some PDFs simply don’t show up in Preview, and others are not copyable.
choose file with prompt "Get image data for this PDF:" without invisibles
set thePDF to result
try
    do shell script "/usr/bin/strings " & quoted form of POSIX path of thePDF & ¬
        " | /usr/bin/grep '^%%' | /usr/bin/grep --only-matching '/.*'"
    set grepStrings to paragraphs of result
on error
    display alert "No Image Data" message "No image data could be found in that file." buttons {"Cancel"} default button "Cancel"
    error number -128 -- cancel
end try
set imageList to {}
repeat with thisItem in grepStrings
    try
        -- Try to filter out results that aren't files (e.g. dates)
        set end of imageList to (POSIX file thisItem) as alias
    end try
end repeat
If you want the results to be POSIX paths, then you could change the repeat loop:
repeat with thisItem in grepStrings
    try
        (POSIX file thisItem) as alias -- Try to filter out results that aren't files (e.g. dates)
        set end of imageList to contents of thisItem -- store the text itself, not a reference to the list item
    end try
end repeat
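For comparison, the same "keep only things that really are files" filter could be sketched on the shell side, before the results ever reach AppleScript. This is only an illustration, not part of the script above: the helper name existing_paths is made up, and "/2/2008" is a stand-in for the kind of date fragment the '/.*' pattern can pick up.

```shell
# Hypothetical helper: read candidate paths on stdin, one per line,
# and print only those that exist on disk.
existing_paths() {
    while IFS= read -r candidate; do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
}

# "/2/2008" stands in for a date fragment; "/" is a path that always exists.
printf '%s\n' '/2/2008' '/' | existing_paths
```

Doing the check in AppleScript, as in the loops above, has the advantage that the surviving items can be turned into aliases in the same step.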
There’s a scriptable $15 shareware app, File Juicer, which can extract images from PDF files.
set f to choose file with multiple selections allowed
tell application "File Juicer"
    juice files f results location on the desktop with showing results
end tell
I tried to use the ‘grep’ shell command, but I couldn’t manage to get only the paths for the linked images; I just got the whole text of the file or nothing at all… Is that because it’s a binary file?
Yes. By default, grep returns the entire line that the match was found in, and a binary file isn’t divided into meaningful lines. The --only-matching (or -o) option will (obviously?) output only what was matched by the pattern; however, you wouldn’t be able to come up with a pattern that would determine the end of the file path.
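A small sketch of that difference, run against a made-up one-line input rather than a real PDF (the "trailing data" part and the two helper names are assumptions for illustration):

```shell
# Hypothetical one-line input: a path followed by unrelated text on the same line.
line='%%DocumentFiles:/Volumes/MyServer/etc/etc trailing data'

# Default behaviour: grep prints the whole line that contains the match.
whole_line() { printf '%s\n' "$line" | grep 'DocumentFiles'; }

# With -o only the matched text is printed, but '/.*' still runs to the end
# of the line, so whatever follows the path comes along with it.
matched_only() { printf '%s\n' "$line" | grep -o '/.*'; }

whole_line
matched_only
```

The second command drops the "%%DocumentFiles:" prefix but cannot tell where the path stops, which is the problem described above.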
What strings does is “find the printable strings in a object”; that is, it looks for what might be readable text in a (usually binary) file. The important part for my script is that each string that is found is output on its own line. This means we don’t have to use grep to find the end of the file path.
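To see the effect in isolation, here is a rough sketch that builds a tiny stand-in for a binary file and runs the same pipeline over it. The file contents and the find_paths name are invented for the example; a real PDF will contain far more noise.

```shell
# Build a small stand-in for a binary file: non-printable bytes around a %% line.
sample=$(mktemp)
printf 'junk\000\001%%%%DocumentFiles:/Volumes/MyServer/etc/etc\000\002more' > "$sample"

# strings prints each printable run on its own line, so the path now ends
# where its line ends and '/.*' no longer picks up anything extra.
find_paths() {
    strings "$1" | grep '^%%' | grep -o '/.*'
}

find_paths "$sample"
rm -f "$sample"
```

The non-printable bytes act as separators here, which is why the path comes out cleanly without any pattern for where it ends.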
This is only useful for a PDF that is referencing external files. (Side note: I added a nicer error message in my script above.)
In case you’re curious, here’s a small sample of the strings output: