Hello all (greetings from South Africa),
My first post - and hoping there will be many more.
I am so new to all of this that it only dawned on me in the past week that there is a difference between AppleScript and everything shell-script related. I’ve just re-read that, and now I’m not even sure I have that part correct… But suffice it to say, I’m so much of a beginner that I don’t even know enough to know when I’m barking up the wrong tree in the wrong forest on the wrong continent… :rolleyes:
I’m hoping someone can help me out. I’ve popped up a query on the Hazel site - but as explained, I don’t really know where to begin, in order to get the required help.
As an aside - I’ve downloaded several resources that were kindly popped up on the MacScripter site, for beginners, and am starting to get my head around everything there… But it’s going to take a while - so hopefully someone can speed up the learning curve.
The background:
1.) I’m in the throes of completing my PhD. I have many thousands of PDF files on my Mac, many of which have been OCR’d and are therefore searchable, but many have not.
2.) The plan was to get Hazel to automate much of the following - find pdf files that must still be OCR’d, send them to PDFpen to be OCR’d, label them, and then move them back to a new set of subfolders.
3.) I have managed to work out every step in Hazel, barring what I thought would be the easy one - checking if a file needs to be OCR’d.
This is where I need some divine programming intervention!
I have looked through the metadata of several PDFs, both those that have been OCR’d, and those that must still be done, in order to try and find metadata information that might distinguish the two. Unfortunately, nothing conclusive showed up.
It was then suggested I open the PDF with TextEdit, and have a look through the contents of the file as plain text.
Here I noticed the following:
a.) If the terms “Encoding” and “DecodeParms” appear TWO times or fewer in the plaintext body of the PDF file, then there is a very high likelihood that the file has NOT been OCR’d.
b.) The plan was accordingly to get Hazel to take files moved into a watched folder, run the TextEdit-style (or similar) search on the PDF file, and check whether “encoding” and/or “decodeparms” appear TWO times or fewer in the plaintext data.
c.) IF so - Hazel was then to invoke the PDFpen rule, and send the relevant file to be OCR’d [as mentioned - I have managed to get this part working].
My attempts so far:
i.) I first tried the “grep” function, as a shell script.
Hazel uses “$1” to refer to the file being processed - and an exit status of 0 indicates a match.
Since Hazel only acts on a file if it is matched - for my scenario to work, the “match” must arise where the term “encoding” is present TWO times or less (or not at all) within the plaintext data of the pdf.
if [$(grep -ci "encoding" "$1") -gt 2];
then
echo 1;
else
echo 0
fi
Someone kindly explained the above as meaning that “grep” runs with -c [count the matching lines] and -i [case insensitive], and the test then checks whether that count for “encoding” is -gt [greater than] 2 - if so, echo 1, else echo 0.
I have tried this in a variety of forms, but cannot get it to work.
The one plaintext pdf has “encoding” appear TWICE, the other has it appear 28 times - but both are labeled, as opposed to only labeling/acting on the former, as was hoped…
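A note on why both files probably match, with a minimal sketch of the intended check. In "sh", the "[" needs a space after it and the "]" needs a space before it. Without them the test itself fails, the "else" branch runs, and the script still ends with status 0. Separately, "echo" only prints text and does not set the exit status that Hazel reads. The sketch below assumes Hazel passes the file path as “$1” and treats exit status 0 as a match, as described above. The function name needs_ocr is made up for illustration. It counts occurrences rather than matching lines, which is a design choice and not something Hazel requires.

```sh
#!/bin/sh
# Sketch only. Assumptions: Hazel passes the file path as "$1" and treats
# exit status 0 as a match. needs_ocr is a hypothetical helper name.

needs_ocr() {
    # -a: treat the (binary) PDF as text; -o: print each match on its own line;
    # -i: ignore case. wc -l then counts occurrences rather than matching lines.
    count=$(LC_ALL=C grep -aoi "encoding" "$1" | wc -l | tr -d ' ')
    # Succeeds (status 0) when the word appears two times or fewer.
    [ "$count" -le 2 ]
}

# Only run the check when a file path was actually supplied.
if [ -n "$1" ]; then
    needs_ocr "$1"
    exit $?
fi
```

With this shape, a file where “encoding” appears twice would exit with status 0 (a match, so Hazel acts on it), while a file where it appears 28 times would exit with status 1 (no match).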
This evening, it was pointed out that I might try using AppleScript instead of a shell script.
Hazel would then use “theFile” to refer to the file being processed, and requires a “return of true” to indicate a match.
I would naturally attempt this - but for the simple point that I have no idea where to begin getting the shell script to look like an AppleScript - assuming of course that the former was correct to begin with…
It was furthermore suggested that I look around for “sh” and “bash scripting”… which means the square root of nothing to me…
Hopefully all the above makes sense to some kind guru samaritan, who might be able to give me some tips - in a manner that a complete neophyte can understand…