Simple(?) grep script for Hazel...

Cassady · March 25, 2013, 9:12pm

Hello all (greetings from South Africa),

My first post - and hoping there will be many more.

I am so new to all of this, it only dawned on me in the past week that there is a difference between AppleScript and everything else shell-script related. I’ve just re-read that, and now I’m not even sure I have that part correct… But suffice it to say, I’m so much of a beginner, I don’t even know enough to know when I’m barking up the wrong tree in the wrong forest on the wrong continent… :rolleyes:

I’m hoping someone can help me out. I’ve popped up a query on the Hazel site - but as explained, I don’t really know where to begin, in order to get the required help.

As an aside - I’ve downloaded several resources that were kindly popped up on the MacScripter site, for beginners, and am starting to get my head around everything there… But it’s going to take a while - so hopefully someone can speed up the learning curve.

The background:

1.) I’m in the throes of completing my PhD. I have many thousands of pdf files on my Mac, many of which have been OCR’d and are therefore searchable, but many have not.

2.) The plan was to get Hazel to automate much of the following - find pdf files that must still be OCR’d, send them to PDFpen to be OCR’d, label them, and then move them back to a new set of subfolders.

3.) I have managed to work out every step in Hazel, barring what I thought would be the easy one - checking if a file needs to be OCR’d.

This is where I need some divine programming intervention!

I have looked through the metadata of several PDFs, both those that have been OCR’d, and those that must still be done, in order to try and find metadata information that might distinguish the two. Unfortunately, nothing conclusive showed up.

It was then suggested I open the pdf with “textedit”, and have a look through the contents of the file in plain format.

Here I noticed the following:

a.) If the terms “Encoding” and “Decodeparms” are present TWO times or less, in the plaintext body of the pdf file - then there is a very high likelihood that the file has NOT been OCR’d.

b.) The plan was accordingly to get Hazel to open files moved into a watched folder, run the textedit (or similar search) on the pdf file, and count whether or not “encoding” and/or “decodeparms” appear TWO times or less in the plaintext data.

c.) IF so - Hazel was then to invoke the PDFpen rule, and send the relevant file to be OCR’d [as mentioned - I have managed to get this part working].

My attempts so far:

i.) I first tried the “grep” function, as a shell script.
Hazel uses “$1” to refer to the file being processed - and an “exit with status 0” indicates a match.
Since Hazel only acts on a file if it is matched - for my scenario to work, the “match” must arise where the term “encoding” is present TWO times or less (or not at all) within the plaintext data of the pdf.

if [$(grep -ci "encoding" "$1") -gt 2];
then
echo 1;
else
echo 0
fi

Someone kindly explained the above as meaning that “grep” will then run with -ci [count the matching lines, case insensitive], and check if that count for the term “encoding” is -gt [greater than] 2 - if so, echo 1, else echo 0.

I have tried this in a variety of forms, but cannot get it to work.
The one plaintext pdf has “encoding” appear TWICE, the other has it appear 28 times - but both are labeled, as opposed to only labeling/acting on the former, as was hoped…

This evening, it was pointed out that I might try using Applescript instead of shell script.

Hazel would then use “theFile” to refer to the file being processed, and requires a “return of true” to indicate a match.

I would naturally attempt this - but for the simple point that I have no idea where to begin getting the Shell Script to look like an Applescript - assuming of course that the former was correct to begin with…

It was furthermore suggested that I look around for “sh” and “bash scripting”… which means the square root of nothing to me…

Hopefully all the above makes sense to some kind guru samaritan, who might be able to give me some tips - in a manner that a complete neophyte can understand…

DJ_Bazzie_Wazzie · March 25, 2013, 9:36pm

edit: First of all Welcome to MacScripter

Something like?

set theFile to (choose file)'s posix path
do shell script "grep -i 'encoding' " & quoted form of theFile & " &>/dev/null && echo 1 || echo 0"

McUsrII · March 25, 2013, 10:10pm

Hello

How would it be, to just scan through all your pdf’s that are reachable by spotlight, creating one long listing with the word OCR prepending those entries that are probably scanned?

Your machine would have to work for a while, but then you’d have the complete overview, dumb brute force…

DJ_Bazzie_Wazzie · March 25, 2013, 11:03pm

Please, not on my machine. I have thousands of books on my machine

McUsrII · March 25, 2013, 11:10pm

Hopefully the OP has stored the papers aside from any book collection. I guess a paper is usually between 5-30 pages.

We can also lower the priority slightly, and just have it going in the background until it is finished, I have no idea how long it takes to OCR a single pdf document.

Cassady · March 26, 2013, 5:44am

DJ Bazzie Wazzie:

edit: First of all Welcome to MacScripter

Something like?

set theFile to (choose file)'s posix path
do shell script "grep -i 'encoding' " & quoted form of theFile & " &>/dev/null && echo 1 || echo 0"

Thanks so much - will give this a try as soon as possible, and pop up the results over here.

Could I ask an incredibly silly question though:

Do I type everything as I see it into my script window, or have you used some interchangeable terms - i.e. do I type in “to (choose file)'s” as is - or is this where I am supposed to insert a reference to the file being activated?

Again - I cringe in reading the above, since it’s no doubt so obvious to anyone with an ounce of experience in scripting - but I have zip!

Wait - I’m already lost! Just tried this - have no idea how to generate the “two lines” in between “echo 1 … echo 0”. That’s a reference for me to do what exactly? Enter/Return?

Cassady · March 26, 2013, 5:53am

Hi!

I contemplated going the old-skool way, of simply moving the pdfs that I have literally opened, searched and thereby confirmed as not yet being searchable/OCR’d, to a [“To Be OCR’d”] folder - but it really is a laborious task.

It involves me having to open 30-50 files with PDFpen - any more and things start taking too long - hitting Ctrl+F, typing in “the/and/unions etc.”, and checking if the file is searchable. If not - I must then invoke the OCR process manually.

I have been doing this in any event over the past weeks, whilst trying to get my Hazel script working, and it is pretty soul-destroying stuff!

As mentioned - I have thousands of pdf’s, and going this route, whilst it would obviously work - will take many hours away from writing and researching… Which is why I’m hoping to have Hazel do it for me, in the background, as it were.

Cassady · March 26, 2013, 6:03am

Thankfully I have a complete series of nested folders/sub-folders, as I am a pretty meticulous “folder-driven” data storer/searcher, so it should be manageable.

I only came over to Mac mid-2012, having only ever known Windows. I’m kicking myself for not having switched years ago. Had I known what I now know is possible on a Mac, when I started researching - my life would have been exponentially easier!

As it stands, I’m now trying to make up lost time, with the help of various software options - and kind script-gurus!

Most papers max out at about 30 pages… Here and there I have scanned sections of textbooks which are longer, as are some of the ‘serious’ papers from some of the more-established journals. It can take anything from 30 seconds to (+/-) 90 seconds to OCR the entire document, depending entirely on the quality of the original scan.

Fortunately, it appears Hazel can be instructed to wait, upon completion of an action. My grand plan is to build in a “pause” section into the relevant rule, which kicks in when pdf’s are moved into a particular folder. Hazel will then scan the pdf to check if it must be OCR’d, invoke PDFpen to do the necessary - if required, and then - when all has been done, move everything out into a separate folder. If I manage the “pause” correctly, it shouldn’t be too heavy on the system, and should be able to run in the background… At least - that’s the plan…

McUsrII · March 26, 2013, 7:33am

Hello.

Please look at this post for the nitty gritty details (especially about logging, and how to get to look at your log).

Assuming your script is just one of several steps, and that you fill in the /bin/bash as the shell, the code to paste into that special rule should be:

a=$(grep -ci "encoding" "$1")
if [ x$a = x ]; then
    exit 1;
fi
if [ $a -gt 1 ]; then
    exit 0
else
    exit 1
fi
I hope this works. By the way, it may have almost worked as it was, but you need a space after the opening bracket and before the closing bracket for bash to parse it correctly.
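
The bracket point can be sketched as follows. This is a minimal illustration, not the exact Hazel script: `broken_test` and `fixed_test` are hypothetical function names, and the count is passed in as an argument instead of coming from grep. Because `[` is an ordinary command, `[28` without a space is looked up as a command of that name and fails, so the `else` branch always runs. The sketch also reports through a status rather than echoing a digit, on the premise stated earlier in the thread that Hazel reads the exit status.

```shell
# Hypothetical helper reproducing the spacing problem from the first post.
broken_test() {
    count=$1
    # Without spaces the shell looks for a command literally named "[28".
    if [$count -gt 2]; then
        echo 1
    else
        echo 0          # always reached, because the "command" above fails
    fi
}

# Same test with the spaces in place, reporting through the return status.
fixed_test() {
    count=$1
    if [ "$count" -gt 2 ]; then
        return 1        # more than two hits: not a match
    else
        return 0        # two hits or fewer: match
    fi
}
```

In a rule script the body would presumably end with `exit` in place of `return`, since the script is run as a whole rather than called as a function.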

Cassady · March 26, 2013, 10:18am

You are a rock-star!

I think it’s working!!!

I took a bit of a gamble, assuming that if “-gt 2” means greater than, then “-lt 3” would read less than…

So I embedded the following into Hazel, and set as my ACTION for Hazel to label the file(s) that “pass the script” Purple.

a=$(grep -ci "encoding" "$1")
if [ x$a = x ];
then 
exit 1;
fi
if [ $a -lt 3 ];
then
exit 0
else 
exit 1
fi

As indicated in the dropbox screengrab:
http://dl.dropbox.com/u/48750317/Hazel%20magic.png

Hazel has ignored the PDF where “encoding” was found 28 times in the plaintext, and run the action on the two PDF’s where “encoding” appeared two times/less/not at all…
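
One caveat worth checking on a few known files: `grep -c` reports the number of matching lines, not the number of occurrences, so a line containing “Encoding” twice counts once. A rough sketch of both counts follows. The function names `count_lines` and `count_occurrences` are made up for illustration, and `-a` (treat the file as text) and `-o` (print each match on its own line) are assumed to behave as they do in the common GNU and BSD versions of grep.

```shell
# Number of lines that contain the word at least once (what -c reports).
count_lines() {
    grep -aci 'encoding' "$1"
}

# Number of individual occurrences: -o prints each match on its own line,
# wc -l counts those lines, and tr strips the padding some wc versions add.
count_occurrences() {
    grep -aio 'encoding' "$1" | wc -l | tr -d ' '
}
```

Either number could be assigned to `a` in the script above; which one separates scanned files from OCR’d ones more reliably is something to try on PDFs whose status is already known.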

Thank you SO much…

If there is any interest, I will pop up a pic of the completed Hazel workflow here, when I’ve set it up.

Not entirely sure if that will be permitted - but shout if I may!
[Will also pay my MacScripter respects over on the Hazel forum!!]

[editing to add much more happiness: :lol: :lol: :lol: :lol: :lol: :lol: :lol: ]

McUsrII · March 26, 2013, 10:28am

Hello.

I suggest you change to this, so you won’t test pdf’s with less than one occurrence of encoding:

a=$(grep -ci "encoding" "$1")
if [ x$a = x ]; then
    exit 1;
fi
if [ $a -lt 3 ]; then
    if [ $a -gt 1 ]; then
        exit 0
    else
        exit 1
    fi
else
    exit 1
fi
I am glad it works for you!
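
The nested test can also be written as a single range check. The sketch below is only an illustration: `in_range` and `match_status` are hypothetical helpers, and the bounds are passed in so they are easy to change. Note that `-gt 1` combined with `-lt 3` admits a count of exactly 2; if a count of 1 should match as well, the lower test would need to be `-gt 0`.

```shell
# Hypothetical helper: succeeds when lo <= n <= hi, fails otherwise
# (including when n is empty).
in_range() {
    n=$1 lo=$2 hi=$3
    [ -n "$n" ] && [ "$n" -ge "$lo" ] && [ "$n" -le "$hi" ]
}

# Status for a given hit count: 0 (match) inside the range, 1 outside it.
# The bounds 1 and 2 are an assumption about the intended range.
match_status() {
    if in_range "$1" 1 2; then
        return 0
    else
        return 1
    fi
}
```

Calling `in_range "$a" 2 2` reproduces the script as posted, while `match_status` uses the wider bounds 1 and 2.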

Happy Easter

Cassady · June 20, 2013, 2:19pm

Hello all…

It is a blue-Thursday indeed… Noodlesoft updated Hazel with a comprehensive bug-fix and core-system/services revamp, which no-doubt is going to make many people very happy - but alas, I am currently not one of them…

The script that was so kindly sorted out for me above, now no longer works… It appears that Hazel now prefers a shell running bash (if that makes sense), as opposed to sh…

Could anyone here be kind enough to confirm if I need to convert the above script to… bash?? I don’t have the faintest idea what any of this means… And whereas I have every intention of picking it up - that project is on the back-burner until my PhD is completed…

I was using Hazel to batch convert my research-pdf’s into searchable ones, so this issue - timing wise - is pretty inconvenient…

It’s reached the point, where I am seriously considering using time-machine to go back a few days, prior to my Hazel update…

Any suggestions would be very much appreciated!!

McUsrII · June 20, 2013, 3:32pm

Hello.

Please post whatever you have been using in Hazel that has now ceased to work, and we’ll see what we can do. What is above should really run under bash as well, because sh is indeed bash.
So please post what you have and the errors you got, word for word, and we’ll see if we can fix it.

Edit

Please go back to this post again; there you will see in the first image that the shell they use is /bin/bash. Try changing yours to the same, and see if that cures your problem.

I think this might be something else, but that would be the easy solution, and easy solutions are the best when they work.

hth

But please do come back soon with your error message, if that doesn’t fix it.