Watch Folder with Text Extraction? Looking for ideas.

Trying to figure out a workflow with scripting.

Hypothetically it would go something like this:

  • User opens a PDF file in Adobe Acrobat and then saves file as Plain Text into a watched folder.

  • The watched folder software would then open the plain text file and extract data

  • The extracted data would then be entered/pasted into a search field of a file search application (adding “OR” after each entry)

Is this feasible??

What software would I use to watch the folder and extract the data?

The plain text data file is very basic (example). I need to capture the 7 “digits” following "[Ad Number].pdf "
and then use the entries as my search data. The field following [Ad Number].pdf will always be 7 characters; however, if that field contains anything other than numbers, it should be ignored (try as number).

[Ad Number].pdf 1958411 FULL CLIENT NAME1
[Ad Number].pdf 1989484 FULL CLIENT NAME2
[Ad Number].pdf 1875427 FULL CLIENT NAME3
[Ad Number].pdf 1988399 FULL CLIENT NAME4
[Ad Number].pdf 1971904 FULL CLIENT NAME5
[Ad Number].pdf 1962956 FULL CLIENT NAME6
[Ad Number].pdf 1988708 FULL CLIENT NAME7
[Ad Number].pdf 1977779 FULL CLIENT NAME8
[Ad Number].pdf 1956732 FULL CLIENT NAME9
[Ad Number].pdf 1986373 FULL CLIENT NAME10

And for those of you following along: YES, this is a continuation of my previous post regarding HoudahSpot. Yvan, T.spoon. Thank you everyone for your help!!

Rick260

You may try :

(*
http://macscripter.net/post.php?tid=45794
*)
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

on findPattern:thePattern inString:theString
	set theNSString to current application's NSString's stringWithString:theString
	set theOptions to ((current application's NSRegularExpressionDotMatchesLineSeparators) as integer) + ((current application's NSRegularExpressionAnchorsMatchLines) as integer)
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:theOptions |error|:(missing value)
	set theFinds to theRegEx's matchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()}
	set theResult to {} -- we will add to this
	repeat with i from 1 to count of theFinds
		set theRange to (item i of theFinds)'s range()
		set end of theResult to (theNSString's substringWithRange:theRange) as string
	end repeat
	return theResult
end findPattern:inString:

on adding folder items to this_folder after receiving these_items
	# read the contents of the first added item
	set theDatas to read (item 1 of these_items)
	# extract every group of 7 digits from the data
	set searchNumbers to its findPattern:"[0-9][0-9][0-9][0-9][0-9][0-9][0-9]" inString:theDatas
	# concatenate the numbers separated by " or "
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to " or "
	set searchtext to searchNumbers as text
	set AppleScript's text item delimiters to oTIDs
	
	(*
tell application "HoudahSpot4"
	activate
	search searchText
end tell
*)
end adding folder items to

Yvan KOENIG running Sierra 10.12.5 in French (VALLAURIS, France) Wednesday 21 June 2017 15:44:18

Amazing! That works!! :smiley:

I'm not sure what the script is “saying” but it works perfectly!!!

Thank you so very very much! Rick

Thanks for the feedback. I added some comments in message #2.
If the groups of 7 digits aren’t allowed to start with a 0, change the instruction that extracts them to:

 set searchNumbers to its findPattern:"[1-9][0-9][0-9][0-9][0-9][0-9][0-9]" inString:theDatas

Yvan KOENIG running Sierra 10.12.5 in French (VALLAURIS, France) Wednesday 21 June 2017 16:25:01

unfortunately i think i found a bug. :frowning:

in my original post i included a “modified” data file (i can not post the complete data file publicly)

the bug is my fault … i deleted data that needs to be “ignored” by the script

new data file example:

[Ad Number].pdf 1988376 FULL client
[Ad Number].pdf 1988789 FULL client
[Ad Number].pdf 1988817 FULL client
[Ad Number].pdf 1971769 FULL client
AdServices:DisplayAds_HiRes:1963064.pdf 1963064 FULL client
AdServices:DisplayAds_HiRes:1963368.pdf 1963368 FULL client
AdServices:DisplayAds_HiRes:1963371.pdf 1963371 FULL client
AdServices:DisplayAds_HiRes:1981696.pdf 1981696 FULL client
AdServices:DisplayAds_HiRes:1987154.pdf 1987154 FULL client
AdServices:DisplayAds_HiRes:1962956.pdf 1962956 FULL client
AdServices:DisplayAds_HiRes:1985269.pdf 1985269 FULL client
AdServices:DisplayAds_HiRes:1988154.pdf 1988154 FULL client
AdServices:DisplayAds_HiRes:1988387.pdf 1988387 FULL client

  • I need the data (the 7 digits) from lines beginning with “[Ad Number].pdf”

  • lines beginning with “AdServices:DisplayAds_HiRes:” can be ignored

  • the current script is capturing the 7 digit number from ALL lines

i am sorry this is my fault

Don’t worry. In fact it was what I coded at first.

(*
http://macscripter.net/post.php?tid=45794
*)
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

on findPattern:thePattern inString:theString
	set theNSString to current application's NSString's stringWithString:theString
	set theOptions to ((current application's NSRegularExpressionDotMatchesLineSeparators) as integer) + ((current application's NSRegularExpressionAnchorsMatchLines) as integer)
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:theOptions |error|:(missing value)
	set theFinds to theRegEx's matchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()}
	set theResult to {} -- we will add to this
	repeat with i from 1 to count of theFinds
		set theRange to (item i of theFinds)'s range()
		set end of theResult to (theNSString's substringWithRange:theRange) as string
	end repeat
	return theResult
end findPattern:inString:

on adding folder items to this_folder after receiving these_items
	# read the contents of the first added item
	set theDatas to read (item 1 of these_items)
	# extract every group of a space, 7 digits and a space from the data
	set searchNumbers to its findPattern:"[ ][0-9][0-9][0-9][0-9][0-9][0-9][0-9][ ]" inString:theDatas
	# concatenate the "numbers" separated by "or"
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "or"
	set searchtext to searchNumbers as text
	set AppleScript's text item delimiters to oTIDs
	
	(*
tell application "HoudahSpot4"
	activate
	search searchText
end tell
*)
end adding folder items to

Yvan KOENIG running Sierra 10.12.5 in French (VALLAURIS, France) Wednesday 21 June 2017 17:03:55

If you have AppleScript Toolbox installed:

on adding folder items to this_folder after receiving these_items
	set searchNumbers to AST find regex " [0-9]{7} " in file (item 1 of these_items)
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "or"
	set searchtext to searchNumbers as text
	set AppleScript's text item delimiters to oTIDs
	
	(*
tell application "HoudahSpot4"
   activate
   search searchText
end tell
*)
end adding folder items to

Thanks DJ. You are more at ease with regex than me.

In message #6, the search instruction may be shortened to:

set searchNumbers to its findPattern:" [0-9]{7} " inString:theDatas

Yvan KOENIG running Sierra 10.12.5 in French (VALLAURIS, France) Wednesday 21 June 2017 19:10:35

still acting bugged

text file before scripting (i included the page header. don’t know if that makes a difference)

XYZ Fri 06-23-17_26p10-10-6.als Runsheet: Wednesday, June 21, 2017 File Name Ad Number Zon e Nam e Col umns Dept h Comme nt Col or On Page
[Ad Number].pdf 1989897 –
[Ad Number].pdf 1989212 FULL zzzzzz
[Ad Number].pdf 1971708 FULL zzzzzz
[Ad Number].pdf 1984276 FULL zzzzzz
[Ad Number].pdf 1989509 FULL zzzzzz
[Ad Number].pdf 1984935 FULL zzzzzz
AdServices:DisplayAds_HiRes:1963064.pdf 1963064 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1963368.pdf 1963368 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1963371.pdf 1963371 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1972431.pdf 1972431 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1988918.pdf 1988918 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1989484.pdf 1989484 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1987813.pdf 1987813 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1977779.pdf 1977779 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1988387.pdf 1988387 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1950976.pdf 1950976 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1950978.pdf 1950978 FULL xxxxxxxxxxx
AdServices:DisplayAds_HiRes:1950979.pdf 1950979 FULL xxxxxxxxxxx

after scripting:

1989897 or 1989212 or 1971708 or 1984276 or 1989509 or 1984935 or 1963064 or 1963368 or 1963371 or 1972431 or 1988918 or 1989484 or 1987813 or 1977779 or 1988387 or 1950976 or 1950978 or 1950979

expected result:

1989897 or 1989212 or 1971708 or 1984276 or 1989509 or 1984935

This regex string may do it:

"(?<!AdServices:DisplayAds_HiRes:[0-9]{7}\\.pdf\\s{0,5})\\s[0-9]{7}\\s"

It’s a white space, seven digits, and another space which aren’t preceded by “AdServices:DisplayAds_HiRes:”, seven digits, “.pdf”, and zero to five other spaces.

I will try it. I want to make sure i edit the correct line

change:

set searchNumbers to its findPattern:"[ ][0-9][0-9][0-9][0-9][0-9][0-9][0-9][ ]" inString:theDatas

to:

set searchNumbers to its findPattern:"(?<!AdServices:DisplayAds_HiRes:[0-9]{7}\\.pdf\\s{0,5})\\s[0-9]{7}\\s" inString:theDatas

It’s that rick.

Just a question: are the leading and trailing space characters a problem?
At this time the script returns :
" 1989897 or 1989212 or 1971708 or 1984276 or 1989509 or 1984935 "

You may drop them with :

set searchNumbers to its findPattern:"(?<!AdServices:DisplayAds_HiRes:[0-9]{7}\\.pdf\\s{0,5})\\s[0-9]{7}\\s" inString:theDatas
# concatenate the "numbers" separated by "or"
set oTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to "or"
set searchtext to searchNumbers as text
set AppleScript's text item delimiters to oTIDs
set searchtext to text 2 thru -2 of searchtext

I would appreciate explanations about the regex posted by Nigel.

Yvan KOENIG running Sierra 10.12.5 in French (VALLAURIS, France) Wednesday 21 June 2017 20:28:50
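
A side note on the last line of that snippet: "text 2 thru -2" raises an error when nothing was matched, because the joined string is then empty. Here is a minimal sketch of one way around it, trimming each match before joining. The handler name joinMatches is hypothetical (it is not part of the scripts above), and it assumes each match carries exactly one white-space character on each side, which is what the regex returns.

```applescript
-- Hypothetical helper: strips the one white-space character on each side
-- of every match, then joins the bare numbers with " or ".
on joinMatches(searchNumbers)
	-- Nothing matched: return an empty search string instead of erroring
	if searchNumbers is {} then return ""
	set cleaned to {}
	repeat with aMatch in searchNumbers
		-- " 1989897 " becomes "1989897"
		set end of cleaned to text 2 thru -2 of (contents of aMatch)
	end repeat
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to " or "
	set searchText to cleaned as text
	set AppleScript's text item delimiters to oTIDs
	return searchText
end joinMatches

joinMatches({" 1989897 ", " 1989212 "}) --> "1989897 or 1989212"
```

Called with an empty list, the handler returns "" so the folder action can decide whether to launch a search at all.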

YES! that worked!!! Great!! THANK YOU!!

What would happen if the script encountered anything other than a number after “[Ad Number].pdf”?

instead of “[Ad Number].pdf 1989897” the script found “[Ad Number].pdf ???” or less than 7 digits? or even blank spaces?

i am attempting to read through the script to see if it checks the result as number but i am curious if it encounters anything other than a 7 digit number

OK.


It’s basically in two parts. The section at the end:

"\\s[0-9]{7}\\s"

is similar to your original regex, except that “\s” is a shorthand for any white-space character. This can be a space, tab, or line ending, but I don’t think it will cause any false hits if rick_260’s accurately described his text. In the actual regex, of course, there’s only one backslash in front of the “s”, but we use two to escape it in an AppleScript string.

The other part is the section at the beginning:

"(?<!AdServices:DisplayAds_HiRes:[0-9]{7}\\.pdf\\s{0,5})"

A parenthesised section of regex beginning with “(?<!” is what’s known as a “negative lookbehind”. It doesn’t count as part of the match itself, but means that text matching the regex we do want to match (ie. “\s[0-9]{7}\s”) doesn’t count as a match if it’s preceded by text matching the contents of the negative lookbehind.

Two points of interest within this particular negative lookbehind:

  1. “\.”: The dot’s escaped so that it counts as a literal full stop (or period) and not the regex metacharacter “.” which stands for any character.
  2. “\s{0,5}”: As posted, rick_260’s text has two spaces between each PDF name and the following number. In order to hedge my bets over what’s actually in the text, I’ve allowed for up to six white-space characters between the file name and the number: the space immediately before the number and up to five before that. It’s not permitted to use unlimited repeats such as “*” or “+” in a lookbehind, but it’s OK to specify a possible range. So “\s{0,5}” limits the possible number of white-space characters to be matched to between zero and five.

As it stands, the regex matches every sequence of seven digits bracketed by white space characters wherever they occur in the text, except after the file paths you’ve asked to have excluded. Anything after “[Ad Number].pdf” which isn’t such a sequence won’t be matched. Anyone writing a regex to match any situation other than the one you’ve already described will need to know exactly what’s there, format-wise.
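
To see the lookbehind at work outside the folder action, the sketch below runs the same pattern over three made-up lines. The handler name matchesIn and the sample text are assumptions for illustration only; the Foundation calls are the ones already used in the findPattern:inString: handler earlier in the thread.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical test handler: returns every match of the lookbehind regex
on matchesIn(sampleText)
	set thePattern to "(?<!AdServices:DisplayAds_HiRes:[0-9]{7}\\.pdf\\s{0,5})\\s[0-9]{7}\\s"
	set theNSString to current application's NSString's stringWithString:sampleText
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set theFinds to theRegEx's matchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()}
	set theResult to {}
	repeat with i from 1 to count of theFinds
		set theRange to (item i of theFinds)'s range()
		set end of theResult to (theNSString's substringWithRange:theRange) as text
	end repeat
	return theResult
end matchesIn

-- Line 1 should match, line 2 is excluded by the lookbehind,
-- line 3 has a letter in the number field so it is never a candidate.
set sampleText to "[Ad Number].pdf 1989897 FULL zzzzzz" & linefeed
set sampleText to sampleText & "AdServices:DisplayAds_HiRes:1963064.pdf 1963064 FULL xxxxxxxxxxx" & linefeed
set sampleText to sampleText & "[Ad Number].pdf 19A9897 FULL zzzzzz"
matchesIn(sampleText)
```

With that sample, only the number on the first line should come back, still wrapped in its two white-space characters.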

If a row doesn’t contain a group of 7 digits enclosed between spaces, nothing would be extracted from it.

Yvan KOENIG running Sierra 10.12.5 in French (VALLAURIS, France) Wednesday 21 June 2017 21:41:54
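
For anyone who prefers to make the "7 characters, digits only, otherwise ignore" rule explicit rather than leave it to the regex, here is a rough sketch in plain AppleScript. It is a different technique from the regex approach above: it walks the text line by line and only looks at lines beginning with "[Ad Number].pdf". The handler names adNumbersFrom and isSevenDigits are hypothetical, and the sketch assumes the number is the first field after the file name, separated by spaces or tabs.

```applescript
-- Hypothetical helper: returns the 7-digit ad numbers found on lines
-- that begin with "[Ad Number].pdf"; every other line is ignored.
on adNumbersFrom(theText)
	set thePrefix to "[Ad Number].pdf"
	set theNumbers to {}
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {space, tab}
	repeat with aLine in paragraphs of theText
		set lineText to contents of aLine
		if lineText starts with thePrefix and (length of lineText) > (length of thePrefix) then
			set theRest to text ((length of thePrefix) + 1) thru -1 of lineText
			-- Only the first non-empty field after the file name is inspected
			repeat with aField in text items of theRest
				set fieldText to contents of aField
				if fieldText is not "" then
					if my isSevenDigits(fieldText) then set end of theNumbers to fieldText
					exit repeat
				end if
			end repeat
		end if
	end repeat
	set AppleScript's text item delimiters to oTIDs
	return theNumbers
end adNumbersFrom

-- True only for text made of exactly seven characters, all of them 0-9
on isSevenDigits(fieldText)
	if (length of fieldText) is not 7 then return false
	repeat with aChar in characters of fieldText
		if (contents of aChar) is not in "0123456789" then return false
	end repeat
	return true
end isSevenDigits

set sampleText to "[Ad Number].pdf 1989897 FULL zzzzzz" & linefeed
set sampleText to sampleText & "AdServices:DisplayAds_HiRes:1963064.pdf 1963064 FULL xxxxxxxxxxx" & linefeed
set sampleText to sampleText & "[Ad Number].pdf 19A9897 FULL zzzzzz"
adNumbersFrom(sampleText) --> {"1989897"}
```

The returned list has no surrounding spaces, so it can be joined with " or " directly using text item delimiters as in the scripts above.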