Monday, July 22, 2019

#1 2019-07-11 09:16:46 pm

t.spoon
Member
From: BFE, Massachusetts
Registered: 2013-01-13
Posts: 411

Slow Text Processing removing leading or trailing text

I stayed stuck in an old discontinued Recipe database for too long (Yum!)... I can't find anything that imports its files... and it doesn't seem to export as anything that other recipe managers I've looked at want to import.

So... lots of text processing. You people here who are aces with RegEx probably could have knocked out this project in 10 minutes... heck, I'm starting to think I could have done it faster if I'd just learned RegEx. Anyway, if all you have is a hammer...

So, one of my functions is this:

Applescript:

-- "position" should be "front," "back," or "both" to specify whether trimming the fronts or backs of each line
on truncate_lines_if(theText, truncateText, position)
   set oRet to "
"

   set theOutput to ""
   set charCount to count of characters of truncateText
   if position is "front" or position is "both" then
       repeat with aLine in paragraphs of theText
           if aLine starts with truncateText then
               set newLine to text (charCount + 1) through end of aLine
           else
               set newLine to aLine
           end if
           set theOutput to theOutput & newLine & oRet
       end repeat
       return theOutput
   else if position is "back" or position is "both" then
       repeat with aLine in paragraphs of theText
           if aLine ends with truncateText then
               set lineCount to (count of characters of contents of aLine)
               set newLine to text 1 through (lineCount - charCount) of aLine
           else
               set newLine to aLine
           end if
           set theOutput to theOutput & newLine & oRet
       end repeat
   end if
   return theOutput
end truncate_lines_if

I was already doing about a hundred manipulations on the whole database and my script run time was 0.18 seconds... adding a single call to this function took the run time up to half a minute. Which would be OK, but I plan to call this a dozen times... and it would still be fine if I were only doing that once, but this database conversion program is very trial-and-error-y: I run, check for problems, change code, repeat.

So, gurus, how do I make this function fast? I figure probably with TID's, but since I sometimes need to leave a final or initial instance, and there could be an arbitrary number of instances in the text I don't need to remove, I didn't instantly see how.

Thanks,

Tom.


Hackintosh built February, 2012 |  Mac OS Sierra
GIGABYTE GA-Z68X-UD3H-B3 | Core i5 2500k | 16 GB DDR3 | GIGABYTE Geforce 1050 TI 4GB
250 GB Samsung 850 EVO | 4 TB RAID
Dell Ultrasharp U3011 | Dell Ultrasharp 2007FPb


#2 2019-07-12 12:47:05 am

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 5763

Re: Slow Text Processing removing leading or trailing text

First, your if construction is problematic -- for a "both" value, the else part is never called.
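A minimal sketch of one way to restructure the original vanilla handler so that "both" reaches the back-trimming code, assuming the two checks can simply run one after the other on each line. Unlike the original, it collects the lines in a list and joins them with linefeeds, so nothing is appended after the last line; that choice is an assumption, not something from the original post:

```applescript
-- Sketch only: same parameters as the original handler, but the front and
-- back checks are independent, so "both" trims each end of a line in turn.
on truncate_lines_if(theText, truncateText, position)
    set charCount to (count truncateText)
    set outLines to {}
    repeat with aLine in paragraphs of theText
        set newLine to contents of aLine
        if (position is in {"front", "both"}) and (newLine starts with truncateText) then
            if (count newLine) > charCount then
                set newLine to text (charCount + 1) thru -1 of newLine
            else
                set newLine to "" -- the line was nothing but truncateText
            end if
        end if
        if (position is in {"back", "both"}) and (newLine ends with truncateText) then
            if (count newLine) > charCount then
                set newLine to text 1 thru -(charCount + 1) of newLine
            else
                set newLine to ""
            end if
        end if
        set end of outLines to newLine
    end repeat
    -- Join once at the end instead of concatenating inside the loop.
    set astid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to linefeed
    set theOutput to outLines as text
    set AppleScript's text item delimiters to astid
    return theOutput
end truncate_lines_if
```

This only addresses the logic of the if construction; it is still a line-by-line loop.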

Here's a regex handler:

Applescript:

use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions

on truncate_lines_if(theText, truncateText, position)
   set escapedPattern to (current application's NSRegularExpression's escapedPatternForString:truncateText) as text
   if position is "front" then
       set finalPattern to "(?m)^" & escapedPattern
       set repl to ""
   else if position is "back" then
       set finalPattern to "(?m)" & escapedPattern & "$"
       set repl to ""
   else
       set finalPattern to "(?m)^" & escapedPattern & "(.*)" & escapedPattern & "$"
       set repl to "$1"
   end if
   set theText to current application's NSString's stringWithString:theText
   set theText to theText's stringByReplacingOccurrencesOfString:finalPattern withString:repl options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
   return theText as string
end truncate_lines_if

If you're actually trimming whitespace, this can be done more quickly and simply.


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com


#3 2019-07-12 01:03:12 am

KniazidisR
Member
Registered: 2019-03-03
Posts: 234

Re: Slow Text Processing removing leading or trailing text

Why reinvent the wheel? Here's ready-made code from Apple:

Applescript:


on trimText(theText, theCharactersToTrim, theTrimDirection)
   set theTrimLength to length of theCharactersToTrim
   if theTrimDirection is in {"beginning", "both"} then
       repeat while theText begins with theCharactersToTrim
           try
               set theText to characters (theTrimLength + 1) thru -1 of theText as string
           on error
               -- text contains nothing but trim characters
               return ""
           end try
       end repeat
   end if
   if theTrimDirection is in {"end", "both"} then
       repeat while theText ends with theCharactersToTrim
           try
               set theText to characters 1 thru -(theTrimLength + 1) of theText as string
           on error
               -- text contains nothing but trim characters
               return ""
           end try
       end repeat
   end if
   return theText
end trimText


macOS Mojave -- version 10.14.4
Safari -- version 12.1


#4 2019-07-12 01:15:43 am

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 5763

Re: Slow Text Processing removing leading or trailing text

KniazidisR wrote:

Why reinvent the wheel?



Because it's an ancient wheel that wasn't very efficient even in its day, and because it's actually doing something different from the OP's handler anyhow (it's not working on paragraphs, and it's removing repeatedly).



#5 2019-07-12 01:22:02 am

KniazidisR
Member
Registered: 2019-03-03
Posts: 234

Re: Slow Text Processing removing leading or trailing text

Well, everything is clear. OK, OK, I will now be more suspicious of Apple lol

Last edited by KniazidisR (2019-07-12 01:23:36 am)



#6 2019-07-12 03:36:29 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 4911

Re: Slow Text Processing removing leading or trailing text

t.spoon wrote:

So, gurus, how do I make this function fast? I figure probably with TID's, but since I sometimes need to leave a final or initial instance, and there could be an arbitrary number of instances in the text I don't need to remove, I didn't instantly see how.


Here's a TIDs version for fun. Like t.spoon's original, it enforces a line ending character in the output that's set in the script. Like Shane's script, it can be adapted to leave final or initial instances as required. It differs from Shane's in its treatment of overlapping instances — say a line has three spaces and two spaces have to be removed from both the beginnings and ends of lines. Shane's leaves the line alone; this deletes only the first two spaces. The required action in this case needs to be defined if the situation's likely to arise:

Applescript:

on truncate_lines_if(theText, truncateText, position)
   set truncateTextLength to (count truncateText)
   set oRet to "
"

   set astid to AppleScript's text item delimiters
   if ((position is "front") or (position is "both")) then
       -- Replace any instance of CRLF, LF, or CR followed by truncateText with the line ending set in oRet.
       set AppleScript's text item delimiters to {return & linefeed & truncateText, linefeed & truncateText, return & truncateText}
       set textItems to theText's text items
       set AppleScript's text item delimiters to oRet
       set theText to textItems as text
       -- Remove truncateText from the beginning of theText if it occurs there.
       if (theText begins with truncateText) then
           if ((count theText) > truncateTextLength) then
               set theText to text (truncateTextLength + 1) thru -1 of theText
           else
               set theText to ""
           end if
       end if
   end if
   if ((position is "back") or (position is "both")) then
       -- Replace any instance of truncateText followed by CRLF, LF, or CR with the line ending set in oRet.
       set AppleScript's text item delimiters to {truncateText & return & linefeed, truncateText & linefeed, truncateText & return}
       set textItems to theText's text items
       set AppleScript's text item delimiters to oRet
       set theText to textItems as text
       -- Remove truncateText from the end of theText if it occurs there.
       if (theText ends with truncateText) then
           if ((count theText) > truncateTextLength) then
               set theText to text 1 thru -(truncateTextLength + 1) of theText
           else
               set theText to ""
           end if
       end if
   end if
   set AppleScript's text item delimiters to astid
   
   return theText
end truncate_lines_if


NG


#7 2019-07-12 06:37:15 am

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 5763

Re: Slow Text Processing removing leading or trailing text

Nigel Garvey wrote:

It differs from Shane's in its treatment of overlapping instances



And its treatment of case, in the (probably unlikely) event that that matters.



#8 2019-07-12 08:04:44 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 4911

Re: Slow Text Processing removing leading or trailing text

Shane Stanley wrote:

And its treatment of case, in the (probably unlikely) event that that matters.


Ah yes. The TIDs method can be made case sensitive by enclosing the call to the handler in a 'considering case' statement. The regex can be made case insensitive by employing some method to change the "(?m)"s to "(?im)" or "(?mi)".
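A small sketch of the 'considering case' point, using a hypothetical one-line helper (stripPrefix is not from this thread) rather than the full handler. It assumes the default AppleScript behaviour of ignoring case in text comparisons, and relies on the attribute carrying through to handlers called inside the statement:

```applescript
-- Hypothetical helper for illustration: remove thePrefix from the start of theLine.
on stripPrefix(theLine, thePrefix)
    if (theLine starts with thePrefix) and ((count theLine) > (count thePrefix)) then
        return text ((count thePrefix) + 1) thru -1 of theLine
    end if
    return theLine
end stripPrefix

-- By default the comparison ignores case, so the lower-case prefix is removed.
set looseResult to stripPrefix("recipe: Alevropita", "RECIPE:") --> " Alevropita"

-- Inside 'considering case' the same call leaves the line alone.
considering case
    set strictResult to stripPrefix("recipe: Alevropita", "RECIPE:") --> "recipe: Alevropita"
end considering

{looseResult, strictResult}
```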

I've just noticed that your "both" regex only works if there are indeed instances of truncateText at both ends of the line. An alternative would be to use an OR pattern there, but then the "overlap" behaviour would be the same as with the TIDs handler:

Applescript:

use AppleScript version "2.4" -- Mac OS 10.10 or later
use framework "Foundation"
use scripting additions

on truncate_lines_if(theText, truncateText, position)
   -- Set case insensitivity or sensitivity according to the current situation in AS and "^" and "$" to match the beginnings and ends of lines.
   if ("A" = "a") then
       set flagOptions to "(?im)"
   else
       set flagOptions to "(?m)"
   end if
   set escapedPattern to (current application's NSRegularExpression's escapedPatternForString:truncateText) as text
   if position is "front" then
       set finalPattern to flagOptions & "^" & escapedPattern
   else if position is "back" then
       set finalPattern to flagOptions & escapedPattern & "$"
   else
       set finalPattern to flagOptions & "^" & escapedPattern & "|" & escapedPattern & "$" -- OR regex pattern.
   end if
   set theText to current application's NSString's stringWithString:theText
   -- Replace any and every match with "".
   set theText to theText's stringByReplacingOccurrencesOfString:finalPattern withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
   return theText as string
end truncate_lines_if



#9 2019-07-12 10:51:01 am

t.spoon
Member
From: BFE, Massachusetts
Registered: 2013-01-13
Posts: 411

Re: Slow Text Processing removing leading or trailing text

Thanks so much everybody. Thanks for catching my stupid error on the "if," Shane.

For my use, I really only needed to scrape text off the beginning, I'm just in the {good? bad?} habit of trying to make handlers a little more general purpose as I write them.

Nigel, thanks - don't know what was wrong with me that it didn't occur to me to consider the return character combined with what I'm finding as the TID's instead of processing it line-by-line. It seems so obvious as soon as I see it. I really was too tired to be coding at the time. Also, didn't know I could use "linefeed" to get the "other return character."

Anyway, using your handler. Handler run-time went from ≈30 seconds to 0.04 seconds. Can't complain about that.

I'd been meaning to migrate my recipe database to something modern forever. I'm going on vacation tomorrow and didn't even want to take my laptop, then thought "crap, if I don't take my laptop, I won't have my recipes." Of course, I could just take the exported text file of the database on my phone... clunky but functional. But like an idiot, I thought "eh, it won't take me too long to harass this text dump into YAML formatting and import it."

Whenever there's a question on here about text formatting, it seems like even after there's a perfectly good working solution posted, there's a pile-on of additional great solutions. Which makes me think you guys must just be doing this for the fun of it.

So, while I'm sure I can beat this thing into shape on my own after I get back from vacation, just in case anybody else feels like finishing my work for me, I thought I'd post the problem and what I've got. You RegEx people who can do this entire conversion in a single line of endless symbols, please don't laugh at my clunky code.

Aside from posting this here in case anyone else has a similar problem, I thought I'd also send off the final version to the developer just in case he wants to add it to his software to welcome any other Yum! migrants.

So, I'm going from Yum! Recipe Manager to Paprika.

Here's a sample of two recipes worth of what Yum! spits out on export:

RECIPE:  (Not So Bad For You) Chocolate Chip Cookies

SOURCE:  Tom's Brain
CATEGORIES:  Dessert
PREHEAT:  350
MAKES:  12 cookies

INGREDIENTS:

    7/8 cup    whole wheat flour
    1/2 cup    oats
    1/4 cup    wheat germ
    1/2 cup    dried coconut
    1 cup    dark chocolate chips
    4 tbsp    cocoa
    1/3 cup    dark brown sugar
    1/3 cup    honey
    1/2 cup    canola oil
    1/2 teasp    salt
    1/2 teasp    baking soda
    1 lrg    egg
       

METHOD:

Mix & Bake

====================  Yum - http://yum-mac.com/  ====================

RECIPE:  Alevropita

SOURCE:  Saveur (modified slightly)
CATEGORIES:  Appetizer, Bread
PREHEAT:  500
MAKES:  8 -10

INGREDIENTS:

    6 tbsp    extra-virgin olive oil
    2 teasp    vodka
    2    eggs
    1 1/2 cups    flour
    1/4 teasp    salt
    1/4 teasp    baking powder
    8 oz    feta, crumbled
    1 tbsp    butter, small cubes

METHOD:

Notes:  I used double the egg and baking powder compared to the original, which is reflected above.  A half recipe almost fills the toaster oven pan.  The olive oil can be halved if using parchment paper in the pan.

1. Heat oven to 500°. Put an 18" x 13" x 1" rimmed baking sheet into oven for 10 minutes.

2. Meanwhile, whisk together 2 tbsp. oil, vodka, egg, and 1 cup water in a bowl. In a separate bowl, whisk flour, salt, and baking powder. Pour wet mixture over dry mixture and whisk until smooth.

3. Brush remaining oil over bottom of hot pan and add batter, smoothing batter with a rubber spatula to coat the bottom evenly, if necessary. Distribute cheese evenly over batter, and dot with butter. Bake, rotating baking sheet halfway through, until golden brown and crunchy, about 20 minutes. Let cool slightly before slicing and serving.

SERVES 8 – 10


====================  Yum - http://yum-mac.com/  ====================



And here's what Paprika wants as YAML input:

- name: My Tasty Recipe
  servings: 4-6 servings
  source: Food Network
  source_url: http://www.google.com
  prep_time: 10 min
  cook_time: 30 min
  nutritional_info: 500 calories
  on_favorites: yes
  categories: [Dinner, Holiday]
  difficulty: Easy
  rating: 5
  notes: |
    This is delicious!!!
  ingredients: |
    1/2 lb meat
    1/2 lb vegetables
    salt
    pepper
    2 tbsp olive oil
    4 cups flour
  directions: |
    Mix things together.
    Eat.
    Tasty.
    Yum yum yum.
   
- name: My Tasty Recipe 2
  servings: 4-6 servings
  source: Food Network
  source_url: http://www.google.com
  prep_time: 10 min
  cook_time: 30 min
  nutritional_info: 500 calories
  difficulty: Easy
  rating: 5
  photo: | (base-64 encoded image)
  notes: |
    This is delicious!!!
  ingredients: |
    1/2 lb meat
    1/2 lb vegetables
    salt
    pepper
    2 tbsp olive oil
    4 cups flour
  directions: |
    Mix things together.
    Eat.
    Tasty.
    Yum yum yum.



Some notes on what I've got done and what needs to be done:
Done:
Replaced Key Values
Added a pipe for multi-line values
Moved Information for keys that don't exist in new format to "notes" section
Converted tabs to spaces

To Do:
fix leading whitespace on each line
remove all colons that don't follow keys
remove lines that only contain blank characters
?
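
For the "remove lines that only contain blank characters" item, a vanilla sketch might look like the following. The handler name dropBlankLines is hypothetical, "blank" is assumed to mean spaces and tabs only, and the output is assumed to use linefeed line endings. It is not tuned for speed:

```applescript
-- Sketch only: keep every line that has at least one non-blank character.
on dropBlankLines(theText)
    set keptLines to {}
    repeat with aLine in paragraphs of theText
        set lineText to contents of aLine
        set isBlank to true -- an empty line stays "blank"
        repeat with aChar in characters of lineText
            if (contents of aChar) is not in {space, tab} then
                set isBlank to false
                exit repeat
            end if
        end repeat
        if not isBlank then set end of keptLines to lineText
    end repeat
    -- Join the surviving lines with linefeeds, restoring the delimiters afterwards.
    set astid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to linefeed
    set theOutput to keptLines as text
    set AppleScript's text item delimiters to astid
    return theOutput
end dropBlankLines
```

Since blank lines also separate the recipes in the export, this would probably need to run only after the recipe boundaries have been converted.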

There's a handy YAML online parser to validate the format here:
http://yaml-online-parser.appspot.com

Anyway, if anyone feels like taking a crack at it for fun, please do. Otherwise I'll finish it when I get back.

Thanks,

t.spoon

... well, that's a first. Macscripter wouldn't let me post my script because "it's too many bytes."

Because it includes the recipe database... still, it's under a megabyte.

Here it is:

https://www.dropbox.com/s/lx6j03bftrnho … .scpt?dl=0



#10 2019-07-12 06:17:33 pm

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 5763

Re: Slow Text Processing removing leading or trailing text

Nigel Garvey wrote:

I've just noticed that your "both" regex only works if there are indeed instances of truncateText at both ends of the line.



Yes, that's quite a hole :(



#11 2019-07-13 01:48:37 am

CK
Member
From: UK
Registered: 2018-11-04
Posts: 92

Re: Slow Text Processing removing leading or trailing text

Just offering what is actually a refactoring of the OP's original handler aiming to reduce execution time with some—by now—fairly well known AppleScript techniques plus a reduction in redundant/repeat operations:

Applescript:

to trimlines by phrase as text out of input as text from place : left
       local input, phrase, place
       
       if the phrase is not in the input then return the input
       set N to 1 + the (phrase's length)
       
       set a to (place contains left) as integer
       set b to ((place contains right) as integer) * 2
       set i to a + b + 1
       
       script sentences
               property list : the input's paragraphs
       end script
       
       script trim
               on _L(str)
                       if str starts with the phrase then ¬
                               return str's text N thru -1
                       str
               end _L
               
               on _R(str)
                       if str ends with the phrase then ¬
                               return str's text 1 thru -N
                       str
               end _R
               
               on _LR(str)
                       tell trim to _L(_R(str))
               end _LR
       end script
       
       script fn
               property func : item i of trim's {_L, _L, _R, _LR}
       end script
       
       repeat with sentence in (a reference to the list of sentences)
               set {untrimmed, trimmed} to {the sentence's contents, ""}
               if phrase ≠ untrimmed then set trimmed to fn's func(untrimmed)
               set the sentence's contents to trimmed
       end repeat
       
       set text item delimiters to linefeed
       return the list of sentences as text
end trimlines

An example call to this handler might be:

Applescript:

trimlines by "some text" out of "some text here.
then some text there.
and also some text.
with some text
but not until
some text is there."
from the left & right

which hopefully lets one deduce the nature of each parameter quite easily, but for clarity: the from parameter could also be just one of those AppleScript constants, i.e. from the left or from the right, and will default to left if the parameter is omitted or invalid.  The other two parameters are, of course, mandatory.  The above call outputs:

" here.
then some text there.
and also some text.
with 
but not until
 is there."

NB. A call to this handler as it stands will result in text item delimiters being set to linefeed.  Some people prefer resetting as they go: I always set them immediately prior to any list → text coercion; adapt as you prefer.
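
For anyone who prefers the "resetting as they go" style, a minimal sketch of saving and restoring the delimiters around the coercion. The handler name joinLines is hypothetical and not part of trimlines:

```applescript
-- Sketch only: join a list of lines with linefeeds and leave the
-- text item delimiters as they were found.
on joinLines(theLines)
    set savedTIDs to AppleScript's text item delimiters
    set AppleScript's text item delimiters to linefeed
    set joined to theLines as text
    set AppleScript's text item delimiters to savedTIDs
    return joined
end joinLines

joinLines({"first line", "second line"})
```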

I haven't benchmarked its performance nor that of the original.

AppleScript: 2.7
Operating System: macOS 10.13

Last edited by CK (2019-07-15 09:12:48 am)


#12 2019-07-13 04:53:20 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 4911

Re: Slow Text Processing removing leading or trailing text

CK wrote:

Applescript:

to trimlines by phrase as text out of input as text from place : left
   local input, phrase, place
   
   set a to (place contains left) as integer
   set b to ((place contains right) as integer) * 2
   set i to a + b + 1
   
   set N to 1 + the (phrase's length)
   if N = 0 then return the input
   
   script sentences
       property list : the input's paragraphs
   end script


Hmmm. An interesting abuse of the reserved terms left, right, and list. And I've not seen that method of implementing optional parameters before. Apparently it was introduced in OS X Yosemite (10.10) and only works for handlers with labelled parameters. I've learned something new today! :) Fooling around with it, it seems that while any or all of the parameters may be optional, at least one labelled value must be passed in the call to the handler or the handler will simply be returned rather than executed. If all the parameters are optional, the one or more labels used in the call don't have to match any of those in the handler definition! (Not in Mojave, anyway.)

Applescript:

on fred from c : "Hello" to d : 5
   return c
end fred

fred by missing value under "aardvark" --> "Hello"

Last edited by Nigel Garvey (2019-07-13 05:03:51 am)



#13 2019-07-13 05:57:32 am

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 5763

Re: Slow Text Processing removing leading or trailing text

Nigel Garvey wrote:

only works for handlers with labelled parameters



And that includes parameters defined in a scripting dictionary (which is why it was introduced).



#14 2019-07-14 02:14:54 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 4911

Re: Slow Text Processing removing leading or trailing text

Shane Stanley wrote:
Nigel Garvey wrote:

only works for handlers with labelled parameters



And that includes parameters defined in a scripting dictionary (which is why it was introduced).


Hi Shane. Thanks for the background.

CK wrote:

I haven't benchmarked its performance nor that of the original.


I've just been comparing the speeds of the three handlers above which return the same results: i.e. the TIDs method (post #6), the revised ASObjC (post #8), and CK's optimised repeat (post #11).

On my iMac, they're all practically the same speed with up to sixteen lines or so. With longer texts, they're still pretty much the same for user experience, but the ASObjC handler consistently times as the fastest of the three (by a few milliseconds) and the optimised repeat as the faster of the two vanilla methods (ditto, by how much depending on how much needs to be done). But there are a couple of exceptions:
• If no trimming actually occurs, the TIDs handler is suddenly faster than the optimised repeat.
• If a line in the text exactly equals the "phrase" to be cut, the optimised repeat errors.
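
The post doesn't say how these timings were taken. One possible approach, sketched here as an assumption rather than the method actually used, is to read NSDate's reference-date interval before and after the call. handlerUnderTest is a hypothetical stand-in for whichever handler is being measured:

```applescript
use AppleScript version "2.4" -- Mac OS 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical stand-in for the handler being timed.
on handlerUnderTest(theText)
    return paragraphs of theText
end handlerUnderTest

-- Build a sample text of 1000 short lines.
set sampleLines to {}
repeat with i from 1 to 1000
    set end of sampleLines to "  line " & i
end repeat
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set sampleText to sampleLines as text
set AppleScript's text item delimiters to astid

-- Time a single call, in seconds with sub-second resolution.
set startTime to current application's NSDate's timeIntervalSinceReferenceDate()
set theResult to handlerUnderTest(sampleText)
set elapsedSeconds to (current application's NSDate's timeIntervalSinceReferenceDate()) - startTime
elapsedSeconds
```

A single run is noisy at millisecond scale, so repeating the call and averaging would give steadier numbers.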



#15 2019-07-14 09:37:33 am

Marc Anthony
Member
From: Dallas, TX
Registered: 2006-04-27
Posts: 885

Re: Slow Text Processing removing leading or trailing text

Hi. I’m not sure that I see the initially posted handler’s—or its descendants’—utility for your situation. Looking at your dropbox file, there are swaths of text with inconsistencies. I would move the project into a grep-aware text editor, such as TextWrangler; this provides an opportunity to learn regular expressions, easily preview changes, and correct spelling and other composition issues. This should get you started:

Applescript:

tell application "TextWrangler"
   tell document 1
       repeat with this in {{"SOURCE:", "source:"}, {"CATEGORIES:", "categories:"}, {"PREHEAT:", "Preheat to"}, {"METHOD:", "directions: |"}, {"INGREDIENTS:", "ingredients: |"}, {"MAKES:", "Makes"}, {"RECIPE:", "- name:"}, {"¼", "1/4"}, {"½", "1/2"}, {"¾", "3/4"}}
           replace (this's item 1) using (this's item 2) searching in it options {starting at top:1, returning results:0} saving no
       end repeat
       
       #GREP conversions
       --potential whitespace followed by returns to return
       --tabs to space
       --lines beginning with equals with potential text to return
       --word spaces to space
       repeat with this in {{"[[:space:]]*\\r+", "\\r"}, {"\\t+", space}, {"^={2,}.*", "\\r"}, {"[ ]{2,}", space}}
           replace (this's item 1) using (this's item 2) searching in it options {search mode:grep, starting at top:1, returning results:0} saving no
       end repeat
       
       activate
   end tell
end tell

--edited for typos

Last edited by Marc Anthony (2019-07-15 04:41:02 pm)


#16 2019-07-15 09:26:05 am

CK
Member
From: UK
Registered: 2018-11-04
Posts: 92

Re: Slow Text Processing removing leading or trailing text

Nigel Garvey wrote:

I've just been comparing the speeds of the three handlers above

Thanks for doing this, Nigel.

Nigel Garvey wrote:

• If a line in the text exactly equals the "phrase" to be cut, the optimised repeat errors.

✅ Fixed.  Thank you for spotting this.

Nigel Garvey wrote:

• If no trimming actually occurs, the TIDs handler is suddenly faster than the optimised repeat.

✅ Added initial check for this special case.

The above two amendments have been applied to the "optimised repeat" handler (trimlines) and the original post (post #11) has been edited to reflect these changes.
