Slow Text Processing removing leading or trailing text

t.spoon · July 12, 2019, 3:16am

I stayed stuck in an old discontinued Recipe database for too long (Yum!)… I can’t find anything that imports its files… and it doesn’t seem to export as anything that other recipe managers I’ve looked at want to import.

So… lots of text processing. You people here who are aces with RegEx probably could have knocked out this project in 10 minutes… heck, I’m starting to think I could have done it faster if I’d just learned RegEx. Anyway, if all you have is a hammer…

So, one of my functions is this:

-- "position" should be "front," "back," or "both" to specify whether trimming the fronts or backs of each line
on truncate_lines_if(theText, truncateText, position)
	set oRet to "
"
	set theOutput to ""
	set charCount to count of characters of truncateText
	if position is "front" or position is "both" then
		repeat with aLine in paragraphs of theText
			if aLine starts with truncateText then
				set newLine to text (charCount + 1) through end of aLine
			else
				set newLine to aLine
			end if
			set theOutput to theOutput & newLine & oRet
		end repeat
		return theOutput
	else if position is "back" or position is "both" then
		repeat with aLine in paragraphs of theText
			if aLine ends with truncateText then
				set lineCount to (count of characters of contents of aLine)
				set newLine to text 1 through (lineCount - charCount) of aLine
			else
				set newLine to aLine
			end if
			set theOutput to theOutput & newLine & oRet
		end repeat
	end if
	return theOutput
end truncate_lines_if

I was already doing about a hundred manipulations on the whole database and my script run time was 0.18 seconds… adding a single call to this function took the run time up to half a minute. Which would be OK, but I plan to call this a dozen times… and it would still be fine if I were only doing that once, but this database conversion program is very trial-and-error-y: I run, check for problems, change code, repeat.

So, gurus, how do I make this function fast? I figure probably with TID’s, but since I sometimes need to leave a final or initial instance, and there could be an arbitrary number of instances in the text I don’t need to remove, I didn’t instantly see how.

Thanks,

Tom.

Shane_Stanley · July 12, 2019, 6:47am

First, your if construction is problematic – for a “both” value, the else part is never called.
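As a minimal sketch of what that fix could look like while keeping the original per-line approach: the front and back trims become two independent tests, so "both" runs each of them on every line. This is an illustration, not code from the thread. It assumes joining the lines with linefeed (no trailing line ending, unlike the original), and it addresses only the logic, not the speed.

```applescript
-- Sketch only: same per-line idea as the original, with "both" handled properly.
on truncate_lines_if(theText, truncateText, position)
	set charCount to (count truncateText)
	set theLines to paragraphs of theText
	set outLines to {}
	repeat with aLine in theLines
		set newLine to contents of aLine
		-- Front trim runs for "front" and "both".
		if position is in {"front", "both"} and newLine starts with truncateText then
			if (count newLine) > charCount then
				set newLine to text (charCount + 1) thru -1 of newLine
			else
				set newLine to ""
			end if
		end if
		-- Back trim is a separate test, so "both" reaches it too.
		if position is in {"back", "both"} and newLine ends with truncateText then
			if (count newLine) > charCount then
				set newLine to text 1 thru -(charCount + 1) of newLine
			else
				set newLine to ""
			end if
		end if
		set end of outLines to newLine
	end repeat
	-- Join with linefeed (an assumption), restoring the delimiters afterwards.
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to linefeed
	set theOutput to outLines as text
	set AppleScript's text item delimiters to astid
	return theOutput
end truncate_lines_if
```

For example, truncate_lines_if("xxabxx" & linefeed & "xxcd", "xx", "both") trims both ends of the first line and the front of the second.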

Here’s a regex handler:

use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions

on truncate_lines_if(theText, truncateText, position)
	set escapedPattern to (current application's NSRegularExpression's escapedPatternForString:truncateText) as text
	if position is "front" then
		set finalPattern to "(?m)^" & escapedPattern
		set repl to ""
	else if position is "back" then
		set finalPattern to "(?m)" & escapedPattern & "$"
		set repl to ""
	else
		set finalPattern to "(?m)^" & escapedPattern & "(.*)" & escapedPattern & "$"
		set repl to "$1"
	end if
	set theText to current application's NSString's stringWithString:theText
	set theText to theText's stringByReplacingOccurrencesOfString:finalPattern withString:repl options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
	return theText as string
end truncate_lines_if

If you’re actually trimming whitespace, this can be done more quickly and simply.
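One possible reading of that remark, sketched with the same NSString call as the handler above: if the goal is to strip spaces and tabs from both ends of every line, a fixed pattern does it in one pass with no escaping step. The handler name trim_line_whitespace is made up for this illustration, and this may not be the method Shane had in mind.

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: remove runs of spaces/tabs at the start and end of each line.
on trim_line_whitespace(theText)
	set theString to current application's NSString's stringWithString:theText
	-- (?m) makes ^ and $ match at line boundaries; the "|" handles both ends in one pass.
	set theString to theString's stringByReplacingOccurrencesOfString:"(?m)^[ \\t]+|[ \\t]+$" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
	return theString as text
end trim_line_whitespace
```

Line endings are left untouched, so the result keeps whatever CR, LF, or CRLF the input used.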

KniazidisR · July 12, 2019, 7:03am

Why reinvent the wheel? Here I present the ready code from Apple:


on trimText(theText, theCharactersToTrim, theTrimDirection)
	set theTrimLength to length of theCharactersToTrim
	if theTrimDirection is in {"beginning", "both"} then
		repeat while theText begins with theCharactersToTrim
			try
				set theText to characters (theTrimLength + 1) thru -1 of theText as string
			on error
				-- text contains nothing but trim characters
				return ""
			end try
		end repeat
	end if
	if theTrimDirection is in {"end", "both"} then
		repeat while theText ends with theCharactersToTrim
			try
				set theText to characters 1 thru -(theTrimLength + 1) of theText as string
			on error
				-- text contains nothing but trim characters
				return ""
			end try
		end repeat
	end if
	return theText
end trimText

Shane_Stanley · July 12, 2019, 7:15am

Because it’s an ancient wheel that wasn’t very efficient even in its day, and because it’s actually doing something different from the OP’s handler anyhow (it’s not working on paragraphs, and it’s removing repeatedly).

KniazidisR · July 12, 2019, 7:22am

Well, everything is clear. OK, OK, I will now be more suspicious with Apple :lol:

Nigel_Garvey · July 12, 2019, 9:36am

Here’s a TIDs version for fun. Like t.spoon’s original, it enforces a line ending character in the output that’s set in the script. Like Shane’s script, it can be adapted to leave final or initial instances as required. It differs from Shane’s in its treatment of overlapping instances — say a line has three spaces and two spaces have to be removed from both the beginnings and ends of lines. Shane’s leaves the line alone; this deletes only the first two spaces. The required action in this case needs to be defined if the situation’s likely to arise:

on truncate_lines_if(theText, truncateText, position)
	set truncateTextLength to (count truncateText)
	set oRet to "
"
	set astid to AppleScript's text item delimiters
	if ((position is "front") or (position is "both")) then
		-- Replace any instance of CRLF, LF, or CR followed by truncateText with the line ending set in oRet.
		set AppleScript's text item delimiters to {return & linefeed & truncateText, linefeed & truncateText, return & truncateText}
		set textItems to theText's text items
		set AppleScript's text item delimiters to oRet
		set theText to textItems as text
		-- Remove truncateText from the beginning of theText if it occurs there.
		if (theText begins with truncateText) then
			if ((count theText) > truncateTextLength) then
				set theText to text (truncateTextLength + 1) thru -1 of theText
			else
				set theText to ""
			end if
		end if
	end if
	if ((position is "back") or (position is "both")) then
		-- Replace any instance of truncateText followed by CRLF, LF, or CR with the line ending set in oRet.
		set AppleScript's text item delimiters to {truncateText & return & linefeed, truncateText & linefeed, truncateText & return}
		set textItems to theText's text items
		set AppleScript's text item delimiters to oRet
		set theText to textItems as text
		-- Remove truncateText from the end of theText if it occurs there.
		if (theText ends with truncateText) then
			if ((count theText) > truncateTextLength) then
				set theText to text 1 thru -(truncateTextLength + 1) of theText
			else
				set theText to ""
			end if
		end if
	end if
	-- Restore the delimiters saved at the top of the handler.
	set AppleScript's text item delimiters to astid
	
	return theText
end truncate_lines_if

Shane_Stanley · July 12, 2019, 12:37pm

And its treatment of case, in the (probably unlikely) event that that matters.

Nigel_Garvey · July 12, 2019, 2:04pm

Ah yes. The TIDs method can be made case sensitive by enclosing the call to the handler in a ‘considering case’ statement. The regex can be made case insensitive by employing some method to change the "(?m)"s to “(?im)” or “(?mi)”.
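A small sketch of the ‘considering case’ effect on text item delimiters, for anyone who wants to see it in isolation. The helper names split_on and count_pieces are invented for the demonstration, and it assumes the script runs with AppleScript's default attributes (case ignored).

```applescript
-- Hypothetical helper: split theText on delim, restoring the delimiters afterwards.
on split_on(theText, delim)
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delim
	set parts to text items of theText
	set AppleScript's text item delimiters to astid
	return parts
end split_on

on count_pieces()
	set sample to "RECIPE: Soup" & linefeed & "recipe: Stew"
	-- Default attributes: case is ignored, so both spellings act as delimiters.
	set looseCount to (count split_on(sample, "recipe: "))
	-- Inside the block, only the exact lower-case spelling acts as a delimiter.
	-- The attribute carries into handlers called from within the block.
	considering case
		set strictCount to (count split_on(sample, "recipe: "))
	end considering
	return {looseCount, strictCount}
end count_pieces
```

The same wrapping works around a call to the TIDs version of truncate_lines_if, since the attribute applies to whatever the handler does while the block is active.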

I’ve just noticed that your “both” regex only works if there are indeed instances of truncateText at both ends of the line. An alternative would be to use an OR pattern there, but then the “overlap” behaviour would be the same as with the TIDs handler:

use AppleScript version "2.4" -- Mac OS 10.10 or later
use framework "Foundation"
use scripting additions

on truncate_lines_if(theText, truncateText, position)
	-- Set case insensitivity or sensitivity according to the current situation in AS and "^" and "$" to match the beginnings and ends of lines.
	if ("A" = "a") then
		set flagOptions to "(?im)"
	else
		set flagOptions to "(?m)"
	end if
	set escapedPattern to (current application's NSRegularExpression's escapedPatternForString:truncateText) as text
	if position is "front" then
		set finalPattern to flagOptions & "^" & escapedPattern
	else if position is "back" then
		set finalPattern to flagOptions & escapedPattern & "$"
	else
		set finalPattern to flagOptions & "^" & escapedPattern & "|" & escapedPattern & "$" -- OR regex pattern.
	end if
	set theText to current application's NSString's stringWithString:theText
	-- Replace any and every match with "".
	set theText to theText's stringByReplacingOccurrencesOfString:finalPattern withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
	return theText as string
end truncate_lines_if

t.spoon · July 12, 2019, 4:51pm

Thanks so much everybody. Thanks for catching my stupid error on the “if,” Shane.

For my use, I really only needed to scrape text off the beginning, I’m just in the {good? bad?} habit of trying to make handlers a little more general purpose as I write them.

Nigel, thanks - don’t know what was wrong with me that it didn’t occur to me to consider the return character combined with what I’m finding as the TID’s instead of processing it line-by-line. It seems so obvious as soon as I see it. I really was too tired to be coding at the time. Also, didn’t know I could use “linefeed” to get the “other return character.”

Anyway, using your handler. Handler run-time went from ≈30 seconds to 0.04 seconds. Can’t complain about that.

I’d been meaning to migrate my recipe database to something modern forever. I’m going on vacation tomorrow and didn’t even want to take my laptop, then thought “crap, if I don’t take my laptop, I won’t have my recipes.” Of course, I could just take the exported text file of the database on my phone… clunky but functional. But like an idiot, I thought “eh, it won’t take me too long to harass this text dump into YAML formatting and import it.”

Whenever there’s a question on here about text formatting, it seems like even after there’s a perfectly good working solution posted, there’s a pile-on of additional great solutions. Which makes me think you guys must just be doing this for the fun of it.

So, while I’m sure I can beat this thing into shape on my own after I get back from vacation, just in case anybody else feels like finishing my work for me, I thought I’d post the problem and what I’ve got. You RegEx people who can do this entire conversion in a single line of endless symbols, please don’t laugh at my clunky code.

Aside from posting this here in case anyone else has a similar problem, I thought I’d also send off the final version to the developer just in case he wants to add it to his software to welcome any other Yum! migrants.

So, I’m going from Yum! Recipe Manager to Paprika.

Here’s a sample of two recipes worth of what Yum! spits out on export:

RECIPE: (Not So Bad For You) Chocolate Chip Cookies

SOURCE: Tom’s Brain
CATEGORIES: Dessert
PREHEAT: 350
MAKES: 12 cookies

INGREDIENTS:
7/8 cup    whole wheat flour
1/2 cup    oats
1/4 cup    wheat germ
1/2 cup    dried coconut
1 cup    dark chocolate chips
4 tbsp    cocoa
1/3 cup    dark brown sugar
1/3 cup    honey
1/2 cup    canola oil
1/2 teasp    salt
1/2 teasp    baking soda
1 lrg    egg
METHOD:

Mix & Bake

==================== Yum - http://yum-mac.com/ ====================

RECIPE: Alevropita

SOURCE: Saveur (modified slightly)
CATEGORIES: Appetizer, Bread
PREHEAT: 500
MAKES: 8 -10

INGREDIENTS:
6 tbsp    extra-virgin olive oil
2 teasp    vodka
2    eggs
1 1/2 cups    flour
1/4 teasp    salt
1/4 teasp    baking powder
8 oz    feta, crumbled
1 tbsp    butter, small cubes
METHOD:

Notes: I used double the egg and baking powder compared to the original, which is reflected above. A half recipe almost fills the toaster oven pan. The olive oil can be halved if using parchment paper in the pan.

Heat oven to 500°. Put an 18" x 13" x 1" rimmed baking sheet into oven for 10 minutes.

Meanwhile, whisk together 2 tbsp. oil, vodka, egg, and 1 cup water in a bowl. In a separate bowl, whisk flour, salt, and baking powder. Pour wet mixture over dry mixture and whisk until smooth.

Brush remaining oil over bottom of hot pan and add batter, smoothing batter with a rubber spatula to coat the bottom evenly, if necessary. Distribute cheese evenly over batter, and dot with butter. Bake, rotating baking sheet halfway through, until golden brown and crunchy, about 20 minutes. Let cool slightly before slicing and serving.

SERVES 8 – 10

==================== Yum - http://yum-mac.com/ ====================

And here’s what Paprika wants as YAML input:

name: My Tasty Recipe
servings: 4-6 servings
source: Food Network
source_url: http://www.google.com
prep_time: 10 min
cook_time: 30 min
nutritional_info: 500 calories
on_favorites: yes
categories: [Dinner, Holiday]
difficulty: Easy
rating: 5
notes: |
This is delicious!!!
ingredients: |
1/2 lb meat
1/2 lb vegetables
salt
pepper
2 tbsp olive oil
4 cups flour
directions: |
Mix things together.
Eat.
Tasty.
Yum yum yum.

name: My Tasty Recipe 2
servings: 4-6 servings
source: Food Network
source_url: http://www.google.com
prep_time: 10 min
cook_time: 30 min
nutritional_info: 500 calories
difficulty: Easy
rating: 5
photo: | (base-64 encoded image)
notes: |
This is delicious!!!
ingredients: |
1/2 lb meat
1/2 lb vegetables
salt
pepper
2 tbsp olive oil
4 cups flour
directions: |
Mix things together.
Eat.
Tasty.
Yum yum yum.

Some notes on what I’ve got done and what needs to be done:
Done:
Replaced Key Values
Added a pipe for multi-line values
Moved Information for keys that don’t exist in new format to “notes” section
Converted tabs to spaces

To Do:
fix leading whitespace on each line
remove all colons that don’t follow keys
remove lines that only contain blank characters
?
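For the "remove lines that only contain blank characters" item, here is a hedged sketch in the same ASObjC style used earlier in the thread. The handler name remove_blank_lines is hypothetical, "blank" is assumed to mean spaces and tabs only, and a whitespace-only last line with no line ending after it is left in place.

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: delete lines that are empty or hold only spaces/tabs.
on remove_blank_lines(theText)
	set theString to current application's NSString's stringWithString:theText
	-- Match optional spaces/tabs from a line start up to and including the line ending.
	set thePattern to "(?m)^[ \\t]*(?:\\r\\n|\\n|\\r)"
	set theString to theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
	return theString as text
end remove_blank_lines
```

Whether this is wanted inside the "directions" block depends on whether paragraph breaks there should survive the conversion.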

There’s a handy YAML online parser to validate the format here:
http://yaml-online-parser.appspot.com

Anyway, if anyone feels like taking a crack at it for fun, please do. Otherwise I’ll finish it when I get back.

Thanks,

t.spoon

… well, that’s a first. Macscripter wouldn’t let me post my script because “it’s too many bytes.”

Because it includes the recipe database… still, it’s under a megabyte.

Here it is:

https://www.dropbox.com/s/lx6j03bftrnhoc2/Recipe%20Converter%20for%20Sharing.scpt?dl=0

Shane_Stanley · July 13, 2019, 12:17am

Yes, that’s quite a hole

CK11 · July 13, 2019, 7:48am

Just offering what is actually a refactoring of the OP’s original handler aiming to reduce execution time with some—by now—fairly well known AppleScript techniques plus a reduction in redundant/repeat operations:

to trimlines by phrase as text out of input as text from place : left
		local input, phrase, place
		
		if the phrase is not in the input then return the input
		set N to 1 + the (phrase's length)
		
		set a to (place contains left) as integer
		set b to ((place contains right) as integer) * 2
		set i to a + b + 1
		
		script sentences
				property list : the input's paragraphs
		end script
		
		script trim
				on _L(str)
						if str starts with the phrase then ¬
								return str's text N thru -1
						str
				end _L
				
				on _R(str)
						if str ends with the phrase then ¬
								return str's text 1 thru -N
						str
				end _R
				
				on _LR(str)
						tell trim to _L(_R(str))
				end _LR
		end script
		
		script fn
				property func : item i of trim's {_L, _L, _R, _LR}
		end script
		
		repeat with sentence in (a reference to the list of sentences)
				set {untrimmed, trimmed} to {the sentence's contents, ""}
				if phrase ≠ untrimmed then set trimmed to fn's func(untrimmed)
				set the sentence's contents to trimmed
		end repeat
		
		set text item delimiters to linefeed
		return the list of sentences as text
end trimlines

An example call to this handler might be:

trimlines by "some text" out of "some text here.
then some text there.
and also some text.
with some text
but not until
some text is there." from the left & right

which hopefully lets one deduce the nature of each parameter quite easily, but for clarity: the from parameter could also be just one of those AppleScript constants, i.e. from the left or from the right, and will default to left if the parameter is omitted or invalid. The other two parameters are, of course, mandatory. The above call outputs:

" here.
then some text there.
and also some text.
with
but not until
 is there."

NB. A call to this handler as it stands will result in text item delimiters being set to linefeed. Some people prefer resetting as they go: I always set them immediately prior to any list → text coercion; adapt as you prefer.

I haven’t benchmarked its performance nor that of the original.

AppleScript: 2.7
Operating System: macOS 10.13

Nigel_Garvey · July 13, 2019, 10:53am

CK:

to trimlines by phrase as text out of input as text from place : left
	local input, phrase, place
	
	set a to (place contains left) as integer
	set b to ((place contains right) as integer) * 2
	set i to a + b + 1
	
	set N to 1 + the (phrase's length)
	if N = 0 then return the input
	
	script sentences
		property list : the input's paragraphs
	end script

Hmmm. An interesting abuse of the reserved terms left, right, and list. :rolleyes: And I’ve not seen that method of implementing optional parameters before. Apparently it was introduced in OS X Yosemite (10.10) and only works for handlers with labelled parameters. I’ve learned something new today! Fooling around with it, it seems that while any or all of the parameters may be optional, at least one labelled value must be passed in the call to the handler or the handler will simply be returned rather than executed. If all the parameters are optional, the one or more labels used in the call don’t have to match any of those in the handler definition! (Not in Mojave, anyway.)

on fred from c : "Hello" to d : 5
	return c
end fred

fred by missing value under "aardvark" --> "Hello"

Shane_Stanley · July 13, 2019, 11:57am

And that includes parameters defined in a scripting dictionary (which is why it was introduced).

Nigel_Garvey · July 14, 2019, 8:14am

Hi Shane. Thanks for the background.

I’ve just been comparing the speeds of the three handlers above which return the same results: ie. the TIDs method (post #6), the revised ASObjC (post #8), and CK’s optimised repeat (post #11).

On my iMac, they’re all practically the same speed with up to sixteen lines or so. With longer texts, they’re still pretty much the same for user experience, but the ASObjC handler consistently times as the fastest of the three (by a few milliseconds) and the optimised repeat as the faster of the two vanilla methods (ditto, by how much depending on how much needs to be done). But there are a couple of exceptions:
• If no trimming actually occurs, the TIDs handler is suddenly faster than the optimised repeat.
• If a line in the text exactly equals the “phrase” to be cut, the optimised repeat errors.
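The thread doesn't show how these timings were taken. As a rough sketch of one way to time a handler, assuming NSDate for the clock: work_under_test and time_it are hypothetical names, and the stand-in body should be replaced with a call to whichever handler is being measured.

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical stand-in for the handler being measured.
on work_under_test(theText)
	return (count paragraphs of theText)
end work_under_test

-- Returns the average number of seconds per call over the given number of runs.
on time_it(theText, runs)
	set startDate to current application's NSDate's |date|()
	repeat runs times
		work_under_test(theText)
	end repeat
	-- timeIntervalSinceNow() is negative for a date in the past, hence the negation.
	return -(startDate's timeIntervalSinceNow()) / runs
end time_it
```

Averaging over several runs smooths out the millisecond-level noise that matters when the candidates are as close as described above.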

Marc_Anthony · July 14, 2019, 3:37pm

Hi. I’m not sure that I see the initially posted handler’s—or its descendants’—utility for your situation. Looking at your dropbox file, there are swaths of text with inconsistencies. I would move the project into a grep-aware text editor, such as TextWrangler; this provides an opportunity to learn regular expressions, easily preview changes, and correct spelling and other composition issues. This should get you started:

tell application "TextWrangler"
	tell document 1
		repeat with this in {{"SOURCE:", "source:"}, {"CATEGORIES:", "categories:"}, {"PREHEAT:", "Preheat to"}, {"METHOD:", "directions: |"}, {"INGREDIENTS:", "ingredients: |"}, {"MAKES:", "Makes"}, {"RECIPE:", "- name:"}, {"¼", "1/4"}, {"½", "1/2"}, {"¾", "3/4"}}
			replace (this's item 1) using (this's item 2) searching in it options {starting at top:1, returning results:0} saving no
		end repeat
		
		#GREP conversions
		--potential whitespace followed by returns to return
		--tabs to space
		--lines beginning with equals with potential text to return
		--word spaces to space
		repeat with this in {{"[[:space:]]*\\r+", "\\r"}, {"\\t+", space}, {"^={2,}.*", "\\r"}, {"[ ]{2,}", space}}
			replace (this's item 1) using (this's item 2) searching in it options {search mode:grep, starting at top:1, returning results:0} saving no
		end repeat
		
		activate
	end tell
end tell

–edited for typos

CK11 · July 15, 2019, 3:26pm

Thanks for doing this, Nigel.

Fixed. Thank you for spotting this.

Added initial check for this special case.

The above two amendments have been applied to the “optimised repeat” handler (trimlines) and the original post (post #11) has been edited to reflect these changes.