I stayed stuck in an old discontinued Recipe database for too long (Yum!)… I can’t find anything that imports its files… and it doesn’t seem to export as anything that other recipe managers I’ve looked at want to import.
So… lots of text processing. You people here who are aces with RegEx probably could have knocked out this project in 10 minutes… heck, I’m starting to think I could have done it faster if I’d just learned RegEx. Anyway, if all you have is a hammer…
So, one of my functions is this:
-- "position" should be "front," "back," or "both" to specify whether trimming the fronts or backs of each line
on truncate_lines_if(theText, truncateText, position)
    set oRet to "
"
    set theOutput to ""
    set charCount to count of characters of truncateText
    if position is "front" or position is "both" then
        repeat with aLine in paragraphs of theText
            if aLine starts with truncateText then
                set newLine to text (charCount + 1) through end of aLine
            else
                set newLine to aLine
            end if
            set theOutput to theOutput & newLine & oRet
        end repeat
        return theOutput
    else if position is "back" or position is "both" then
        repeat with aLine in paragraphs of theText
            if aLine ends with truncateText then
                set lineCount to (count of characters of contents of aLine)
                set newLine to text 1 through (lineCount - charCount) of aLine
            else
                set newLine to aLine
            end if
            set theOutput to theOutput & newLine & oRet
        end repeat
    end if
    return theOutput
end truncate_lines_if
I was already doing about a hundred manipulations on the whole database and my script run time was 0.18 seconds… adding a single call to this function took the run time up to half a minute. Which would be OK, but I plan to call this a dozen times… and it would still be fine if I were only doing that once, but this database conversion program is very trial-and-error-y: I run, check for problems, change code, repeat.
So, gurus, how do I make this function fast? I figure probably with TIDs, but since I sometimes need to leave a final or initial instance, and there could be an arbitrary number of instances in the text I don’t need to remove, I didn’t instantly see how.
First, your if construction is problematic – for a “both” value, the first branch matches and returns, so the else if part is never reached and the ends of the lines are never trimmed.
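To illustrate the point about the if construction only, here is a minimal sketch that keeps the original line-by-line approach but uses two independent if tests, so a “both” value passes through each of them. It is not tuned for speed. The names theParagraphs and outputLines are made up for the example, and unlike the original it joins the lines with linefeeds rather than appending a return after every line.

```applescript
on truncate_lines_if(theText, truncateText, position)
	set charCount to (count truncateText)
	set theParagraphs to paragraphs of theText
	set outputLines to {}
	repeat with aLine in theParagraphs
		set newLine to contents of aLine
		-- Front trim: an independent test, so "both" is handled here...
		if (position is in {"front", "both"}) and (newLine starts with truncateText) then
			if (count newLine) > charCount then
				set newLine to text (charCount + 1) thru -1 of newLine
			else
				set newLine to ""
			end if
		end if
		-- ...and again here for the back trim.
		if (position is in {"back", "both"}) and (newLine ends with truncateText) then
			if (count newLine) > charCount then
				set newLine to text 1 thru -(charCount + 1) of newLine
			else
				set newLine to ""
			end if
		end if
		set end of outputLines to newLine
	end repeat
	-- Join the lines, restoring the delimiters afterwards.
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to linefeed
	set theOutput to outputLines as text
	set AppleScript's text item delimiters to astid
	return theOutput
end truncate_lines_if
```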
Here’s a regex handler:
use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions
on truncate_lines_if(theText, truncateText, position)
    set escapedPattern to (current application's NSRegularExpression's escapedPatternForString:truncateText) as text
    if position is "front" then
        set finalPattern to "(?m)^" & escapedPattern
        set repl to ""
    else if position is "back" then
        set finalPattern to "(?m)" & escapedPattern & "$"
        set repl to ""
    else
        set finalPattern to "(?m)^" & escapedPattern & "(.*)" & escapedPattern & "$"
        set repl to "$1"
    end if
    set theText to current application's NSString's stringWithString:theText
    set theText to theText's stringByReplacingOccurrencesOfString:finalPattern withString:repl options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
    return theText as string
end truncate_lines_if
If you’re actually trimming whitespace, this can be done more quickly and simply.
Why reinvent the wheel? Here I present the ready code from Apple:
on trimText(theText, theCharactersToTrim, theTrimDirection)
    set theTrimLength to length of theCharactersToTrim
    if theTrimDirection is in {"beginning", "both"} then
        repeat while theText begins with theCharactersToTrim
            try
                set theText to characters (theTrimLength + 1) thru -1 of theText as string
            on error
                -- text contains nothing but trim characters
                return ""
            end try
        end repeat
    end if
    if theTrimDirection is in {"end", "both"} then
        repeat while theText ends with theCharactersToTrim
            try
                set theText to characters 1 thru -(theTrimLength + 1) of theText as string
            on error
                -- text contains nothing but trim characters
                return ""
            end try
        end repeat
    end if
    return theText
end trimText
Because it’s an ancient wheel that wasn’t very efficient even in its day, and because it’s actually doing something different from the OP’s handler anyhow (it’s not working on paragraphs, and it’s removing repeatedly).
Here’s a TIDs version for fun. Like t.spoon’s original, it enforces a line ending character in the output that’s set in the script. Like Shane’s script, it can be adapted to leave final or initial instances as required. It differs from Shane’s in its treatment of overlapping instances — say a line has three spaces and two spaces have to be removed from both the beginnings and ends of lines. Shane’s leaves the line alone; this deletes only the first two spaces. The required action in this case needs to be defined if the situation’s likely to arise:
on truncate_lines_if(theText, truncateText, position)
    set truncateTextLength to (count truncateText)
    set oRet to "
"
    set astid to AppleScript's text item delimiters
    if ((position is "front") or (position is "both")) then
        -- Replace any instance of CRLF, LF, or CR followed by truncateText with the line ending set in oRet.
        set AppleScript's text item delimiters to {return & linefeed & truncateText, linefeed & truncateText, return & truncateText}
        set textItems to theText's text items
        set AppleScript's text item delimiters to oRet
        set theText to textItems as text
        -- Remove truncateText from the beginning of theText if it occurs there.
        if (theText begins with truncateText) then
            if ((count theText) > truncateTextLength) then
                set theText to text (truncateTextLength + 1) thru -1 of theText
            else
                set theText to ""
            end if
        end if
    end if
    if ((position is "back") or (position is "both")) then
        -- Replace any instance of truncateText followed by CRLF, LF, or CR with the line ending set in oRet.
        set AppleScript's text item delimiters to {truncateText & return & linefeed, truncateText & linefeed, truncateText & return}
        set textItems to theText's text items
        set AppleScript's text item delimiters to oRet
        set theText to textItems as text
        -- Remove truncateText from the end of theText if it occurs there.
        if (theText ends with truncateText) then
            if ((count theText) > truncateTextLength) then
                set theText to text 1 thru -(truncateTextLength + 1) of theText
            else
                set theText to ""
            end if
        end if
    end if
    -- Restore the delimiters saved at the top of the handler.
    set AppleScript's text item delimiters to astid
    return theText
end truncate_lines_if
Ah yes. The TIDs method can be made case sensitive by enclosing the call to the handler in a ‘considering case’ statement. The regex can be made case insensitive by employing some method to change the “(?m)”s to “(?im)” or “(?mi)”.
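As a small illustration of the ‘considering case’ point, this sketch counts delimiter matches twice, once under the script's default attributes and once inside a ‘considering case’ statement. The sample text and variable names are invented for the example. If the delimiters respect case as described, the two counts will differ.

```applescript
set sampleText to "abc ABC abc"
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "abc"
-- Count matches under whatever case attribute is currently in force.
set defaultCount to (count text items of sampleText) - 1
-- Count again with case explicitly considered.
considering case
	set strictCount to (count text items of sampleText) - 1
end considering
set AppleScript's text item delimiters to astid
return {defaultCount, strictCount}
```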
I’ve just noticed that your “both” regex only works if there are indeed instances of truncateText at both ends of the line. An alternative would be to use an OR pattern there, but then the “overlap” behaviour would be the same as with the TIDs handler:
use AppleScript version "2.4" -- Mac OS 10.10 or later
use framework "Foundation"
use scripting additions
on truncate_lines_if(theText, truncateText, position)
    -- Set case insensitivity or sensitivity according to the current situation in AS and "^" and "$" to match the beginnings and ends of lines.
    if ("A" = "a") then
        set flagOptions to "(?im)"
    else
        set flagOptions to "(?m)"
    end if
    set escapedPattern to (current application's NSRegularExpression's escapedPatternForString:truncateText) as text
    if position is "front" then
        set finalPattern to flagOptions & "^" & escapedPattern
    else if position is "back" then
        set finalPattern to flagOptions & escapedPattern & "$"
    else
        set finalPattern to flagOptions & "^" & escapedPattern & "|" & escapedPattern & "$" -- OR regex pattern.
    end if
    set theText to current application's NSString's stringWithString:theText
    -- Replace any and every match with "".
    set theText to theText's stringByReplacingOccurrencesOfString:finalPattern withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
    return theText as string
end truncate_lines_if
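The “overlap” case mentioned earlier (a line of three spaces, with two spaces to come off each end) can be checked directly. This is only a sketch with the two patterns written out by hand for a two-space truncateText; the variable names are made up. The paired pattern needs at least four characters to match, so it leaves the line alone, while the OR pattern removes the first two spaces and then finds nothing left to match at the end.

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

set sampleLine to current application's NSString's stringWithString:"   " -- three spaces
set wholeRange to {0, sampleLine's |length|()}
set searchOption to current application's NSRegularExpressionSearch
-- Paired pattern: two spaces, anything, two spaces.
set pairResult to (sampleLine's stringByReplacingOccurrencesOfString:"(?m)^  (.*)  $" withString:"$1" options:searchOption range:wholeRange) as text
-- OR pattern: two spaces at the start, or two spaces at the end.
set orResult to (sampleLine's stringByReplacingOccurrencesOfString:"(?m)^  |  $" withString:"" options:searchOption range:wholeRange) as text
return {count pairResult, count orResult} --> {3, 1}
```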
Thanks so much everybody. Thanks for catching my stupid error on the “if,” Shane.
For my use, I really only needed to scrape text off the beginning, I’m just in the {good? bad?} habit of trying to make handlers a little more general purpose as I write them.
Nigel, thanks - don’t know what was wrong with me that it didn’t occur to me to consider the return character combined with what I’m finding as the TIDs instead of processing it line-by-line. It seems so obvious as soon as I see it. I really was too tired to be coding at the time. Also, didn’t know I could use “linefeed” to get the “other return character.”
Anyway, using your handler. Handler run-time went from ≈30 seconds to 0.04 seconds. Can’t complain about that.
I’d been meaning to migrate my recipe database to something modern forever. I’m going on vacation tomorrow and didn’t even want to take my laptop, then thought “crap, if I don’t take my laptop, I won’t have my recipes.” Of course, I could just take the exported text file of the database on my phone… clunky but functional. But like an idiot, I thought “eh, it won’t take me too long to harass this text dump into YAML formatting and import it.”
Whenever there’s a question on here about text formatting, it seems like even after there’s a perfectly good working solution posted, there’s a pile-on of additional great solutions. Which makes me think you guys must just be doing this for the fun of it.
So, while I’m sure I can beat this thing into shape on my own after I get back from vacation, just in case anybody else feels like finishing my work for me, I thought I’d post the problem and what I’ve got. You RegEx people who can do this entire conversion in a single line of endless symbols, please don’t laugh at my clunky code.
Aside from posting this here in case anyone else has a similar problem, I thought I’d also send off the final version to the developer just in case he wants to add it to his software to welcome any other Yum! migrants.
So, I’m going from Yum! Recipe Manager to Paprika.
Here’s a sample of two recipes worth of what Yum! spits out on export:
And here’s what Paprika wants as YAML input:
Some notes on what I’ve got done and what needs to be done:
Done:
Replaced Key Values
Added a pipe for multi-line values
Moved Information for keys that don’t exist in new format to “notes” section
Converted tabs to spaces
To Do:
fix leading whitespace on each line
remove all colons that don’t follow keys
remove lines that only contain blank characters
?
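For the “remove lines that only contain blank characters” item, one possible approach is another regex replacement along the lines of the handlers earlier in the thread. This is a rough sketch, not a tested part of the conversion: the handler name removeBlankLines is made up, “blank” is assumed to mean spaces and tabs only, and the text is assumed to use linefeed line endings. A whitespace-only last line with no linefeed after it would be left in place.

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: deletes lines made up of nothing but spaces and tabs.
on removeBlankLines(theText)
	set theString to current application's NSString's stringWithString:theText
	-- (?m) lets ^ match at the start of every line; the line's linefeed is removed with it.
	set thePattern to "(?m)^[ \\t]*\\n"
	set theString to theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
	return theString as text
end removeBlankLines

-- Two keyed lines separated by a spaces-only line and an empty line.
removeBlankLines("name: Soup" & linefeed & "   " & linefeed & linefeed & "notes: Serve hot")
```

With that sample the two keyed lines should come back separated by a single linefeed.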
Just offering what is actually a refactoring of the OP’s original handler aiming to reduce execution time with some—by now—fairly well known AppleScript techniques plus a reduction in redundant/repeat operations:
to trimlines by phrase as text out of input as text from place : left
    local input, phrase, place
    if the phrase is not in the input then return the input
    set N to 1 + the (phrase's length)
    set a to (place contains left) as integer
    set b to ((place contains right) as integer) * 2
    set i to a + b + 1
    script sentences
        property list : the input's paragraphs
    end script
    script trim
        on _L(str)
            if str starts with the phrase then ¬
                return str's text N thru -1
            str
        end _L
        on _R(str)
            if str ends with the phrase then ¬
                return str's text 1 thru -N
            str
        end _R
        on _LR(str)
            tell trim to _L(_R(str))
        end _LR
    end script
    script fn
        property func : item i of trim's {_L, _L, _R, _LR}
    end script
    repeat with sentence in (a reference to the list of sentences)
        set {untrimmed, trimmed} to {the sentence's contents, ""}
        if phrase ≠ untrimmed then set trimmed to fn's func(untrimmed)
        set the sentence's contents to trimmed
    end repeat
    set text item delimiters to linefeed
    return the list of sentences as text
end trimlines
An example call to this handler might be:
trimlines by "some text" out of "some text here.
then some text there.
and also some text.
with some text
but not until
some text is there." from the left & right
which hopefully lets one deduce the nature of each parameter quite easily, but for clarity: the from parameter could also be just one of those AppleScript constants, i.e. from the left or from the right, and will default to left if the parameter is omitted or invalid. The other two parameters are, of course, mandatory. The above call outputs:
" here.
then some text there.
and also some text.
with 
but not until
 is there."
NB. A call to this handler as it stands will result in text item delimiters being set to linefeed. Some people prefer resetting as they go: I always set them immediately prior to any list → text coercion; adapt as you prefer.
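For anyone who prefers the resetting style, a minimal sketch of the save-and-restore pattern around a list → text coercion looks like this (the list contents and variable names are just for illustration):

```applescript
set theList to {"one", "two", "three"}
-- Remember whatever the delimiters were before.
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set joinedText to theList as text
-- Put them back so later code is unaffected.
set AppleScript's text item delimiters to astid
return joinedText
```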
I haven’t benchmarked its performance nor that of the original.
Hmmm. An interesting abuse of the reserved terms left, right, and list. :rolleyes: And I’ve not seen that method of implementing optional parameters before. Apparently it was introduced in OS X Yosemite (10.10) and only works for handlers with labelled parameters. I’ve learned something new today! Fooling around with it, it seems that while any or all of the parameters may be optional, at least one labelled value must be passed in the call to the handler or the handler will simply be returned rather than executed. If all the parameters are optional, the one or more labels used in the call don’t have to match any of those in the handler definition! (Not in Mojave, anyway.)
on fred from c : "Hello" to d : 5
    return c
end fred
fred by missing value under "aardvark" --> "Hello"
I’ve just been comparing the speeds of the three handlers above which return the same results: ie. the TIDs method (post #6), the revised ASObjC (post #8), and CK’s optimised repeat (post #11).
On my iMac, they’re all practically the same speed with up to sixteen lines or so. With longer texts, they’re still pretty much the same for user experience, but the ASObjC handler consistently times as the fastest of the three (by a few milliseconds) and the optimised repeat as the faster of the two vanilla methods (ditto, by how much depending on how much needs to be done). But there are a couple of exceptions:
• If no trimming actually occurs, the TIDs handler is suddenly faster than the optimised repeat.
• If a line in the text exactly equals the “phrase” to be cut, the optimised repeat errors.
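For anyone wanting to repeat this sort of comparison, a rough timing harness could look like the sketch below. It is not the method used for the figures above. handlerUnderTest is a hypothetical stand-in to be replaced by a call to whichever handler is being measured, and the test text and iteration count are arbitrary.

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical stand-in for the handler being timed.
on handlerUnderTest(theText)
	return paragraphs of theText
end handlerUnderTest

on secondsFor(testText, iterations)
	set startDate to current application's NSDate's |date|()
	repeat iterations times
		handlerUnderTest(testText)
	end repeat
	-- timeIntervalSinceNow is negative for a date in the past, hence the sign flip.
	return -((startDate's timeIntervalSinceNow()) as real)
end secondsFor

-- Build a multi-line test text.
set testLines to {}
repeat with i from 1 to 1000
	set end of testLines to "some text line " & i
end repeat
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set testText to testLines as text
set AppleScript's text item delimiters to astid

return secondsFor(testText, 100)
```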
Hi. I’m not sure that I see the initially posted handler’s—or its descendants’—utility for your situation. Looking at your dropbox file, there are swaths of text with inconsistencies. I would move the project into a grep-aware text editor, such as TextWrangler; this provides an opportunity to learn regular expressions, easily preview changes, and correct spelling and other composition issues. This should get you started:
tell application "TextWrangler"
    tell document 1
        repeat with this in {{"SOURCE:", "source:"}, {"CATEGORIES:", "categories:"}, {"PREHEAT:", "Preheat to"}, {"METHOD:", "directions: |"}, {"INGREDIENTS:", "ingredients: |"}, {"MAKES:", "Makes"}, {"RECIPE:", "- name:"}, {"¼", "1/4"}, {"½", "1/2"}, {"¾", "3/4"}}
            replace (this's item 1) using (this's item 2) searching in it options {starting at top:1, returning results:0} saving no
        end repeat
        #GREP conversions
        --potential whitespace followed by returns to return
        --tabs to space
        --lines beginning with equals with potential text to return
        --word spaces to space
        repeat with this in {{"[[:space:]]*\\r+", "\\r"}, {"\\t+", space}, {"^={2,}.*", "\\r"}, {"[ ]{2,}", space}}
            replace (this's item 1) using (this's item 2) searching in it options {search mode:grep, starting at top:1, returning results:0} saving no
        end repeat
        activate
    end tell
end tell
The above two amendments have been applied to the “optimised repeat” handler (trimlines) and the original post (post #11) has been edited to reflect these changes.