I need to delete lines containing specific strings from an XML file that is generated every 10 minutes. While I could process it through TextWrangler, I feel this shouldn’t be necessary. I’m already using a search-and-replace subroutine and was hoping to find something comparable to delete these unneeded lines. Any ideas?
I wrote you a handler called deleteLinesFromText that will do this. Just feed it the text and the phrase you want, and all lines containing that phrase will be deleted. So you should read in the XML file, send it through the handler, then write the results back to the XML file. An XML file is just a text file, so you can read and write it from AppleScript as normal text. You can find information on this website about reading and writing text files.
set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."
-- this is the phrase we will check against in the fileText
set deletePhrase to "third line"
deleteLinesFromText(fileText, deletePhrase)
on deleteLinesFromText(theText, deletePhrase)
set newText to ""
try
-- here's how you can delete all lines of text from fileText that contain the deletePhrase.
-- first turn the text into a list so you can repeat over each line of text
set textList to paragraphs of theText
-- now repeat over the list and ignore lines that have the deletePhrase
repeat with i from 1 to count of textList
set thisLine to item i of textList
if thisLine does not contain deletePhrase then
set newText to newText & thisLine & return
end if
end repeat
if newText is not "" then set newText to text 1 thru -2 of newText
on error
set newText to theText
end try
return newText
end deleteLinesFromText
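The read, filter, and write-back round trip described above might be sketched as follows. This is only a sketch under stated assumptions: filterFile is a hypothetical wrapper name, the file is assumed to be UTF-8, and the compact deleteLinesFromText shown here is a stand-in that, like the handler above, rejoins the kept lines with return characters.

```applescript
-- Hypothetical wrapper: read a UTF-8 text file, drop matching lines, write the rest back.
on filterFile(posixPath, deletePhrase)
	set fileRef to open for access (POSIX file posixPath) with write permission
	try
		set fileText to read fileRef as «class utf8»
		set newText to deleteLinesFromText(fileText, deletePhrase)
		set eof of fileRef to 0 -- empty the file before writing
		write newText to fileRef as «class utf8»
		close access fileRef
	on error errMsg number errNum
		close access fileRef -- always release the file
		error errMsg number errNum
	end try
	return newText
end filterFile

-- Compact stand-in for the handler above: keep only lines that lack the phrase.
on deleteLinesFromText(theText, deletePhrase)
	set keptLines to {}
	set textList to paragraphs of theText
	repeat with i from 1 to (count textList)
		set thisLine to item i of textList
		if thisLine does not contain deletePhrase then set end of keptLines to thisLine
	end repeat
	set oldTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to return
	set newText to keptLines as text
	set AppleScript's text item delimiters to oldTIDs
	return newText
end deleteLinesFromText
```

A call would look like filterFile("/Users/you/Documents/feed.xml", "third line"), where the path is a placeholder. If the consumer of the XML file expects linefeeds, the return delimiter in the stand-in handler could be swapped for linefeed.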
Hank’s script is excellent for removing specific text.
However, depending on your needs (e.g., text in a specific location within a line, or text associated with other text on the same line) you may need to use regular expressions to more flexibly extract or remove matching text lines.
To search with regular expressions, I would recommend the use of the free TextWrangler (or its commercial cousin, BBEdit), the free Satimage osaxen, the commercial TextSoap, or the UNIX tools (grep, sed and awk) contained within OSX. All of these tools have their relative strengths, but one thing they have in common is they are all accessible from AppleScript. They all also work with Lion.
Regular expressions are incredibly flexible and powerful at identifying matching patterns of text. They can match different criteria in the same text line (akin to a logical OR), as an example. They are also very fast! A good primer on “regex’s” is contained within TextWrangler’s/BBEdit’s documentation.
Here is an edited version of Hank’s script.
It doesn’t create a new list but works with a single one.
I don’t know if it’s faster but I guess that it’s more efficient in terms of memory use.
set fileText to "Here's some text in a file
It contains a few lines of text.
Hello happy tax payers.
and here is a fake fourth line
Hello angry tax payers too
And finally a sixth line of text."
-- this is the phrase we will check against in the fileText
set deletePhrase to "tax payer"
deleteLinesFromText(fileText, deletePhrase)
on deleteLinesFromText(theText, deletePhrase)
try
-- here's how you can delete all lines of text from fileText that contain the deletePhrase.
-- first turn the text into a list so you can repeat over each line of text
set textList to paragraphs of theText
-- now repeat over the list and ignore lines that have the deletePhrase
set j to 0
repeat with i from 1 to count of textList
set thisLine to item i of textList
if thisLine does not contain deletePhrase then
set j to j + 1
set item j of textList to thisLine
end if
end repeat
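-- if every line contained the phrase, nothing is left to rejoin
if j = 0 then
set theText to ""
else if j < i then
set theText to my recolle(items 1 thru j of textList, return)
end if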
end try
return theText
end deleteLinesFromText
--=====
on recolle(l, d)
local oTIDs, t
set oTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to d
set t to l as text
set AppleScript's text item delimiters to oTIDs
return t
end recolle
--=====
Yvan KOENIG (VALLAURIS, France) Saturday, 7 January 2012, 17:24:37
I was going to weigh in with some suggestions this morning, but changed my mind because Hank had produced a simple script which does exactly what the original poster wanted and in a way the OP could probably understand.
With a short text, no efficiency measures will make any noticeable difference. With a substantial text, there are one or two things you could do. For instance, if (as seems likely) there are only a few instances of the offending phrase in the text, it would be better in the repeat to act only when the line does contain the phrase rather than when it doesn’t. The concatenations could also be saved for a single mass list-to-text coercion at the end:
set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."
-- this is the phrase we'll check against in the fileText
set deletePhrase to "third line"
deleteLinesFromText(fileText, deletePhrase)
on deleteLinesFromText(theText, deletePhrase)
-- here's how you can delete all lines of text from fileText that contain the deletePhrase.
-- first turn the text into a list so you can repeat over each line of text
set textList to paragraphs of theText
-- now repeat over the list and replace lines that have the deletePhrase with 'missing values'.
repeat with i from 1 to (count textList)
if (item i of textList contains deletePhrase) then set item i of textList to missing value
end repeat
-- Coerce the paragraphs which are left to a single text using return delimiters.
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to return
set newText to textList's text as text
set AppleScript's text item delimiters to astid
return newText
end deleteLinesFromText
Other approaches include reducing the number of concatenations by concatenating entire clear sections rather than individual paragraphs:
set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."
-- this is the phrase we'll check against in the fileText
set deletePhrase to "third line"
deleteLinesFromText(fileText, deletePhrase)
on deleteLinesFromText(theText, deletePhrase)
set newText to ""
-- here's how you can delete all lines of text from fileText that contain the deletePhrase.
-- first turn the text into a list so you can repeat over each line of text
set textList to paragraphs of theText
-- now repeat over the list and concatenate the chunks between the lines containing the deletePhrase.
set i to 1
repeat with j from 1 to (count textList)
if (item j of textList contains deletePhrase) then
if (j > i) then set newText to newText & text from paragraph i to paragraph (j - 1) of theText & return
set i to j + 1
end if
end repeat
if (j ≥ i) then
set newText to newText & text from paragraph i to paragraph j of theText
else if (newText ends with return) then
set newText to text 1 thru -2 of newText
end if
return newText
end deleteLinesFromText
Or you could give the repeat fewer iterations by splitting the text at the phrase instances rather than at the line ends:
set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."
-- this is the phrase we'll check against in the fileText
set deletePhrase to "third line"
deleteLinesFromText(fileText, deletePhrase)
on deleteLinesFromText(theText, deletePhrase)
set newText to {}
-- Get the text items of the text using the deletePhrase as a delimiter.
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to deletePhrase
set textItems to theText's text items
set textItemCount to (count textItems)
if (textItemCount > 1) then
-- The phrase was in the text. Collect the text items except for the bits of the paragraphs it was in.
tell beginning of textItems to if ((count each paragraph) > 1) then set end of newText to text 1 thru paragraph -2
repeat with i from 2 to textItemCount - 1
tell item i of textItems to if ((count each paragraph) > 1) then set end of newText to text from paragraph 2 to paragraph -2
end repeat
tell end of textItems to if ((count each paragraph) > 1) then set end of newText to text from paragraph 2 to -1
set AppleScript's text item delimiters to return
set newText to newText as text
else
-- The phrase is not in the text.
set newText to theText
end if
set AppleScript's text item delimiters to astid
return newText
end deleteLinesFromText
Here is Hank’s original script, modified to use grep (a built-in Unix tool), which would allow the use of regular expressions to identify lines to delete.
Note that the environment variable needs to be passed to the “do shell script” in order that the grep command can be run. The shell which is invoked by the do shell script doesn’t inherit the environment variables which populate the shell used by the Terminal program.
Also, the UNIX environment does better with linefeeds in its data; therefore, the subroutine changes return characters to linefeeds, prior to processing the do shell script command.
The strategy behind the shell command is to echo the text and pipe the output to the grep command. This should work for any variable containing text. The command needs to be altered to work with files, but that is fairly trivial.
Consider changing the “third” in the following line of the script:
set deletePhrase to "third"
to any of the following:
“a.*?line” - to remove lines containing “a … line”
“^#” - to remove lines beginning with a #
Other regex’s (regular expressions) should also work.
set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
A line with the # (pound) symbol.
And finally a last line of text."
-- this is the phrase we will check against in the fileText
set deletePhrase to "third"
set theAns to deleteLinesFromText(fileText, deletePhrase)
on deleteLinesFromText(theText, deletePhrase)
set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:"
set theText to switchText from theText to linefeed instead of return
set newText to ""
try
set theCmd to "echo " & quoted form of theText & " | grep -Ev " & quoted form of deletePhrase & ""
set newText to do shell script theEnv & "; " & theCmd
on error
set newText to "ERROR: " & theText
end try
return newText
end deleteLinesFromText
to switchText from t to r instead of s
set d to text item delimiters
set text item delimiters to s
set t to t's text items
set text item delimiters to r
tell t to set t to item 1 & ({""} & rest)
set text item delimiters to d
t
end switchText
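For the file case mentioned above, one possible adaptation is to hand grep the file path instead of echoing the text. The handler name deleteLinesFromFile and its parameters are assumptions made for this sketch. It calls grep by its usual OS X location, /usr/bin/grep, so it does not depend on PATH.

```applescript
-- Hypothetical handler: return the lines of a file that do NOT match the pattern.
on deleteLinesFromFile(posixPath, deletePattern)
	-- -E: extended regular expressions, -v: invert the match, -e: the next argument is the pattern
	set theCmd to "/usr/bin/grep -Ev -e " & quoted form of deletePattern & " " & quoted form of posixPath
	try
		-- by default do shell script converts the line endings of the output to returns
		return do shell script theCmd
	on error errMsg number errNum
		-- grep exits with status 1 when it selects no lines, i.e. every line matched
		if errNum is 1 then return ""
		error errMsg number errNum
	end try
end deleteLinesFromFile
```

A call would look like deleteLinesFromFile("/Users/you/Documents/feed.xml", "^#"), with a placeholder path. Because the text never passes through echo, no conversion of returns to linefeeds is needed before the call.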
Nigel’s post is wonderfully thorough and logical. Moreover, it stays within AppleScript and doesn’t use any other software. I provide my solution only for those individuals looking to use regular expressions in their search criteria. It uses the grep available as part of OSX.
Thanks for this powerful script, but I am like an umbrella in front of a sewing machine.
I really don’t understand what the theEnv piece of code is doing, or why it is triggered twice.
Yvan KOENIG (VALLAURIS, France) Saturday, 7 January 2012, 21:06:08
That was my mistake, which I’ve edited in the post above. Similarly, I’ve moved the conversion of returns to linefeeds into the subroutine.
theEnv is used to set the environment of the “do shell script” call. In OSX, the sh shell is called by default, when the do shell script is used in AppleScript. However, do shell script doesn’t use the environment variables of the shell used by the Terminal program. By prefixing the command in theEnv to any other commands used during the do shell script call, the environment variables of that particular call are set.
Note that theEnv contains the paths to the folders which contain the system UNIX executables (change this to fit your system’s paths). PATH is the environment variable of the sh shell which is launched by the do shell script command.
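To see what the prefix changes, a small experiment along these lines can be run in the script editor. It is a sketch only, and the folder list in theEnv is an example rather than a recommendation for any particular system.

```applescript
-- PATH as seen by a bare do shell script call.
set defaultPath to do shell script "echo \"$PATH\""

-- PATH after prefixing an export, as theEnv does. The folder list is only an example.
set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"
set extendedPath to do shell script theEnv & "; echo \"$PATH\""

-- Compare the two values in the result pane.
{defaultPath, extendedPath}
```

The export lasts only for that single do shell script call, since each call starts a fresh shell.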
What is the need for the esoteric tell t to set … instruction in your handler?
to switchText from t to r instead of s
set d to text item delimiters
set text item delimiters to s
set t to t's text items
set text item delimiters to r
tell t to set t to item 1 & ({""} & rest)
set text item delimiters to d
t
end switchText
Isn’t it doing the same as:
to switchText from t to r instead of s
set d to text item delimiters
set text item delimiters to s
set t to t's text items
set text item delimiters to r
set t to t as text
set text item delimiters to d
t
end switchText
Yvan KOENIG (VALLAURIS, France) Saturday, 7 January 2012, 21:51:48
That subroutine belongs to Nigel and kai (http://macscripter.net/viewtopic.php?id=13008) who had written it during the days of ASCII and Unicode text usage (about 2005) in AppleScript. It preserved either type of encoding.
I would imagine that since AppleScript text is now Unicode, the distinction is no longer an issue, and your revision is appropriate. I wonder what Nigel has to say about this?
As I see Eric has just said, it’s a technique invented by Kai Edwards a few years ago when AppleScript differentiated between ‘string’ and ‘Unicode text’. If you used ‘as text’ or ‘as Unicode text’, the class of the result would be whatever was specified by the coercion. But if the rest of the list was concatenated to the first item, the result would be the same class as the original text. The list with the empty string mimicked an empty text item, so that an instance of the delimiter would be inserted there during the implicit coercion caused by the concatenation to item 1.
(1) What is the need for the & "" at the end of the instruction:
set theCmd to "echo " & quoted form of theText & " | grep -Ev " & quoted form of deletePhrase & ""
(2) What is the need for the E option in the same instruction?
I understood that adding the e option would be useful when the key string contains the - (minus) character,
but during my tests the code behaved the same with or without the E.
The & "" at the end of the command is merely a placeholder and is unnecessary (I use this code as a template for other subroutines).
Not all grep programs are the same. Some only offer “basic” functionality. The “-E” option offers extended functionality, which may not be available on all platforms or versions of grep.
I have not tested this option on OSX, but if you find it is not necessary for the execution of your code, then don’t use it. Its presence (or absence) should have very little effect on the speed of the subroutine.
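One case where -E does matter is alternation. With plain phrases such as "third" the option makes no visible difference, which matches the test result reported above, but in extended syntax a bare | means "or", whereas in basic syntax it is an ordinary character. A minimal sketch, using made-up sample text and assuming grep lives at /usr/bin/grep:

```applescript
set sampleText to "first line" & linefeed & "# a comment" & linefeed & "third line" & linefeed & "last line"

-- Drop lines that contain "third" OR start with "#".
set thePattern to "third|^#"
set theCmd to "printf '%s\\n' " & quoted form of sampleText & " | /usr/bin/grep -Ev -e " & quoted form of thePattern

-- do shell script returns the output with return line endings, so paragraphs splits it cleanly.
set keptLines to paragraphs of (do shell script theCmd)
-- keptLines is {"first line", "last line"}
```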
An AppleScript solution to your question about extracting indices follows:
set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
Here is a # (pound) symbol.
And finally a fourth line of text."
-- this is the phrase we will check against in the fileText
set deletePhrase to "^#"
set theAns to deleteLinesFromText(fileText, deletePhrase)
set theAnsList to the paragraphs of theAns
set theListCount to the count of theAnsList
repeat with iCtr from 1 to theListCount
set item iCtr of theAnsList to item 1 of stringToList(item iCtr of theAnsList, ":")
end repeat
on deleteLinesFromText(theText, deletePhrase)
set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:"
set theText to switchText from theText to linefeed instead of return
set newText to ""
try
set theCmd to "echo " & quoted form of theText & " | grep -Evn " & quoted form of deletePhrase & "" --& quoted form of deletePhrase & ""
set newText to do shell script theEnv & "; " & theCmd
on error
set newText to "ERROR in deleteLinesFromText: " & return & theText
end try
return newText
end deleteLinesFromText
to switchText from t to r instead of s
-- keeps encoding of individual list items intact by using {""}
set d to text item delimiters
set text item delimiters to s
set t to t's text items
set text item delimiters to r
tell t to set t to beginning & ({""} & rest)
set text item delimiters to d
t
end switchText
on stringToList(theString, theDelimiter)
set oldTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to theDelimiter
set resultList to text items of theString
set AppleScript's text item delimiters to oldTID
return resultList
end stringToList
You can trade this approach for one that will be considerably faster on very long lists. It uses more memory by creating a second list, but is faster because it eliminates the subroutine call that splits each list item into the index and the remainder of the item’s data. I list only the main routine here, as the subroutines are unchanged from the code above:
set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
Here is a # (pound) symbol.
And finally a fourth line of text."
-- this is the phrase we will check against in the fileText
set deletePhrase to "^#"
set theAns to deleteLinesFromText(fileText, deletePhrase)
set theAnsList to stringToList(theAns, {":", return})
set theListCount to the count of theAnsList
set the newList to {}
repeat with iCtr from 1 to theListCount by 2
set the end of my newList to item iCtr of theAnsList
end repeat
The trick in this approach is that a string can be split into a list using a list of several delimiters, which isolates each index because it sits between a line return and a colon. Unfortunately, a list item cannot be deleted efficiently, so instead the script loops through the list, takes every other item (which happen to be the indices), and places them in a new list. If the list is very long, the “my” construct used with newList speeds up the process considerably.
Finally, if you were to use UNIX, then piping the grep output to “sed” would work also.
Splitting the lines returned by the call to deleteLinesFromText was what I used.
I was just wondering if there was a way to do that through a parameter in the main call that I had missed.
I learnt something.
I thought that my sped up the handling of lists only when they were defined as properties/globals.
Yvan KOENIG (VALLAURIS, France) Monday, 9 January 2012, 16:21:24
Changing the appropriate line in the deleteLinesFromText handler would allow you to accomplish this in UNIX. This would be the replacement line:
This pipes the grep output to sed which looks for any characters after the colon and replaces them with an empty string. You can change the handler itself to create a parameter which would allow you to add the sed string programmatically, which may make the handler more generically useful.
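The replacement line itself does not appear above. Going by the description, it was presumably the original grep command with a sed stage appended. The following is a reconstruction under that assumption, not the poster’s exact line, shown with a small made-up sample so it can be run on its own:

```applescript
set theText to "alpha" & linefeed & "# comment" & linefeed & "gamma"
set deletePhrase to "^#"

-- grep -n prefixes each kept line with "index:"; sed then deletes everything from the first colon on.
set theCmd to "echo " & quoted form of theText & " | /usr/bin/grep -Evn -e " & quoted form of deletePhrase & " | /usr/bin/sed 's/:.*//'"

set lineIndices to paragraphs of (do shell script theCmd)
-- lineIndices is {"1", "3"}
```

Inside the handler, only the set theCmd line would change. The full paths to grep and sed are an assumption about a stock OS X install.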
set fileText to "Here's some text in a file
It contains few lines of text.
This is the -third line of text.
# a comment line!
A line with the # (pound) symbol.
And finally a last line of text."
-- this is the phrase we will check against in the fileText
set keyString to "-third"
(*
The last parameter may be :
"00" : Extract every line which doesn't contain keyString
"01" : idem + put the line index (in the source) at front of the line
"02" : return index of every line which doesn't contain keyString
"10" : Extract every line which contain keyString
"11" : idem + put the line index (in the source) at front of the line
"12" : return index of every line which contains keyString
*)
set theAns to deleteLinesFromText(fileText, keyString, "02")
on deleteLinesFromText(theText, keyString, flag)
set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:"
set theText to switchText from theText to linefeed instead of return
set newText to ""
try
if flag = "00" then
set theCmd to "echo " & quoted form of theText & " | grep -Eve " & quoted form of keyString & ""
else if flag = "01" then
(*
Extract every line which doesn't contain keyString
and put the line index (in the source) at front of the line
*)
set theCmd to "echo " & quoted form of theText & " | grep -Evne " & quoted form of keyString & ""
else if flag = "02" then
(*
Supposed to return index of lines without the keyString *)
set theCmd to "echo " & quoted form of theText & " | grep -Evne " & quoted form of keyString & " | sed s/:.*//"
else if flag = "10" then
(*
Extract lines containing keyString *)
set theCmd to "echo " & quoted form of theText & " | grep -Ee " & quoted form of keyString & ""
else if flag = "11" then
(*
Extract lines containing keyString
and put the line index (in the source) at front of the line *)
set theCmd to "echo " & quoted form of theText & " | grep -Ene " & quoted form of keyString & ""
else if flag = "12" then
(*
Return the index (in the source) of every line
which contains keyString *)
set theCmd to "echo " & quoted form of theText & " | grep -Ene " & quoted form of keyString & " | sed s/:.*//"
end if
set newText to do shell script theEnv & "; " & theCmd
on error
set newText to "ERROR: " & theText
end try
return newText
end deleteLinesFromText
to switchText from t to r instead of s
set d to text item delimiters
set text item delimiters to s
set t to t's text items
set text item delimiters to r
tell t to set t to item 1 & ({""} & rest)
set text item delimiters to d
t
end switchText
Yvan KOENIG (VALLAURIS, France) Monday, 9 January 2012, 18:40:03
Here is another way to do it with a general handler: Matt Neuburg’s filter handler. Nigel’s is still my favorite, though. It may not be the most readable, but its usefulness far outweighs any drawbacks!
on run
global searchterm
set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
Here is a # (pound) symbol.
And finally a fourth line of text."
set searchterm to "third line"
set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, linefeed}
set thetext to text items of fileText
set newtext to _filter(thetext, notContains)
log newtext
set newtext to text items of newtext as text
set AppleScript's text item delimiters to tids
log newtext
end run
on _filter(L, crit)
script filterer
property criterion : crit
on _filter(L)
if L = {} then return L
if criterion(item 1 of L) then
return {item 1 of L} & (_filter(rest of L))
else
return _filter(rest of L)
end if
end _filter
end script
return filterer's _filter(L)
end _filter
on notContains(x)
global searchterm
if x does not contain searchterm then return true
return false
end notContains