Thanks, I forgot to put back the literal period.
Just for the fun of it, here is the same thing as grep -o, done inline with sed and without tr.
[code]#!/usr/bin/sed -Enf
:start
# We jump back here as long as we aren’t done with a line.
h
# We save a copy of the line (pattern space) we are about to process.
s/(.*[^a-zA-Z0-9._%+-])([[:<:]][a-zA-Z0-9._%+-]{1,}@{1}[a-zA-Z0-9.-]{1,}\.[a-zA-Z]{2,4}[[:>:]])([^a-zA-Z].*)/\2/p
# We find a valid email address and print it.
t purge
# We did find one, so we’ll purge it from the line of text in hold space.
n
# We didn’t substitute anything, so we fall through to here, having grabbed the (n)ext line.
b start
# We branch back to the top.
:purge
g
# Purging: we get the pristine copy from hold space into pattern space.
s/(.*[^a-zA-Z0-9._%+-])([[:<:]][a-zA-Z0-9._%+-]{1,}@{1}[a-zA-Z0-9.-]{1,}\.[a-zA-Z]{2,4}[[:>:]])([^a-zA-Z].*)/\1\3/
# We remove the mail address we just printed from pattern space.
b start
# We jump back to start, looking for more mail addresses on the same line.
[/code]
Due to greediness in the regex, and laziness on my part, the mail addresses come out in the wrong order when there is more than one on a line.
The “grep” solution is the best for this problem, but like McUsr, I’ve been trying, just for the hell of it, to find an economical “sed” way to do it.
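The “grep” version referred to here isn’t shown in this part of the thread. A minimal sketch of what it might look like, assuming the same address pattern as the “sed” scripts and a “grep” that supports -o (the helper name extract_emails is made up for illustration):

```shell
# Hypothetical helper: print every e-mail-like substring on its own line.
# -o prints only the matched text; -E selects extended regular expressions.
extract_emails() {
  grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,4}'
}

printf '%s\n' 'hello <john@yahoo.com>, steve@apple.com' | extract_emails
# john@yahoo.com
# steve@apple.com
```

Because “grep -o” reports matches from left to right, the addresses come out in the order they appear on the line.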
An easy way would be to have two passes: the first inserting linefeeds immediately before and after each e-mail address and the second printing those lines of the result which contain the addresses:
set myText to "This is Fred's email address: <Fred.Jones@fibble.co.uk>.
applescript-users@lists.apple.com
rhubarb
hello <john@yahoo.com>, steve@apple.com " & tab & "- "
do shell script ("<<<" & quoted form of myText & " sed -En '/[[:<:]][[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,4}[[:>:]]/ { s//\\'$'\\n''&\\'$'\\n''/g ; p ; }' | sed -En '/^[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,4}$/ p ;'")
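For anyone trying the two-pass idea outside AppleScript, here is a rough plain-shell sketch of the same approach. It is an adaptation rather than the script above: the BSD-only [[:<:]] and [[:>:]] word-boundary markers are left out so that it should also run with other “sed” implementations, the second pass uses “grep” instead of a second “sed”, and the names ADDR and two_pass_emails are made up for illustration.

```shell
# Assumed address pattern, copied from the script above minus the word boundaries.
ADDR='[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,4}'

# Hypothetical helper implementing the two passes.
two_pass_emails() {
  # Pass 1: put a linefeed before and after every address
  #         (a backslash followed by a literal newline in the replacement).
  # Pass 2: keep only the lines that consist of exactly one address.
  sed -E "s/${ADDR}/\\
&\\
/g" | grep -E "^${ADDR}\$"
}

printf '%s\n' 'hello <john@yahoo.com>, steve@apple.com' | two_pass_emails
# john@yahoo.com
# steve@apple.com
```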
A one-pass method’s a bit harder, not least because there appears to be a bug in “sed” (as implemented in Snow Leopard) whereby if text is skipped with a “not linefeed” character class (“[^\n]*”), any “@” character in the way will be interpreted as a linefeed and sabotage the scan. The negative class “[^[:cntrl:]]” would be an acceptable substitute here, except that it trips on tabs and returns. The safest alternative I’ve found so far is “[[:print:]'$'\t\r'']”. This represents any printable character, plus the “control” characters tab and return, here provided as literal characters by the shell.
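The difference between the two character classes can be checked from the shell. This is only a small sketch: it uses “grep” rather than “sed”, builds the literal tab and return with printf instead of $'\t\r' so that it also works in a plain sh, and the helper names are made up.

```shell
TAB=$(printf '\t')   # literal tab character
CR=$(printf '\r')    # literal carriage return

# Hypothetical helpers: succeed when a whole input line is made up
# of characters from the given class.
all_non_cntrl()    { grep -qE '^[^[:cntrl:]]*$'; }
all_print_tab_cr() { grep -qE "^[[:print:]${TAB}${CR}]*\$"; }

# A line containing a tab fails the first class but passes the second.
printf 'a\tb\n' | all_non_cntrl    || echo 'rejected by [^[:cntrl:]]'
printf 'a\tb\n' | all_print_tab_cr && echo 'accepted by [[:print:]] plus tab and return'
```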
set myText to "This is Fred's email address: <Fred.Jones@fibble.co.uk>.
applescript-users@lists.apple.com
rhubarb
hello <john@yahoo.com>, steve@apple.com " & tab & "- "
do shell script ("<<<" & quoted form of myText & " sed -En '/[[:<:]][[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,4}[[:>:]]/ { s//\\'$'\\n''&\\'$'\\n''/g ; s/[[:print:]'$'\\t\\r'']*\\n([[:print:]'$'\\t\\r'']+\\n)/\\1/g ; s/\\n[[:print:]'$'\\t\\r'']*$//p ; }'")
Or the same thing with comments:
set myText to "This is Fred's email address: <Fred.Jones@fibble.co.uk>.
applescript-users@lists.apple.com
rhubarb
hello <john@yahoo.com>, steve@apple.com" & tab & "- "
do shell script ("<<<" & quoted form of myText & " sed -En '/[[:<:]][[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,4}[[:>:]]/ { # If a line contains one or more e-mail addresses ...
s//\\'$'\\n''&\\'$'\\n''/g ; # ... put a linefeed at the beginning and end of each address ...
s/[[:print:]'$'\\t\\r'']*\\n([[:print:]'$'\\t\\r'']+\\n)/\\1/g ; # ... delete everything before and between the addresses, leaving just the trailing linefeeds ...
s/\\n[[:print:]'$'\\t\\r'']*$//p ; # ... delete everything after the last address and print what is left.
}'")
Edits: Apostrophe removed from the last comment in the last script as it was causing an error! Wrong word corrected in the post narrative.
Nigel can read this better than Neo can…
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 . etc. for several lines.
Neo can’t be much good!
Nice solution. I am not going to speculate about how you found the bug (the non-linefeed character class).
I realized later that I could have adjusted my solution’s greediness, and sped it up some, by searching for an email address the way you have done.
It didn’t occur to me that I could wrap the email addresses between linefeeds!
As a matter of fact, I didn’t believe it was possible to do it in one pass, so your solution is just amazing!