Sed Context Grep

McUsrII · May 11, 2013, 12:46am

Hello.

I have actually gotten hold of some fast 30 lines of C, so I could have implemented it in AppleScript, but I provide the link here should anybody feel inclined.

I use sed because it is fast and not bloated like Perl, but it has a really small command set that must be practiced now and then. It is really simple: just a loop, a pattern space, and a hold space. But the terseness and the abstraction level combined make it a little bit tricky. I never fiddle with it in AppleScript before I have gotten it right in a Terminal, because the edit-test cycle is shorter there, even if it is in a sed script.
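To make the loop, pattern space, and hold space concrete, here is a minimal sketch (an illustration only, not one of the scripts posted in this thread). It prints each matching line together with the line before it. The function name `with_previous` and the sample words are made up for the example, and it assumes the pattern contains no `/` and that the first input line does not match.

```shell
# Hypothetical helper: print every line matching $1, preceded by the previous line.
# Assumes the first input line does not match (the hold space would still be empty).
with_previous() {
    sed -n '
/'"$1"'/ {
    # swap: the pattern space now contains the previous line
    x
    p
    # swap back and print the matching line itself
    x
    p
}
# remember the current line for the next cycle
h'
}

printf 'one\ntwo\nrain\nfour\nfive\nSpain\n' | with_previous ain
```

Each cycle ends with `h`, so the hold space always carries the line that was just processed into the next cycle.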

kel1 · May 11, 2013, 1:11am

McUsr, your grasp of what’s going on is amazing. Tricky is good and it makes it fun. There are many ways to do things.

I never liked C in school. I liked Pascal better back then. The wording was more like a higher-level language. Maybe that’s why I never became a programmer.

I don’t know why I started reminiscing, but one day I was on the internet back in about 1981, trying to connect to a game. When I first saw typing on my screen, it was great. The site was giving directions, all in text. Compared to now, it’s unreal, man.

Edited: no, it was about 1984.

DJ_Bazzie_Wazzie · May 11, 2013, 6:37am

Are you sure about that? Pascal is at the same level as C, and the performance depends on the compiler. Pascal is fading away simply because Apple stopped developing their OS in Pascal in the early '90s, which means it is no longer an OS language. Pascal proved itself, at a time when every bit and cycle counted, to be as good as C.

McUsrII · May 11, 2013, 9:25am

Hello.

There is, as a matter of fact, a good free Pascal compiler out there, GNU Pascal, which supports almost every Pascal dialect there was. I haven’t had the time to check it out, but I’d look for that if you want to have a go with it.

The problem with Pascal, as I see it, is the lack of standardized libraries, and that it is a kind of theoretical language. It is nicer than C, but the “clean” approach makes it (or made it, when I used it) much more limiting.

Yesterday I wrote a C program that used goto to jump into the middle of a case statement, installed a longjmp handler, and jumped out again with another goto. The longjmp was later used from within a handler for a signal that was issued when the program was awakened again after it had suspended itself. It worked! Try that in Pascal! (You probably can, but...)

DJ_Bazzie_Wazzie · May 11, 2013, 11:21am

It is possible in Pascal, but as in many programming languages (including C), a goto is considered a sign of poor design and bad software development, and it should be the programmer’s last choice. So whether goto in Pascal is easier or harder than in C, it is still considered equally poor design. However, some programming languages don’t have a block structure like C or Pascal and use goto (with a line number) instead, as in BBC Basic.

McUsrII · May 11, 2013, 11:25am

Hello.

Sometimes a goto is the solution, sometimes a longjmp, but they should be avoided if possible and used sparingly. They are OK when they simplify things overall, and I guess that is why they are included. They are not to be used as in BBC Basic, but only when there isn’t any good structured solution.

Nigel_Garvey · May 11, 2013, 11:35am

Hi.

I’m not sure this is the best way to do it, but it seems to work with the given text and variations. Not quite the same as grep -C though. I’ve left it uncommented.

set t to quoted form of "The
rain
drain
in
Spain
stays
mainly
in
the
plain"
set cmd to "echo " & t & " | sed -n '
1 h
2,$ {
	/ain/ !{
		x
		/ain/ {
			G
			p
			$ !s/^.*$/---/p
			c\\'$'\\n''
		}
	}
	/ain/ {
		x
		/ain/ {
			G
			p
			s/^[^[:cntrl:]]*\\n//
			h
			$ !s/^.*$/---/p
			$ c\\'$'\\n''
		}
		/---/ !{
			s/^[^[:cntrl:]]*\\n//
			G
			h
		}
	}
}
$ {
	g
	/ain/ p
}'"
do shell script cmd

McUsrII · May 11, 2013, 12:25pm

kel1 · May 12, 2013, 7:18am

I finally got how to make it loop at the right place in the script, so it could print the lines before and after the pattern(s). Not sure, if this is how grep works when there are consecutive matching lines. I need to check out how Nigel’s results look.


set t to quoted form of "The
rain
train
in
Spain
stays
mainly
in
the
stain
plain."
set cmd to "echo " & t & " | sed -n '
/ain/ !{
x
d
}
/ain/ {
x
# print the line before
p
# get a copy of the matching line
x
# print the matching line
p
:loop
# clear pattern space and get next line
n
# print next line
p
# branch if line contains pattern
/ain/ b loop
# print divider
a\\
---
x
}'"
do shell script cmd

Thanks a lot,

Nigel_Garvey · May 12, 2013, 8:19am

Your results are closer to grep’s than mine. grep merges overlapping contexts.

set t to quoted form of "The
rain
train
in
Spain
stays
mainly
in
the
stain
plain."
set cmd to "echo " & t & " | grep -C1 'ain'"
do shell script cmd
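The merging can be seen in isolation with two tiny inputs. This is only a sketch with made-up sample lines, and it assumes a grep that prints `--` as its group separator (as GNU grep and the grep shipped with OS X do by default).

```shell
# Two adjacent matches: their one-line contexts overlap, so grep prints one merged group.
merged=$(printf 'a\nrain\ntrain\nb\n' | grep -C1 'ain')

# Two matches far apart: the skipped lines are replaced by a "--" separator.
split=$(printf 'a\nrain\nb\nc\nd\ne\nSpain\nf\n' | grep -C1 'ain')

printf '%s\n' "$merged" "==" "$split"
```

No line is ever printed twice, which is the main difference from a script that prints a full before/match/after group for every match.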

McUsrII · May 12, 2013, 9:28am

Hello.

It is very admirable kel!

Here is a technique for using sed that I have been playing with, in order to make it more versatile (or to use as little of it as possible).

The idea here is to use sed in combination with other tools, keeping as much as you can of the precomposed input in separate files, and to leverage the fact that you can specify stdin as either “-”, /dev/fd/0, or /dev/stdin (one of these normally works).

OK, so I have a preamble of html, called file1, containing everything up to and including the body tag, and I have a “postamble”, called file2, containing the closing body tag and the rest of it, and a list of, say, Greek letters spelled out in Latin, and the sed script below to make an unordered list:

[code]#!/usr/bin/sed -nf
1i\
<ul>
s_^.*_<li>&</li>_p
$a\
</ul>
[/code]

Then, in order to generate an html file with the result, I may use a commandline like this:

cat alfabet |./ul.sed |cat file1 - file2 >alfa.html
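The whole pipeline can be tried end to end with throwaway files. The following is a sketch under assumptions: the function name `build_page`, the temporary files, and the two sample words are hypothetical, and the `<ul>`/`<li>` wrapping is one plausible way to produce the unordered list described above.

```shell
# Hypothetical helper: $1 = preamble file, $2 = postamble file, list items on stdin.
build_page() {
    # sed wraps the items in a list; cat then reads that list from "-" (stdin)
    # and places it between the two precomposed files.
    sed -n '
1i\
<ul>
s_^.*_<li>&</li>_p
$a\
</ul>' | cat "$1" - "$2"
}

tmp=$(mktemp -d)
printf '<html><body>\n' > "$tmp/file1"
printf '</body></html>\n' > "$tmp/file2"
printf 'alpha\nbeta\n' | build_page "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```

The point of the design is that sed only handles the repetitive part, while the fixed text stays in ordinary files that are easy to edit.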

DJ_Bazzie_Wazzie · May 12, 2013, 6:36pm

In AWK it can be done like this, with overlap:


do shell script "awk 'BEGIN{first=1}/ain/{
	if (first==0)
		print \"---\"
	print head
 	print $0
	getline
	print $0
	first=0 };{head=$0}' <<< " & t

or without overlap:


do shell script "awk 'BEGIN{first=1}/ain/{
	if (first==0)
		print \"---\"
	print head
 	print $0
	getline
	print $0
	getline
	first=0 };{head=$0}' <<< " & t
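Both versions read the line after a match with getline without testing it, so a match that directly follows another match is treated as plain context. A variation that checks every line and merges contexts the way grep -C1 does could look like the sketch below. The function name `context1` and the variable names are made up for the example, and the `--` separator is chosen here only to imitate grep.

```shell
# Hypothetical helper: one line of context on each side, contexts merged,
# "--" printed only where lines were skipped between two groups.
context1() {
    awk -v pat="$1" '
    $0 ~ pat {
        # Lines were skipped since the last printed line: print a separator.
        if (last > 0 && NR - 1 > last + 1) print "--"
        # Print the line before the match unless it was already printed.
        if (NR > 1 && last < NR - 1) print prev
        print
        last = NR
        after = 1
        prev = $0
        next
    }
    # One line of trailing context after a match.
    after > 0 { print; last = NR; after = 0 }
    { prev = $0 }
    '
}

printf 'The\nrain\ntrain\nin\nSpain\nstays\n' | context1 ain
```

Tracking the number of the last printed line in `last` is what prevents a line from being printed twice.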

McUsrII · May 12, 2013, 8:02pm

One day soon, I’ll write one in C that mimics Nigel Garvey’s version in sed, but with filenames and line numbers.

cgrep is a tool I hopefully don’t use too much, but it is really handy when you are tracking a variable or something, and as useful as diff for detecting changes between two versions.

It feels kind of weird at first to just look at fragments of files, but I experienced that as a boon once I got used to it: I don’t have to navigate, and I only look at the parts that interest me, there and then.

kel1 · May 12, 2013, 8:22pm

Hi,

Nigel,

Your script was actually what I was thinking it should be with the text as is. When I add a matching pattern in the first line:


set t to quoted form of "The brain
rain
drain
in
Spain
stays
mainly
in
the
plain"
set cmd to "echo " & t & " | sed -n '
1 h
2,$ {
	/ain/ !{
		x
		/ain/ {
			G
			p
			$ !s/^.*$/---/p
			c\\'$'\\n''
		}
	}
	/ain/ {
		x
		/ain/ {
			G
			p
			s/^[^[:cntrl:]]*\\n//
			h
			$ !s/^.*$/---/p
			$ c\\'$'\\n''
		}
		/---/ !{
			s/^[^[:cntrl:]]*\\n//
			G
			h
		}
	}
}
$ {
	g
	/ain/ p
}'"
do shell script cmd

The second and third matches don’t come out right.

Hi DJ Bazzie Wazzie,

I have a new library; awk. Thought I’d start with sed first because I’ve tried to learn it before, but the tutorials on awk seemed complicated. Thanks for the intro.

Well, my biorhythm’s intellectual line is going up (just passed zero), so now is the time to learn this stuff.

kel1 · May 12, 2013, 9:11pm

Just remembered what I read on the internet somewhere, that grep cannot print the same line more than once. That’s why the grep output is incomplete and maybe has those double dividers sometimes.

Edited: I don’t know what the person meant by this. Grep seems to print the same line. I need to read that post again, if I can find it. I probably read it wrong.

Nigel_Garvey · May 12, 2013, 9:21pm

I could have sworn I tested with that before I posted. Obviously not. Sorry. I’ll try and fix it, but not tonight.

DJ_Bazzie_Wazzie · May 13, 2013, 12:15pm

You’re welcome.

It’s good to start with regular expressions first and then move on to AWK. AWK, a predecessor of Perl, is much more versatile, which can feel like a leap in the dark. When you’re already used to C-style syntax and write in a less compressed style, AWK scripts are easier to read than sed, IMO. The latest version from Kernighan (one of AWK’s authors and co-author of The C Programming Language) is considered the “one true AWK”, and it is the standard awk on FreeBSD (and also Mac OS X). I would also recommend the book associated with this version of AWK, ISBN 0-201-07981-X (which is listed at the bottom of the AWK man page); it tells you everything you should know about AWK.

Nigel_Garvey · May 13, 2013, 12:37pm

OK. A slight rethink wherein the hold space acts similarly to a three-line FIFO stack, except that the lines are all retrieved at once and only the last two are put back before another is added. The stack is output complete if its penultimate line contains “ain”, with adjustments at the last line. It seems to work.

set t to quoted form of "The brain
rain
drain
in
Spain
stays
mainly
in
the
Staines
plain
again
Jane

again"

set cmd to "echo " & t & " | sed -En '
# Move the first line to the \"stack\" (hold space)
1 h
# With each of the following lines in turn:
1 !{
	# Append the line to the stack and retrieve the stack contents.
	H
	g
	# If the penultimate retrieved line contains \"ain\", print them all and, if not at the last line of text, output a separator.
	/ain[^[:cntrl:]]*\\n[^[:cntrl:]]*$/ {
		p
		$ !i\\'$'\\n''---
	}
	# If three lines were taken from the stack, lose the first one and push the other two back.
	/^[^[:cntrl:]]*\\n([^[:cntrl:]]*\\n[^[:cntrl:]]*)$/ {
		s//\\1/
		h
	}
}
# At the last line of the text, after the above processes, the pattern space contains either the first line (if it is the only one) or the last two (the edited stack contents).
$ {
	# If the line contains \"ain\":
	/ain[^[:cntrl:]]*$/ {
		# If there is a line before it also containing \"ain\", output another separator.
		/ain[^[:cntrl:]]*\\n/ i\\'$'\\n''---
		# Print the pattern space contents.
		p
	}
}'"

do shell script cmd

Edit: Comments revised.
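The sliding-window idea described above (keep the last lines, output the window when its middle line matches, handle the final line separately) can also be sketched in awk, which may make the logic easier to follow. This is not a translation of the sed script, only an illustration of the same approach; the names `window3` and `emit` are made up for the example.

```shell
# Hypothetical helper: print a three-line window whenever its middle line matches $1.
window3() {
    awk -v pat="$1" '
    # Print lines first..last of the window, with "---" between groups.
    function emit(first, last,    i) {
        if (printed) print "---"
        for (i = first; i <= last; i++) print line[i]
        printed = 1
    }
    { line[NR] = $0 }
    # Line NR-1 now has both neighbours available: output the window if it matches.
    NR > 1 && line[NR - 1] ~ pat { emit((NR > 2) ? NR - 2 : 1, NR) }
    # Keep only the last two lines, like pushing two lines back onto the stack.
    { delete line[NR - 2] }
    # The last line has no following line, so it is checked on its own.
    END { if (NR > 0 && line[NR] ~ pat) emit((NR > 1) ? NR - 1 : 1, NR) }
    '
}

printf 'The\nrain\nin\nthe\nplain\n' | window3 ain
```

As with the sed version, overlapping windows are printed separately rather than merged, so a line can appear in more than one group.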

McUsrII · May 13, 2013, 1:19pm

The lines aren’t numbered. :lol:

Just kidding. It’s just Brilliant!

McUsrII · May 13, 2013, 2:55pm

Actually, I figured I could write what I wanted faster in sed and implement it in a shell script, rather than write it in C. (The sole intent is to have a tool to trace a variable/function call throughout a source code file.)

What I want is a short context that is broken when a new match appears, and I want line numbers.

Given the latest “poem” from the post above, the output looks like this:

[code] 1 The brain

 2	rain

 3	drain
 4	in

 5	Spain
 6	stays

 7	mainly
 8	in

 9	the
10	Staines

11	plain

12	again
13	Jane

14	
15	again"[/code]

I implemented it as a shell script, which should be easy to use from a do shell script. The regexp is given as a single parameter, quoted with double quotes on the command line, and the input must be redirected into it.

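[code]#!/bin/bash
if [ $# -ne 1 ] ; then
echo "Usage: cgrep \"pattern\" <input

Prints 1 line before and after a match,
and adds three dashes to split the
sequences. Line numbers are added.
Cgrep breaks the context if the next
line is a match by itself.
" >/dev/tty
exit 2
fi
nl -b a |/usr/bin/sed -n '
# First line: hold it if it does not match, otherwise print it with a divider.
1 {
	/'"$1"'/! h
	/'"$1"'/ {
		p
		a\
---\

		n
	}
}
:op
1! {
	# No match: remember the line as possible leading context.
	/'"$1"'/! {
		h
		n
	}
	/'"$1"'/ {
		x
		# The held line matched too: print only the current match.
		/'"$1"'/ {
			g
			p
		}
		# Otherwise print the held line, then the match.
		/'"$1"'/! {
			p
			g
			p
		}
		n
		# The next line matches as well: break the context and start over.
		/'"$1"'/ {
			i\
---\

			b op
		}
		# The next line is trailing context: print it and close the group.
		/'"$1"'/! {
			p
			a\
---\

		}
	}
}
'[/code]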
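The numbering step can be tried on its own. The sketch below is only an illustration of why `nl -b a` is used: unlike plain `nl`, it numbers blank lines as well, so the numbers stay in step with the real line positions. The function name `numbered_context` is made up, and grep -C1 stands in for the sed part, so unlike cgrep it merges contexts instead of breaking them.

```shell
# Hypothetical helper: number every input line, then show matches of $1
# with one line of context on each side.
numbered_context() {
    # -b a numbers all lines, including empty ones.
    nl -b a | grep -C1 -- "$1"
}

printf 'The\nrain\n\nin\nSpain\n' | numbered_context ain
```

Because the numbers are part of each line by the time the pattern is applied, a pattern made of digits would also match the line numbers.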