Unix Shell - Sed, Awk, or both?


Author Sed, Awk, or both?
zcui@yahoo.com

2006-01-18, 5:55 pm

I need a script to automatically modify PostScript files. The
PostScript file has patterns like the following:

....
TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
0 1 550 539 575 514 box
( ) 0 1 12 552 524.5 st
1 5 0 537 526 tr
1 2 180 161 324 sq <--- Pattern (1)
1 2 90 71 330 tr <--- Pattern (1)
1 2 180 196 337 tr <--- Pattern (1)
454 320 446 320 dl
....
TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
1 5 0 161 324 tr <--- Pattern (2)
1 5 0 71 330 tr <--- Pattern (2)
1 5 0 196 337 tr <--- Pattern (2)
1 5 45 193 346 sq
....

Pattern (1) occurs only on Page 1; its columns 1-3 and 6 can be
anything.
Pattern (2) occurs only on Page 2; its columns 1-3 and 6 are always the
same, like "1 5 0 x y tr"

What I need is
1) Find the Pattern (2) first by /^1 5 0 [1-9][0-9]*[0-9]*
[1-9][0-9]*[0-9]* tr$/, save the column 4 and 5 as a string (pattern
x).
2) Then, search the string (pattern x) and delete the line in Page 1.
3) Delete all the lines found by Pattern (2) on Page 2.

What's the best way to do these?

Thanks for any suggestions.
Scott

Janis Papanagnou

2006-01-18, 8:49 pm

zcui@yahoo.com wrote:
> I need a script to automatically modify PostScript files. The
> PostScript file has patterns like following:
>
> ...
> TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
> 0 1 550 539 575 514 box
> ( ) 0 1 12 552 524.5 st
> 1 5 0 537 526 tr
> 1 2 180 161 324 sq <--- Pattern (1)
> 1 2 90 71 330 tr <--- Pattern (1)
> 1 2 180 196 337 tr <--- Pattern (1)
> 454 320 446 320 dl
> ...
> TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
> 1 5 0 161 324 tr <--- Pattern (2)
> 1 5 0 71 330 tr <--- Pattern (2)
> 1 5 0 196 337 tr <--- Pattern (2)
> 1 5 45 193 346 sq
> ...
>
> Pattern (1) is only in Page 1, its column 1-3 and 6 can be something
> else.
> Pattern (2) is only in Page 2, its column 1-3 and 6 are always same,
> like "1 5 0 x y tr"
>
> What I need is
> 1) Find the Pattern (2) first by /^1 5 0 [1-9][0-9]*[0-9]*
> [1-9][0-9]*[0-9]* tr$/, save the column 4 and 5 as a string (pattern
> x).
> 2) Then, search the string (pattern x) and delete the line in Page 1.
> 3) Delete all the lines found by Pattern (2) on Page 2.
>
> What's the best way to do these?
>
> Thanks for any suggestions.
> Scott
>


I'd use awk to solve that.

If I understand your task correctly, you'll need two-pass processing; so
call your program, e.g., as in

awk -f yourprog.awk yourdata.ps yourdata.ps

Your awk program needs commands like the subsequent ones... (untested)

Identify the page and memorize the state

/TR12 setfont \(Page / { onPage1 = 0; onPage2 = 0 }
/TR12 setfont \(Page 1 / { onPage1 = 1; onPage2 = 0 }
/TR12 setfont \(Page 2 / { onPage1 = 0; onPage2 = 1 }

In the first pass find pattern only on page 2 and store data for second pass

(NR == FNR) && onPage2 && /^1 5 0 [1-9][0-9]*[0-9]*/ { store[$4,$5] }

In the second pass suppress on page 1 output of stored patterns

(NR != FNR) && onPage1 && (($4,$5) in store) { next }

Print all the rest in the second pass

NR != FNR

The complete awk program is the six lines above, and, as I said, untested.
It might already work as you expect, but if not it might at least give you
some hints on how to approach this type of problem using awk.
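
Assembled into one piece, the rules above might look like the following sketch. It is a shell wrapper (the function name strip_dupes is made up for illustration) that assumes the "TR12 setfont (Page N of M )" header format from the sample. Compared with the six lines above, it resets the page state at the start of each pass and also drops the Pattern (2) lines on page 2, as asked for in step 3 of the original post. Treat it as a starting point rather than a finished tool.

```shell
# strip_dupes FILE: hypothetical wrapper around the two-pass awk idea.
# The same file is passed to awk twice: pass 1 collects, pass 2 filters.
strip_dupes() {
    awk '
        # Start every pass with a clean page state.
        FNR == 1 { onPage1 = 0; onPage2 = 0 }

        # Track the current page (parentheses escaped for the regex).
        /TR12 setfont \(Page /   { onPage1 = 0; onPage2 = 0 }
        /TR12 setfont \(Page 1 / { onPage1 = 1 }
        /TR12 setfont \(Page 2 / { onPage2 = 1 }

        # Pass 1: remember the coordinates of each Pattern (2) line.
        NR == FNR {
            if (onPage2 && /^1 5 0 [1-9][0-9]* [1-9][0-9]* tr$/) store[$4, $5]
            next
        }

        # Pass 2: drop page 1 lines with stored coordinates,
        # drop the Pattern (2) lines on page 2, print everything else.
        onPage1 && (($4, $5) in store) { next }
        onPage2 && /^1 5 0 [1-9][0-9]* [1-9][0-9]* tr$/ { next }
        { print }
    ' "$1" "$1"
}
```

Usage would be along the lines of: strip_dupes yourdata.ps > cleaned.ps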

Hope that helps.

Janis
hq00e

2006-01-19, 2:57 am


zcui@yahoo.com wrote:
> I need a script to automatically modify PostScript files. The
> PostScript file has patterns like following:
>
> ...
> TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
> 0 1 550 539 575 514 box
> ( ) 0 1 12 552 524.5 st
> 1 5 0 537 526 tr
> 1 2 180 161 324 sq <--- Pattern (1)
> 1 2 90 71 330 tr <--- Pattern (1)
> 1 2 180 196 337 tr <--- Pattern (1)
> 454 320 446 320 dl
> ...
> TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
> 1 5 0 161 324 tr <--- Pattern (2)
> 1 5 0 71 330 tr <--- Pattern (2)
> 1 5 0 196 337 tr <--- Pattern (2)
> 1 5 45 193 346 sq
> ...
>
> Pattern (1) is only in Page 1, its column 1-3 and 6 can be something
> else.
> Pattern (2) is only in Page 2, its column 1-3 and 6 are always same,
> like "1 5 0 x y tr"
>
> What I need is
> 1) Find the Pattern (2) first by /^1 5 0 [1-9][0-9]*[0-9]*
> [1-9][0-9]*[0-9]* tr$/, save the column 4 and 5 as a string (pattern
> x).
> 2) Then, search the string (pattern x) and delete the line in Page 1.
> 3) Delete all the lines found by Pattern (2) on Page 2.


It can be done both with sed and awk. Here is a 2-pass sed solution
(you may need to do some adjustment to fit your situation).

$ sed -f <(sed -n -e '/(Page 2/,/(Page 3/{s:^1 5 0 \(.*\) tr.*:/\1/d:p' \
-e } pfile) pfile
TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
0 1 550 539 575 514 box
( ) 0 1 12 552 524.5 st
1 5 0 537 526 tr
454 320 446 320 dl
....
TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
1 5 45 193 346 sq
....

The logic is straightforward. First, make a sed script from the input
file:

/161 324/d
/71 330/d
/196 337/d

Then, with that script, we can delete all lines (on both page 1 and page 2)
that contain the coordinates from pattern (2). For a more accurate result,
try generating a script like this yourself:
/^\(.[^ ]* \)\{3\}161 324 /d
/^\(.[^ ]* \)\{3\}71 330 /d
/^\(.[^ ]* \)\{3\}196 337 /d
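
One way to generate such an anchored script automatically is sketched below. The helper names make_delete_script and apply_delete_script are invented here, the page headers are assumed to contain "(Page 2 " and "(Page 3 ", and a temporary file stands in for the process substitution so that it also runs in a plain POSIX shell. Like the simpler script, it deletes matching lines on every page, not only on pages 1 and 2.

```shell
# make_delete_script FILE: hypothetical helper. For each Pattern (2) line
# between the page 2 and page 3 headers, print one sed delete command that
# only matches the coordinates in columns 4 and 5.
make_delete_script() {
    sed -n '/(Page 2 /,/(Page 3 /{
        s:^1 5 0 \([1-9][0-9]* [1-9][0-9]*\) tr$:/^[^ ][^ ]* [^ ][^ ]* [^ ][^ ]* \1 /d:p
    }' "$1"
}

# apply_delete_script FILE: hypothetical helper. Generate the script into
# a temporary file, run it over the same input, then clean up.
apply_delete_script() {
    script=$(mktemp) || return 1
    make_delete_script "$1" > "$script"
    sed -f "$script" "$1"
    rm -f "$script"
}
```

For the sample data, make_delete_script pfile should print three commands of the form /^[^ ][^ ]* [^ ][^ ]* [^ ][^ ]* 161 324 /d, and apply_delete_script pfile runs them.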

--
Regards,
hq00e

William James

2006-01-19, 2:57 am

zcui@yahoo.com wrote:
> I need a script to automatically modify PostScript files. The
> PostScript file has patterns like following:
>
> ...
> TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
> 0 1 550 539 575 514 box
> ( ) 0 1 12 552 524.5 st
> 1 5 0 537 526 tr
> 1 2 180 161 324 sq <--- Pattern (1)
> 1 2 90 71 330 tr <--- Pattern (1)
> 1 2 180 196 337 tr <--- Pattern (1)
> 454 320 446 320 dl
> ...
> TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
> 1 5 0 161 324 tr <--- Pattern (2)
> 1 5 0 71 330 tr <--- Pattern (2)
> 1 5 0 196 337 tr <--- Pattern (2)
> 1 5 45 193 346 sq
> ...
>
> Pattern (1) is only in Page 1, its column 1-3 and 6 can be something
> else.
> Pattern (2) is only in Page 2, its column 1-3 and 6 are always same,
> like "1 5 0 x y tr"
>
> What I need is
> 1) Find the Pattern (2) first by /^1 5 0 [1-9][0-9]*[0-9]*
> [1-9][0-9]*[0-9]* tr$/, save the column 4 and 5 as a string (pattern
> x).
> 2) Then, search the string (pattern x) and delete the line in Page 1.
> 3) Delete all the lines found by Pattern (2) on Page 2.
>
> What's the best way to do these?
>
> Thanks for any suggestions.
> Scott


Using Ruby:

a = [['0']]
while gets
a << [ $1 ] if $_ =~ /^TR12 setfont \(Page (\d) of /
a.last << $_
end
a.assoc('2').reject!{ |s|
if md = /^1 5 0( \d+ \d+ )/.match( s )
a.assoc('1').reject!{ |x| x =~ /^\d+ \d+ \d+#{md[1]}/ }
true # drop the Pattern (2) line even if nothing matched on page 1
end
}
a.each{ |x| puts x[1..-1] }

Ed Morton

2006-01-19, 8:11 am

zcui@yahoo.com wrote:

> I need a script to automatically modify PostScript files. The
> PostScript file has patterns like following:
>
> ...
> TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
> 0 1 550 539 575 514 box
> ( ) 0 1 12 552 524.5 st
> 1 5 0 537 526 tr
> 1 2 180 161 324 sq <--- Pattern (1)
> 1 2 90 71 330 tr <--- Pattern (1)
> 1 2 180 196 337 tr <--- Pattern (1)
> 454 320 446 320 dl
> ...
> TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
> 1 5 0 161 324 tr <--- Pattern (2)
> 1 5 0 71 330 tr <--- Pattern (2)
> 1 5 0 196 337 tr <--- Pattern (2)
> 1 5 45 193 346 sq
> ...
>
> Pattern (1) is only in Page 1, its column 1-3 and 6 can be something
> else.
> Pattern (2) is only in Page 2, its column 1-3 and 6 are always same,
> like "1 5 0 x y tr"
>
> What I need is
> 1) Find the Pattern (2) first by /^1 5 0 [1-9][0-9]*[0-9]*
> [1-9][0-9]*[0-9]* tr$/, save the column 4 and 5 as a string (pattern
> x).
> 2) Then, search the string (pattern x) and delete the line in Page 1.
> 3) Delete all the lines found by Pattern (2) on Page 2.
>
> What's the best way to do these?
>
> Thanks for any suggestions.
> Scott
>


Try this:

$ cat file
....
TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
0 1 550 539 575 514 box
( ) 0 1 12 552 524.5 st
1 5 0 537 526 tr
1 2 180 161 324 sq
1 2 90 71 330 tr
1 2 180 196 337 tr
454 320 446 320 dl
....
TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
1 5 0 161 324 tr
1 5 0 71 330 tr
1 5 0 196 337 tr
1 5 45 193 346 sq
....

$ cat rmvLines.awk
BEGIN{ ARGV[ARGC++] = ARGV[1]; phase = 0 }
/Page 1/ && (phase == 0) { phase = 1 } # first pass, page 1
/Page 2/ && (phase == 1) { phase = 2 } # first pass, page 2
/Page 3/ && (phase == 2) { phase = 3 } # first pass, page 3+
/Page 1/ && (phase != 1) { phase = 4 } # second pass, page 1
/Page 2/ && (phase == 4) { phase = 5 } # second pass, page 2
/Page 3/ && (phase == 5) { phase = 6 } # second pass, page 3+

{ key = $4 " " $5 }
phase == 2 && /^1 5 0 [1-9][0-9]* [1-9][0-9]* tr$/ { keys[key] }
phase ~ /4|5/ && !(key in keys) { print }
phase ~ /0|6/ { print }

$ awk -f rmvLines.awk file
....
TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
0 1 550 539 575 514 box
( ) 0 1 12 552 524.5 st
1 5 0 537 526 tr
454 320 446 320 dl
....
TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
1 5 45 193 346 sq
....

Regards,

Ed.
William Park

2006-01-19, 6:24 pm

zcui@yahoo.com wrote:
> I need a script to automatically modify PostScript files. The
> PostScript file has patterns like following:
>
> ...
> TR12 setfont (Page 1 of 4 ) 2 0 12 515 755 st
> 0 1 550 539 575 514 box
> ( ) 0 1 12 552 524.5 st
> 1 5 0 537 526 tr
> 1 2 180 161 324 sq <--- Pattern (1)
> 1 2 90 71 330 tr <--- Pattern (1)
> 1 2 180 196 337 tr <--- Pattern (1)
> 454 320 446 320 dl
> ...
> TR12 setfont (Page 2 of 4 ) 2 0 12 515 755 st
> 1 5 0 161 324 tr <--- Pattern (2)
> 1 5 0 71 330 tr <--- Pattern (2)
> 1 5 0 196 337 tr <--- Pattern (2)
> 1 5 45 193 346 sq
> ...
>
> Pattern (1) is only in Page 1, its column 1-3 and 6 can be something
> else.
> Pattern (2) is only in Page 2, its column 1-3 and 6 are always same,
> like "1 5 0 x y tr"
>
> What I need is
> 1) Find the Pattern (2) first by /^1 5 0 [1-9][0-9]*[0-9]*
> [1-9][0-9]*[0-9]* tr$/, save the column 4 and 5 as a string (pattern
> x).


This is the most difficult part. To get 3 lines from 'TR12 setfont...',
then slice out 'x' and 'y' (column 4 and 5),

sed -n -e '/^TR12 setfont (Page 2 of 4/ {n; N; N; p; q}' \
| awk '{print $4, $5}' > x_y

> 2) Then, search the string (pattern x) and delete the line in Page 1.


Something like
sed '/^TR12 setfont (Page 1 of 4/,/Page 2/ { /.../d; /.../d; /.../d;}'

> 3) Delete all the lines found by Pattern (2) on Page 2.
>
> What's the best way to do these?
>
> Thanks for any suggestions.
> Scott


--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
zcui@yahoo.com

2006-01-20, 6:03 pm

Thank you all for the help.

Since I have to modify the data for several other patterns
besides patterns 1 and 2, I finally wrote a PERL program to do the
job.

I'll take your suggestions to try Sed and Awk later when I have time.

Thanks again.
Scott

Ed Morton

2006-01-20, 6:03 pm

zcui@yahoo.com wrote:

> Thank you all for the help.
>
> Since I have to modify the data for several other patterns
> besides patterns 1 and 2, I finally wrote a PERL program to do the
> job.
>
> I'll take your suggestions to try Sed and Awk later when I have time.
>
> Thanks again.
> Scott
>


Would you mind posting the PERL program so we can see how it looks in
contrast to the awk program?

Ed.
James

2006-01-20, 6:03 pm

How about this?

open F,$ARGV[0];
undef %pat;
while (<F> ) {
$N = $1 if /Page (\d+) of /;
$pat{$1} = 1 if $N == 2 && /^1 5 0 (\d+ \d+) tr$/;
}
seek(F,0,0);
while (<F> ) {
$N = $1 if /Page (\d+) of /;
next if $N < 3 && /^\d+ \d+ \d+ (\d+ \d+) / && $pat{$1};
print;
}

It would be nice if awk can refer to the matched pattern directly.

James

Ed Morton wrote:
> zcui@yahoo.com wrote:
>
>
> Would you mind posting the PERL program so we can see how it looks in
> contrast to the awk program?
>
> Ed.


Ed Morton

2006-01-21, 2:49 am

James wrote:
> How about this?
>
> open F,$ARGV[0];
> undef %pat;
> while (<F> ) {
> $N = $1 if /Page (\d+) of /;
> $pat{$1} = 1 if $N == 2 && /^1 5 0 (\d+ \d+) tr$/;


If I understand it correctly, \d represents any digit, but the OP only
wants numbers that start with 1-9, not with zero.

> }
> seek(F,0,0);


I assume the above gets you back to the start of the file, so you're
about to start the second phase of parsing.

> while (<F> ) {
> $N = $1 if /Page (\d+) of /;


Do you need to reset $N to zero before starting this second loop so it
doesn't retain its value from the first pass for the header text
preceding "Page 1"?

> next if $N < 3 && /^\d+ \d+ \d+ (\d+ \d+) / && $pat{$1};
> print;
> }



> It would be nice if awk can refer to the matched pattern directly.


Yes, it would. Thanks for posting that (but please don't top-post in
future).

In case anyone else finds it interesting, the equivalent design written
in awk (given the OPs posted input format) would be:

BEGIN{ ARGV[ARGC++] = ARGV[1] }
/Page [[:digit:]] of / { N = $4 }
{ key = $4" "$5 }
NR == FNR {
# only ever set a key, so a later non-matching line cannot clear it
if (N == 2 && /^1 5 0 ([[:digit:]]+ ){2}tr$/) pat[key] = 1
next
}
N < 3 && /^([[:digit:]]+ ){5}/ && pat[key] { next }
{ print }

The awk one's a little shorter, but there's not a huge difference,
really. The main difference, I think, is that, as James pointed out, in
awk you can't refer to matched patterns, so I had to explicitly hard-code
the field numbers to create "key" and "N", which was fine in this case
but could present problems in other situations.
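
POSIX awk's match() function does get part of the way there: it sets RSTART and RLENGTH, so the matched text can be cut out with substr() instead of relying on field numbers. A minimal sketch, with coords_of as a made-up helper name and the Pattern (2) layout from the original post assumed:

```shell
# coords_of: hypothetical helper. Reads lines on stdin and prints the
# "x y" pair of every Pattern (2) line, located with match() rather
# than with hard-coded field numbers.
coords_of() {
    awk '
        /^1 5 0 / && match($0, /[1-9][0-9]* [1-9][0-9]* tr$/) {
            # RSTART and RLENGTH delimit the matched text;
            # the last 3 characters are the trailing " tr".
            print substr($0, RSTART, RLENGTH - 3)
        }
    '
}
```

For example, printf '1 5 0 161 324 tr\n' | coords_of should print 161 324. gawk additionally accepts an array as a third argument to match() to capture parenthesized groups, but that is a gawk extension, not POSIX.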

Obviously, if your awk doesn't support RE intervals (supported by gawk
--re-interval, or any POSIX awk), then you need to write the full
"[[:digit:]]+ [[:digit:]]+ [[:digit:]]+ [[:digit:]]+ [[:digit:]]+ "
instead of just "([[:digit:]]+ ){5}", etc.
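
If it is unclear whether a given awk handles intervals, a quick probe like the one below can tell. awk_has_intervals is a hypothetical helper name; it takes the name of the awk command to test as an optional argument and defaults to awk.

```shell
# awk_has_intervals [AWK]: hypothetical probe. Prints "yes" if the given
# awk (default: awk) treats {3} as an interval expression, "no" otherwise.
awk_has_intervals() {
    # With interval support, /^a{3}$/ matches the line "aaa".
    # Without it, the braces are taken literally and nothing matches.
    if printf 'aaa\n' | "${1:-awk}" '/^a{3}$/ { found = 1 } END { exit !found }'; then
        echo yes
    else
        echo no
    fi
}
```

If the answer is "no", the spelled-out form with the repeated "[[:digit:]]+ " groups is the safer choice.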

Regards,

Ed.

> James
>
> Ed Morton wrote:
>
>
>

James

2006-01-23, 6:13 pm


Ed Morton wrote:
> James wrote:
>
> If I understand it correctly, \d represents any digit, but the OP only
> wants numbers that start with 1-9, not with zero.
>
>
> I assume the above gets you back to the start of the file, so you're
> about to start the second phase of parsing.

yes
>
>
> Do you need to reset $N to zero before starting this second loop so it
> doesn't retain its value from the first pass for the header text
> preceding "Page 1"?
>
>
>
>
> Yes, it would. Thanks for posting that (but please don't top-post in
> future).
>
> In case anyone else finds it interesting, the equivalent design written
> in awk (given the OPs posted input format) would be:
>
> BEGIN{ ARGV[ARGC++] = ARGV[1] }
> /Page [[:digit:]] of / { N = $4 }
> { key = $4" "$5 }
> NR == FNR {
> pat[key] = (N == 2 && /^1 5 0 ([[:digit:]]+ ){2}tr$/ ? 1 : 0)
> next
> }
> N < 3 && /^([[:digit:]]+ ){5}/ && pat[key] { next }
> { print }
>
> The awk one's a little shorter, but there's not a huge difference,
> really. The main difference, I think, is that, as James pointed out, in
> awk you can't refer to matched patterns, so I had to explicitly hard-code
> the field numbers to create "key" and "N", which was fine in this case
> but could present problems in other situations.
>
> Obviously, if your awk doesn't support RE intervals (supported by gawk
> --re-interval, or any POSIX awk), then you need to write the full
> "[[:digit:]]+ [[:digit:]]+ [[:digit:]]+ [[:digit:]]+ [[:digit:]]+ "
> instead of just "([[:digit:]]+ ){5}", etc.
>
> Regards,
>
> Ed.
>

Fixes and variations:

open F,$ARGV[0];
undef %pat;
while (<F> ) {
$N = $1 if /Page ([1-9]\d*) of /;
$pat{$1} = 1 if $N == 2 && /^1 5 0 ((\d+ ){2})tr$/;
}

$ARGV[++$#ARGV] = $ARGV[0];
open G,$ARGV[$#ARGV];
undef $N;
while (<G> ) {
$N = $1 if /Page ([1-9]\d*) of /;
next if $N < 3 && /^\d+ \d+ \d+ ((\d+ ){2})/ && $pat{$1};
print;
}


James

zcui@yahoo.com

2006-01-29, 9:31 pm

Ed,

Following is the PERL program I have.

First, find the pattern (in page 1 of the ps file), save the string of
coordinates, and remove the lines with the pattern.
Second, remove the lines (in page 2 of the ps file) that match the
saved coordinates.
Third, find all lines of "4C 0 1 10.*st" or "4D 0 1 10.*st" and the
17 lines preceding each (these appear from page 4 of the ps file on).

The program did remove all the lines I wanted, although the coding is
ugly. However, I found we cannot use the final report from this
modified PS file. The reason is that all the blocks of lines removed in
the third step belong to a table. After removing an unknown number of
blocks, there is blank space here and there in the table. This means I
must adjust the coordinates of each symbol or line, which is too much
hard work.

Finally, I gave up on this approach and did the job by having a shell
script generate the PS file directly.

Anyway, here is the code:

#!/usr/bin/perl

$reportdir = ".";
$rptfile = $ARGV[0];
print "$reportdir/$rptfile\n";

if ((-e "$reportdir/$rptfile") && (-s "$reportdir/$rptfile")) {

chdir($reportdir); # system("cd ...") only changes directory in a child shell

open(FILE, $rptfile);
open(TMP, ">tmp");

print "Start -> $rptfile\n";

$found = "false";
$x = 0;
$y = 0;
$xy = "";


while(<FILE> ) {

$line = $_;

#print "Start -> $line\n";

if (($line =~ /1 5 45 562 526 sq/) || ($line =~ /1 5 0 537 526 tr/)) {
print TMP "$line";
} elsif (($line =~ /^1 5 45.*sq/) || ($line =~ /1 5 0.*tr/)){
#1 5 0 359 500 tr
@parts = split(" ", $line);
$x = $parts[3];
$y = $parts[4];

if ( $xy eq "" ) {
$xy = "$parts[3]" . " " . "$parts[4]";
} else {
$xy = "$xy" . "," . "$parts[3]" . " " . "$parts[4]";
}
} elsif ($line =~ /\(.*\) 2 1 14 398 178 st/) {
print TMP "\(0\) 2 1 14 398 178 st\n";
} elsif ($line =~ /\(.*\) 2 1 14 338 178 st/) {
print TMP "\(0\) 2 1 14 338 178 st\n";
} else {
print TMP "$line";
}
}
close(FILE);

@coors = split(",", $xy);
foreach $item(@coors) {
open(TMP, "tmp");
open(TMP2, ">tmp2");
while(<TMP> ) {
$line = $_;
if ( $line =~ /.*$item.*/ ) {
print "$line";
} else {
print TMP2 "$line";
}
}
close(TMP);
close(TMP2);
#system("mv", $TMP2, $TMP);
rename("tmp2", "tmp");
}


open(FILE, "tmp");
open(TMP, ">tmp2");

# set it after Page 4
@bols = (0); # First line is always zero
$startpush = 0; # Start check the 4C and 4D
$status = 0; # Start remove the block for 4C and 4D

while (<FILE> ) {
$line = $_;

# write lines to the file
if ($startpush == 0) {
print TMP "$line";
}

if ($line =~ /%%Page: 4 4/) {
$startpush = 1;
}

#Found a 4C/D mark.\n";
if ($line =~ /\(4[CD]\) 0 1 10.*st/) {
$status = 1;
}

if ( $status == 1) {
for ($i=0; $i<18; $i++) {
$pp = pop @bols;

if ($pp =~ /TR10 setfont.*/) {
print TMP "TR10 setfont\n";
}
}

$status = 0;

# Write the left lines to file
if ($#bols > 0) {
foreach $bol (@bols) {
if ($bol != 0) {
seek(FILE, $bol, 0) || die "seek: $!";
#print scalar <FILE>;
$curline = scalar <FILE>;
if ($curline =~ /TR10 setfont/ ) {
print TMP "TR10 setfont\n";
} else {
print TMP $curline
}
}
}

#empty the stack
@bols = (0);
}
}

if ($startpush == 1) {
push(@bols, tell(FILE)); # Beginning of *next* line
}

}

# Write the left lines to file
if ($#bols > 1) {
foreach $bol (@bols) {
seek(FILE, $bol, 0) || die "seek: $!";
print TMP scalar <FILE>;
}

#empty the stack
@bols = (0);
}
close(TMP);
close(FILE);
}

rename("tmp2", "tmp");

#rename("tmp", "$rptfile");


Copyright 2003 - 2008 webservertalk.com