Unix Shell - Awk: remove duplicate lines


John Cordes

2005-02-25, 5:59 pm

I have written an awk script which does some parsing of a web
statistics data file. Here it is:

BEGIN { FS = "\n"; RS = "\n"; OFS = " "; ORS = "" }
/Number/ { z = split($1, fullname, " "); printf("%-50s", fullname[8]) }
/Yesterday/ { x = split($1, name2, " "); printf("%5s\n", name2[x]) }

Sorry, I'm sure it's pretty horrible - I'm very bad at scripting!

It works fine, producing output like this:
/Science/NSIS/ 32
/Science/NSIS/index.html 18
/Science/NSIS/ 32
/Science/NSIS/Home.html 9

(lines deleted)
I would like to remove the duplicate (3rd) line. But I don't want to
sort|uniq, since I want to preserve the existing sort order. The
following awk construct does the job:
{
{ line[$0] ++; if (line[$0] < 2) print }
}
I could run this last bit as an additional pipe after the first awk
script has done its job, but I thought it should be possible to
incorporate the second piece of awk code into the main script. This is
what I'm stuck on. It seems to me that I'd have to get awk back to
reading the first line again - is awk strictly single pass? Or is there
some other way to accomplish this?

Thanks,
John

Ed Morton

2005-02-25, 5:59 pm



John Cordes wrote:
> I have written an awk script which does some parsing of a web
> statistics data file. Here it is
>
> BEGIN { FS = "\n"; RS = "\n"; OFS = " "; ORS = "" }


By setting FS and RS to the same value, every record will just have 1
field. You only ever test for that one field, $1. You could instead
operate on $0 since, due to your RS and FS settings, they'll be
identical. Also, the default for RS is "\n" so at the end of the day you
could get rid of your FS and RS settings and just operate on $0 instead.

The default for OFS is " " so you don't need to set that.

You use printf for specifically formatting your output, so setting
ORS="" has no effect.

In other words you can delete this whole BEGIN section and just operate
on $0 instead.
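
Both points can be checked quickly. This is only a minimal sketch: the input line and the function name demo_begin are made up for illustration and are not taken from the statistics file.

```shell
# Hypothetical one-line input; demo_begin is just an illustrative name.
demo_begin() {
  printf 'Number of hits today\n' |
  awk 'BEGIN { FS = "\n"; RS = "\n"; ORS = "" }
       { same = ($1 == $0) ? "yes" : "no"   # one field per record, so $1 equals $0
         printf("%s|%s\n", same, NF) }'      # printf ignores ORS; the "\n" is explicit
}
demo_begin   # prints: yes|1
```

The record has exactly one field, and the output still ends in a newline even though ORS is empty, because printf writes only what its format string says.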

> /Number/ { z = split($1, fullname, " "); printf("%-50s", fullname[8]) }
> /Yesterday/ { x = split($1, name2, " "); printf("%5s\n", name2[x]) }


As I mentioned above, you COULD just substitute $0 for $1 and that code
would work as-is, but instead of calling "split" to carve up the record
into space-separated fields, just set the FS appropriately to a single
space. If you just set it to " ", that has the special meaning of
splitting on runs of spaces and tabs, so you need to set it to "[ ]"
instead.

So, your code to this point could be simplified to:

BEGIN { FS = "[ ]" }
/Number/ { printf("%-50s", $8) }
/Yesterday/ { printf("%5s\n", $NF) }
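
The difference between the two FS settings shows up on a line with two consecutive spaces. The sample line and the name fs_demo below are invented for illustration; whether the real data file contains runs of spaces is not stated in the thread.

```shell
# Hypothetical input "a  b" has two spaces between the letters.
fs_demo() {
  printf 'a  b\n' | awk '{ print NF }'                       # default FS: prints 2
  printf 'a  b\n' | awk 'BEGIN { FS = "[ ]" } { print NF }'  # every space splits: prints 3
}
fs_demo
```

If the real input does contain runs of spaces, $8 under FS = "[ ]" may not be the same field as fullname[8] from split($1, fullname, " "), so it is worth checking the field number against actual data.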

>
> Sorry, I'm sure it's pretty horrible - I'm very bad at scripting!
>
> It works fine, producing output like this:
> /Science/NSIS/ 32
> /Science/NSIS/index.html 18
> /Science/NSIS/ 32
> /Science/NSIS/Home.html 9
>
> (lines deleted)
> I would like to remove the duplicate (3rd) line.


Is it that the whole line has to be duplicated to warrant removal or
just the first field? Assuming it's the first field, just modify your
code to test for it already having been processed before printing:

BEGIN { FS = "[ ]" }
/Number/ { nr = $8 }
/Yesterday/ && !(nums[nr]++) { printf("%-50s%5s\n", nr, $NF) }
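
The test !(nums[nr]++) works because the post-increment returns the old count: zero the first time a name is seen, so the negation is true and the line prints, and non-zero on every later occurrence. A minimal sketch follows, assuming an invented input layout (the real statistics file is not shown in the thread) and a plain "%s %s" format to keep the output short; first_field_demo is a hypothetical name.

```shell
# Hypothetical input: the path is field 8 of each "Number" line and the
# count is the last field of the following "Yesterday" line.
first_field_demo() {
  printf '%s\n' \
    'Number of hits for the page named /Science/NSIS/ today' \
    'Yesterday 32' \
    'Number of hits for the page named /Science/NSIS/index.html today' \
    'Yesterday 18' \
    'Number of hits for the page named /Science/NSIS/ today' \
    'Yesterday 40' |
  awk 'BEGIN { FS = "[ ]" }
       /Number/ { nr = $8 }                       # remember the path
       /Yesterday/ && !(nums[nr]++) {             # true only on first sight of nr
           printf("%s %s\n", nr, $NF) }'
}
first_field_demo
```

With this input the sketch prints the /Science/NSIS/ line with 32 and the index.html line with 18. The third record is dropped even though its count (40) differs, because only the path is used as the key.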

Regards,

Ed.

> But I don't want to
> sort|uniq, since I want to preserve the existing sort order. The
> following awk construct does the job:
> {
> { line[$0] ++; if (line[$0] < 2) print }
> }
> I could run this last bit as an additional pipe after the first awk
> script has done its job, but I thought it should be possible to
> incorporate the second piece of awk code into the main script. This is
> what I'm stuck on. It seems to me that I'd have to get awk back to
> reading the first line again - is awk strictly single pass? Or is there
> some other way to accomplish this?
>
> Thanks,
> John

John Cordes

2005-02-25, 5:59 pm

Ed Morton wrote:

<skip>

> Is it that the whole line has to be duplicated to warrant removal or
> just the first field? Assuming it's the first field, just modify your
> code to test for it already having been processed before printing:
>
> BEGIN { FS = "[ ]" }
> /Number/ { nr = $8 }
> /Yesterday/ && !(nums[nr]++) { printf("%-50s%5s\n", nr, $NF) }


Ed,

Thanks so much for your greatly improved, simplified awk script -
works great! But as to the duplicate line question, no, it's the whole
line that has to be duplicated before I omit printing. My son took pity
on me and suggested storing what was to be printed into an associative
array, and test before printing to see if it had come up before. I
_think_ that's basically what you're up to in the code above...
Unfortunately it is not obvious to me how to modify to handle the whole
line situation that I need. I will try to do some more work on it, but
if there's a relatively straightforward fix I'd love to hear about it.

Thanks again,
John

Ed Morton

2005-02-25, 5:59 pm



John Cordes wrote:
> Ed Morton wrote:
>
> <skip>
>
>
>
> Ed,
>
> Thanks so much for your greatly improved, simplified awk script - works
> great! But as to the duplicate line question, no, it's the whole line
> that has to be duplicated before I omit printing. My son took pity on me
> and suggested storing what was to be printed into an associative array,
> and test before printing to see if it had come up before. I _think_
> that's basically what you're up to in the code above... Unfortunately it
> is not obvious to me how to modify to handle the whole line situation
> that I need. I will try to do some more work on it, but if there's a
> relatively straightforward fix I'd love to hear about it.


You just need to include both output fields in the array index, e.g.

BEGIN { FS = "[ ]" }
/Number/ { nr = $8 }
/Yesterday/ && !(nums[nr,$NF]++) { printf("%-50s%5s\n", nr, $NF) }
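
The subscript nums[nr,$NF] is awk's multi-dimensional form: the two values are joined with SUBSEP into a single string key, so a record counts as a duplicate only when both the name and the count repeat. A minimal sketch, again assuming an invented input layout and a plain "%s %s" format; whole_line_demo is a hypothetical name.

```shell
# Hypothetical input: record 3 repeats record 1 exactly; record 4 has the
# same path as record 1 but a different count.
whole_line_demo() {
  printf '%s\n' \
    'Number of hits for the page named /Science/NSIS/ today' \
    'Yesterday 32' \
    'Number of hits for the page named /Science/NSIS/index.html today' \
    'Yesterday 18' \
    'Number of hits for the page named /Science/NSIS/ today' \
    'Yesterday 32' \
    'Number of hits for the page named /Science/NSIS/ today' \
    'Yesterday 40' |
  awk 'BEGIN { FS = "[ ]" }
       /Number/ { nr = $8 }
       /Yesterday/ && !(nums[nr,$NF]++) {         # key is path and count together
           printf("%s %s\n", nr, $NF) }'
}
whole_line_demo
```

Here the exact repeat (path with 32) is suppressed, while the record with the same path and a count of 40 is kept. The original order is preserved because each record is decided as it is read, in a single pass.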

Regards,

Ed.

John Cordes

2005-02-25, 5:59 pm

Ed Morton wrote:
> You just need to include both output fields in the array index, e.g.
>
> BEGIN { FS = "[ ]" }
> /Number/ { nr = $8 }
> /Yesterday/ && !(nums[nr,$NF]++) { printf("%-50s%5s\n", nr, $NF) }
>
> Regards,
>
> Ed.


Ah - got it! Beautifully simple (well, sort of) once shown how.

Again,
thank you very much,
John

Copyright 2003 - 2008 webservertalk.com