awk comparing two files problem
| rachit7@gmail.com 2007-08-29, 7:26 pm |
| Hello,
I got this code from one of the groups. Can somebody explain to me
exactly what is happening here, and how? Please provide a
detailed explanation. I am not good at awk when more than one file is
involved.
------------------------------------------------------------------------------------------------------
If you want all the lines in file1 that are not in file2,
awk 'NR == FNR { a[$0]; next } !($0 in a)' file2 file1
If you want only the unique lines in file1 that are not in file2,
awk 'NR == FNR { a[$0]; next } !($0 in a) { print; a[$0] }' file2 file1
---------------------------------------------------------------------------------------------------------
-RB
| |
| Michael Tosch 2007-08-29, 7:26 pm |
| rachit7@gmail.com wrote:
> Hello,
>
> I got this code from one of the groups. Can somebody explain to me
> exactly what is happening here, and how? Please provide a
> detailed explanation. I am not good at awk when more than one file is
> involved.
> ------------------------------------------------------------------------------------------------------
> If you want all the lines in file1 that are not in file2,
> awk 'NR == FNR { a[$0]; next } !($0 in a)' file2 file1
$0 is the current line.
a is an associative (content-addressed) array.
NR==FNR is true only while the 1st file is being read; there a[$0] is
defined, and the "next" skips the rest of the program for that line.
Otherwise (2nd file) the !($0 in a) is executed.
I am not familiar with it. Maybe it must loop through
the entire array, which is slow.
I would have written this as
awk 'NR==FNR { a[$0]=1; next } a[$0]!=1' file2 file1
i.e. for NR==FNR define a[$0] with the value 1,
otherwise print $0 if a[$0] is not 1.
a[$0]!=1
is short for
a[$0]!=1 {print}
is short for
a[$0]!=1 {print $0}
is short for
{if (a[$0]!=1) {print $0}}
Or use a simple fgrep:
fgrep -xvf file2 file1
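A minimal sketch of the two approaches above, run against made-up sample data. The scratch directory, the file contents, and the variable names awk_out and grep_out are assumptions for illustration only. grep -F is used as the standard spelling of fgrep.

```shell
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry banana date > "$dir/file1"
printf '%s\n' banana elderberry > "$dir/file2"

# While reading file2 (NR == FNR): store each line as a key of a.
# While reading file1: print the lines that are not keys of a.
awk_out=$(awk 'NR == FNR { a[$0]; next } !($0 in a)' "$dir/file2" "$dir/file1")

# The same filter with grep: -F fixed strings, -x whole-line match,
# -v invert the match, -f read the patterns from file2.
grep_out=$(grep -F -x -v -f "$dir/file2" "$dir/file1")

printf 'awk:\n%s\n' "$awk_out"
printf 'grep:\n%s\n' "$grep_out"
rm -rf "$dir"
```

With this data both commands should print apple, cherry and date, one per line.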
>
> If you want only uniq lines in file1 that are not in file2,
> awk 'NR == FNR { a[$0]; next } !($0 in a) { print; a[$0] }' file2 file1
Previously the pattern had no action, which is short for {print}.
This time the action prints the line and then defines a[$0], so the
same line will not be printed again.
I would have written this as
awk 'NR==FNR { a[$0]=1; next } a[$0]++==0' file2 file1
It uses the ++ post-increment, which is applied after the comparison.
Repeated ++ raise the a[$0] value to 1, 2, 3, ...,
so I compare with 0, using the fact
that awk treats an uninitialized value as 0 when it is compared
with a number.
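A small sketch comparing the two "unique lines" variants on made-up data. The file contents and the variable names orig_out and incr_out are assumptions for illustration.

```shell
# Hypothetical sample data: "apple" repeats in file1, "banana" is in file2.
dir=$(mktemp -d)
printf '%s\n' apple banana apple cherry apple > "$dir/file1"
printf '%s\n' banana > "$dir/file2"

# Variant from the original question: print a line of file1 only if it is
# not yet a key of a, then add it as a key so repeats are skipped.
orig_out=$(awk 'NR == FNR { a[$0]; next } !($0 in a) { print; a[$0] }' \
    "$dir/file2" "$dir/file1")

# Post-increment variant: lines of file2 start at 1, unseen lines start
# at 0 (uninitialized), and every test bumps the counter afterwards.
incr_out=$(awk 'NR==FNR { a[$0]=1; next } a[$0]++==0' "$dir/file2" "$dir/file1")

printf 'in/print:\n%s\n' "$orig_out"
printf 'post-increment:\n%s\n' "$incr_out"
rm -rf "$dir"
```

With this data both variants should print apple and cherry once each.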
--
Michael Tosch @ hp : com
| |
| Barry Margolin 2007-08-30, 1:19 am |
| In article <fb4o3c$qf$1@aken.eed.ericsson.se>,
Michael Tosch <eedmit@NO.eed.SPAM.ericsson.PLS.se> wrote:
> rachit7@gmail.com wrote:
>
> $0 is the current line.
> a is an associative (content-addressed) array.
> NR==FNR is true for the 1st file, a[$0] is defined, and the
> "next" skips the rest.
More specifically, NR is a line number that keeps incrementing across
all the input files, while FNR is the line number within the current
file. While processing the first file, they're obviously going to be
the same. But when you start processing the second file, FNR starts
over at 1, while NR keeps growing.
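To watch the two counters, a quick sketch that prints them for every line of two made-up files. The file names first and second and the variable name counters are assumptions for illustration.

```shell
# Two hypothetical input files with 2 and 3 lines.
dir=$(mktemp -d)
printf '%s\n' a b > "$dir/first"
printf '%s\n' c d e > "$dir/second"

# Print the file name and both counters for every input line.
# The cd keeps FILENAME short in the output.
counters=$(cd "$dir" && awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' first second)

printf '%s\n' "$counters"
rm -rf "$dir"
```

NR runs from 1 to 5 across both files, while FNR restarts at 1 on the first line of second, which is the only point where NR == FNR stops being true.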
>
> Otherwise (2nd file) the !($0 in a) is executed.
> I am not familiar with it. Maybe it must loop through
> the entire array, which is slow.
"<val> in <array>" tests whether <val> is a key of the associative
array. It doesn't have to loop through the array any more than a[$0]!=1
does. And it doesn't have to check whether the value of the array
element is specifically equal to 1, like your version does.
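One further difference between the two tests can be sketched as follows: "in" only looks a key up, whereas merely referencing a[key] creates that element if it did not exist. The keys "x", "y", "z" and the names t, k, n1, n2, result are made up for this illustration.

```shell
result=$(awk 'BEGIN {
    a["x"] = 1

    t = ("y" in a)               # membership test only; a is unchanged
    n1 = 0; for (k in a) n1++    # count the keys of a

    t = (a["z"] != 1)            # the reference creates a["z"] (empty)
    n2 = 0; for (k in a) n2++    # count the keys again

    print n1, n2
}')
printf '%s\n' "$result"
```

The key count goes from 1 to 2 after the a["z"] reference, so the a[$0]!=1 version ends up storing every unmatched line of the second file in the array as well.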
>
> I would have written this as
>
> awk 'NR==FNR { a[$0]=1; next } a[$0]!=1' file2 file1
>
> i.e. for NR==FNR define a[$0] with the value 1,
> otherwise print $0 if a[$0] is not 1.
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
| |
| Ed Morton 2007-08-30, 1:19 am |
| Barry Margolin wrote:
> In article <fb4o3c$qf$1@aken.eed.ericsson.se>,
> Michael Tosch <eedmit@NO.eed.SPAM.ericsson.PLS.se> wrote:
>
>
>
>
> More specifically, NR is a line number that keeps incrementing across
> all the input files, while FNR is the line number within the current
> file. While processing the first file, they're obviously going to be
> the same. But when you start processing the second file, FNR starts
> over at 1, while NR keeps growing.
Right, so just watch out for the case where the first file can be empty.
If that can occur then instead of testing for "NR==FNR" to identify the
first file, you should use "FILENAME==ARGV[1]" (assuming ARGV[1] is the
first file and not just setting a variable).
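A small sketch of the empty-first-file case, using made-up data. The file contents and the variable names nr_out and fn_out are assumptions for illustration.

```shell
dir=$(mktemp -d)
: > "$dir/file2"                          # empty first file
printf '%s\n' apple banana > "$dir/file1"

# file2 contributes no records, so NR == FNR stays true for every line
# of file1: the first block swallows them all and nothing is printed.
nr_out=$(awk 'NR == FNR { a[$0]; next } !($0 in a)' "$dir/file2" "$dir/file1")

# Identifying the first file by name avoids that.
fn_out=$(awk 'FILENAME == ARGV[1] { a[$0]; next } !($0 in a)' \
    "$dir/file2" "$dir/file1")

printf 'NR==FNR output: [%s]\n' "$nr_out"
printf 'FILENAME output:\n%s\n' "$fn_out"
rm -rf "$dir"
```

Here the NR == FNR version prints nothing, while the FILENAME version prints both lines of file1, which is the expected result when file2 has nothing to exclude.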
Ed.
| |
| Michael Tosch 2007-08-30, 7:20 am |
| Ed Morton wrote:
> Barry Margolin wrote:
>
> Right, so just watch out for the case where the first file can be empty.
> If that can occur then instead of testing for "NR==FNR" to identify the
> first file, you should use "FILENAME==ARGV[1]" (assuming ARGV[1] is the
> first file and not just setting a variable).
>
> Ed.
Isn't this a bug?
Shouldn't awk increment FNR each time it opens a file?
--
Michael Tosch @ hp : com
| |
| Ed Morton 2007-08-30, 7:20 am |
| Michael Tosch wrote:
> Ed Morton wrote:
>
>
>
>
> Isn't this a bug?
> Shouldn't awk increment FNR each time it opens a file?
>
>
No. FNR is the "File Number of Records" so if a file contains zero
records, neither FNR nor NR should change.
Ed.
| |
| rachit7@gmail.com 2007-08-30, 1:20 pm |
| On Aug 30, 7:48 am, Ed Morton <mor...@lsupcaemnt.com> wrote:
> No. FNR is the "File Number of Records" so if a file contains zero
> records, neither FNR nor NR should change.
>
> Ed.
Great!! Thanks guys. That really helped.
-RB