Unix Programming - Finding common lines between text files

Henrik Goldman

2006-07-02, 7:24 pm

Hi,

If I have two text files a.txt and b.txt containing:

a.txt:
a
b
c
d
e

b.txt:
c
d
a

What tool would I use to get the common lines between these two files?

Thanks in advance.

-- Henrik


Måns Rullgård

2006-07-02, 7:24 pm

"Henrik Goldman" <henrik_goldman@mail.tele.dk> writes:

> Hi,
>
> If I have two text files a.txt and b.txt containing:


[...]

> What tool would I use to get the common lines between these two files?


man comm
man sort

--
Måns Rullgård
mru@inprovide.com
Henrik Goldman

2006-07-02, 7:24 pm

> man comm
> man sort


I've been using comm -3 until now. But I've had cases where tabs were
inserted in the resulting file which ended up with 2 lines with same content
but with one line being tab-prefixed.
This is not what I wanted.

I'll try sort and see if it helps.

Any examples would be appreciated though.

-- Henrik


Chris F.A. Johnson

2006-07-02, 7:24 pm

On 2006-07-02, Henrik Goldman wrote:
> Hi,
>
> If I have two text files a.txt and b.txt containing:
>
> a.txt:
> a
> b
> c
> d
> e
>
> b.txt:
> c
> d
> a
>
> What tool would I use to get the common lines between these two files?


grep -f b.txt a.txt

If b.txt was sorted:

comm -12 a.txt b.txt
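
A hedged note on the two commands above, as a sketch rather than a correction of the post: plain "grep -f" treats every line of b.txt as a regular expression and matches it anywhere in a line, so a pattern "a" would also select a line "abc". Adding -F (fixed strings) and -x (whole-line match) restricts the comparison to exact lines. comm, for its part, expects both inputs to be sorted, not only b.txt. The function names and the temporary files below are illustrative assumptions.

```shell
# common_lines FILE1 FILE2
# Print the lines of FILE1 that also occur, exactly, in FILE2.
common_lines() {
    # -F: fixed strings, -x: match whole lines only, -f: read patterns from FILE2
    grep -Fxf "$2" "$1"
}

# common_lines_sorted FILE1 FILE2
# Same idea with comm, which needs both inputs sorted.
common_lines_sorted() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    sort "$1" > "$tmp1"
    sort "$2" > "$tmp2"
    comm -12 "$tmp1" "$tmp2"   # suppress columns 1 and 2, keep the common lines
    rm -f "$tmp1" "$tmp2"
}
```

The grep version keeps the order of the first file; the comm version prints the common lines in sorted order.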

--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
Henrik Goldman

2006-07-02, 7:24 pm

>> If I have two text files a.txt and b.txt containing:

> grep -f b.txt a.txt
>
> If b.txt was sorted:
>
> comm -12 a.txt b.txt
>


Oops... in my confusion I actually asked the wrong question.
What I really want is to remove the duplicate content between the two files
and show all the differences.

Sorry about that. The solutions you posted work but give me the opposite
of what I want.

Thanks.

-- Henrik


Chris F.A. Johnson

2006-07-02, 7:24 pm

On 2006-07-02, Henrik Goldman wrote:
>
>
> Oops... in my confusion I actually asked the wrong question.
> What I really want is to remove the duplicate content between the two files
> and show all the differences.
>
> Sorry about that. The solutions you posted work but give me the opposite
> of what I want.


Then ask grep for the opposite:

grep -vf b.txt a.txt
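
Worth noting, as a hedged aside: "grep -v" answers only half of the question. It lists the lines of a.txt that are missing from b.txt and says nothing about lines found only in b.txt. A minimal sketch that runs the check in both directions, with -F and -x so that patterns are compared as whole literal lines; both function names are hypothetical.

```shell
# only_in_first FILE1 FILE2
# Print the lines of FILE1 that do not occur, exactly, in FILE2.
only_in_first() {
    # grep exits with status 1 when it selects no lines; treat that as success
    grep -Fxvf "$2" "$1" || [ $? -eq 1 ]
}

# all_differences FILE1 FILE2
# Lines only in FILE1, followed by lines only in FILE2.
all_differences() {
    only_in_first "$1" "$2"
    only_in_first "$2" "$1"
}
```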

--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
Måns Rullgård

2006-07-02, 7:24 pm

"Chris F.A. Johnson" <cfajohnson@gmail.com> writes:

> On 2006-07-02, Henrik Goldman wrote:
>
> Then ask grep for the opposite:
>
> grep -vf b.txt a.txt


What happens if b.txt has very many lines?

--
Måns Rullgård
mru@inprovide.com
Bit Twister

2006-07-02, 7:24 pm

On Sun, 02 Jul 2006 21:48:18 +0100, Måns Rullgård wrote:
>
> What happens if b.txt has very many lines?


Takes longer to run, can cause less output,...
Chris F.A. Johnson

2006-07-02, 7:24 pm

On 2006-07-02, Måns Rullgård wrote:
> "Chris F.A. Johnson" <cfajohnson@gmail.com> writes:
>
>
> What happens if b.txt has very many lines?


It will take a long time.


sort a.txt b.txt | uniq -u
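
One caveat with this pipeline, offered as a hedged sketch: "uniq -u" keeps only lines that occur exactly once in the combined input, so a line repeated inside a single file is dropped even when the other file never contains it. De-duplicating each file before combining them avoids that. The name symmetric_diff is made up for illustration.

```shell
# symmetric_diff FILE1 FILE2
# Print the lines that occur in exactly one of the two files.
symmetric_diff() {
    # sort -u removes duplicates within each file first, so that after the
    # second sort a line appears twice only if it is present in both files
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -u
}
```

For example, if a.txt contains a line "x" twice and b.txt does not contain it, the plain "sort a.txt b.txt | uniq -u" omits "x", while the function above prints it.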


--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
Henrik Goldman

2006-07-02, 7:24 pm

Another related problem is that my input data could have doubles.

For instance:

One text file could contain:
a
b
d
a
n
b
c

In this case a and b are represented multiple times. However it seems that
'uniq' util (on AIX 5.1) is not filtering them out. Perhaps I need to sort
the data before running with uniq?

Thanks in advance.
-- Henrik


Logan Shaw

2006-07-02, 7:24 pm

Henrik Goldman wrote:
>
> I've been using comm -3 until now. But I've had cases where tabs were
> inserted in the resulting file which ended up with 2 lines with same content
> but with one line being tab-prefixed.


Then they are not common lines; they are lines with common substrings.
That is a different problem.

You might try something like this:

comm -3 \
<( sed -e 's/^[[:blank:]]*//' a.txt | sort ) \
<( sed -e 's/^[[:blank:]]*//' b.txt | sort )

Note that [[:blank:]] matches a space or a tab, and the * removes the
whole run of leading whitespace. Also
note that the "<( cmd )" is a shell-specific construct. Works in
ksh and bash, IIRC, but not in straight sh. You can get the same
effect with temporary files, but if you can avoid temporary files,
that's cleaner and easier.
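
For what it is worth, the tab prefix is also how comm marks its columns: with -3, lines found only in the first file are printed at the left margin and lines found only in the second file are prefixed with a tab. If the inputs are not sorted, the same text can then show up in both columns. A hedged sketch that asks for one column at a time, so that no tab is emitted; the function names and temporary files are assumptions, and this version does not strip leading whitespace.

```shell
# comm_column FLAGS FILE1 FILE2
# Sort both files, then run comm with the given column-suppression flags.
comm_column() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    sort "$2" > "$tmp1"
    sort "$3" > "$tmp2"
    comm "$1" "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}

# Lines only in the first file (columns 2 and 3 suppressed).
only_in_a() { comm_column -23 "$1" "$2"; }

# Lines only in the second file (columns 1 and 3 suppressed).
only_in_b() { comm_column -13 "$1" "$2"; }
```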

- Logan
Chris F.A. Johnson

2006-07-02, 7:24 pm

On 2006-07-02, Henrik Goldman wrote:
> Another related problem is that my input data could have doubles.


Related to what? Please include context.

> For instance:
>
> One text file could contain:
> a
> b
> d
> a
> n
> b
> c
>
> In this case a and b are represented multiple times. However it seems that
> 'uniq' util (on AIX 5.1) is not filtering them out. Perhaps I need to sort
> the data before running with uniq?


Of course; that's what uniq does:

$ man uniq

NAME
uniq - remove duplicate lines from a sorted file


--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
Måns Rullgård

2006-07-02, 7:24 pm

"Chris F.A. Johnson" <cfajohnson@gmail.com> writes:

> On 2006-07-02, Måns Rullgård wrote:
>
> It will take a long time.
>
> sort a.txt b.txt | uniq -u


sort -u a.txt b.txt
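
Note that "sort -u" over both files yields the union: every distinct line from either file, printed once. That is a different result from "uniq -u", which drops every line that occurs more than once. A small sketch contrasting the union with the shared lines (via "uniq -d"); the function names are illustrative only.

```shell
# union_lines FILE1 FILE2
# Every distinct line from either file, once, in sorted order.
union_lines() {
    sort -u "$1" "$2"
}

# shared_lines FILE1 FILE2
# Lines present in both files: de-duplicate each file, then keep
# the lines that still appear more than once in the merged stream.
shared_lines() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}
```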

--
Måns Rullgård
mru@inprovide.com
Logan Shaw

2006-07-02, 7:24 pm

Henrik Goldman wrote:
> Another related problem is that my input data could have doubles.
>
> For instance:
>
> One text file could contain:
> a
> b
> d
> a
> n
> b
> c
>
> In this case a and b are represented multiple times. However it seems that
> 'uniq' util (on AIX 5.1) is not filtering them out. Perhaps I need to sort
> the data before running with uniq?


Yes, as someone else has said, uniq only removes consecutive duplicates,
so it's necessary to have a sorted input in order to remove all duplicates.

If your data isn't sorted, you could run sort before passing it to uniq,
but if you're going to use sort, you might as well just do "sort -u".

Or, if you wish to preserve the order and still remove duplicates, you
can do that with a Perl one-liner:

perl -lne 'print unless $seen{$_}++;'

For example:

$ for i in a b d a n b c
> do
> echo "$i"
> done | perl -lne 'print unless $seen{$_}++;'

a
b
d
n
c
$

- Logan
Chris F.A. Johnson

2006-07-02, 7:24 pm

On 2006-07-02, Logan Shaw wrote:
> Henrik Goldman wrote:
>
> Yes, as someone else has said, uniq only removes consecutive duplicates,
> so it's necessary to have a sorted input in order to remove all duplicates.
>
> If your data isn't sorted, you could run sort before passing it to uniq,
> but if you're going to use sort, you might as well just do "sort -u".
>
> Or, if you wish to preserve the order and still remove duplicates, you
> can do that with a Perl one-liner:
>
> perl -lne 'print unless $seen{$_}++;'


Or an awk one-liner:

x[$0]++ == 0 { print }


Which some people would foolishly shorten to:

!x[$0]++
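
Both forms are awk programs rather than complete commands. A sketch of how they might be invoked from the shell; the function names and the array name "seen" are hypothetical choices, not part of the original post.

```shell
# dedupe_keep_order [FILE...]
# Print each distinct line once, in order of first appearance.
dedupe_keep_order() {
    # seen[$0]++ evaluates to the old count, which is 0 on the first sighting
    awk 'seen[$0]++ == 0 { print }' "$@"
}

# dedupe_short [FILE...]
# The terse variant: a pattern with no action prints the matching record.
dedupe_short() {
    awk '!seen[$0]++' "$@"
}
```

Both read standard input when no file is given, e.g. `dedupe_keep_order < a.txt`.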

--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
Henrik Goldman

2006-07-03, 7:28 am

>> sort a.txt b.txt | uniq -u
>
> sort -u a.txt b.txt
>


Unfortunately no go!
The symbols which are common to both files keep getting displayed.

Since I did not have more patience to fiddle around with these commands I
ended up writing my own tool for doing this. It took less than 20 min in C++
which was much less than the time I spent on trying these commands:

#include <cstdio>    // printf, fopen, fgets
#include <cstring>   // strstr, strtok
#include <set>
#include <string>

using namespace std;

// CSmartFILEPtr is the author's own FILE * wrapper (see the note below);
// it is not defined in this post.

int main(int argc, char *argv[])
{
    CSmartFILEPtr fp;
    set<string> StringSet;
    set<string>::iterator it;
    char szLine[1024];
    char *p;

    if (argc != 3)
    {
        printf("One String Instance usage: oneinst [insertfile1] [erasefile2]\n");
        return 1;
    }

    if ((fp = fopen(argv[1], "r")) == NULL)
    {
        printf("Error: Unable to open %s\n", argv[1]);
        return 1;
    }

    // Read file 1: store the first whitespace-delimited token of each line
    while (fgets(szLine, sizeof(szLine)-1, fp) != NULL)
    {
        // Cut the line at the first newline or carriage return
        if ((p = strstr(szLine, "\n")) != NULL) *p = '\0';
        if ((p = strstr(szLine, "\r")) != NULL) *p = '\0';

        p = strtok(szLine, " \t");
        if (p == NULL) continue;   // blank line

        if (StringSet.find(p) == StringSet.end())
            StringSet.insert(p);
    }

    if ((fp = fopen(argv[2], "r")) == NULL)
    {
        printf("Error: Unable to open %s\n", argv[2]);
        return 1;
    }

    // Remove content from file 2: erase every token that also occurs there
    while (fgets(szLine, sizeof(szLine)-1, fp) != NULL)
    {
        if ((p = strstr(szLine, "\n")) != NULL) *p = '\0';
        if ((p = strstr(szLine, "\r")) != NULL) *p = '\0';

        p = strtok(szLine, " \t");
        if (p == NULL) continue;

        if (StringSet.find(p) != StringSet.end())
            StringSet.erase(p);
    }

    // What remains are the tokens found only in file 1, in sorted order
    for (it = StringSet.begin(); it != StringSet.end(); it++)
        printf("%s\n", it->c_str());

    return 0;
}

Note that CSmartFILEPtr is my own custom class. It can be replaced with
FILE * (and naturally you have to remember fclose() in the relevant places).
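
For comparison, a hedged shell approximation of what the program above does: it keeps the first space- or tab-delimited token of each line, collects the tokens of the first file, and removes those that also occur in the second. It therefore covers one direction only (tokens from the first file that the second lacks). The function below borrows the name oneinst from the usage message; the temporary files are an assumption, and unlike the C++ code it does not strip carriage returns.

```shell
# oneinst FILE1 FILE2
# Print, in byte order, the first tokens of FILE1's lines that are not
# first tokens of any line in FILE2.
oneinst() {
    t1=$(mktemp) || return 1
    t2=$(mktemp) || { rm -f "$t1"; return 1; }
    # NF skips blank lines; $1 is the first whitespace-delimited token
    awk 'NF { print $1 }' "$1" | LC_ALL=C sort -u > "$t1"
    awk 'NF { print $1 }' "$2" | LC_ALL=C sort -u > "$t2"
    # -23 keeps only the column of lines unique to the first input
    LC_ALL=C comm -23 "$t1" "$t2"
    rm -f "$t1" "$t2"
}
```

LC_ALL=C is used so that sort and comm agree on byte-wise ordering, which is also how set<string> orders its elements.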

-- Henrik


William James

2006-07-03, 1:26 pm

Henrik Goldman wrote:
>
> Unfortunately no go!
> The symbols which are common to both files keep getting displayed.
>
> Since I did not have more patience to fiddle around with these commands I
> ended up writing my own tool for doing this. It took less than 20 min in C++
> which was much less than the time I spent on trying these commands:
>
> [...]
>
> -- Henrik


In newLISP:

(for (i -2 -1)
(push (chop (parse (read-file (main-args i) ) "\n")) files))
(set 'uncommon (difference (last files) (first files)))
(set 'uncommon (append uncommon (difference (first files)
(last files))))
(dolist (line uncommon) (println line))
(exit)

File 1:

common1
a
b
a
c
common2
b
d
common3

File 2:

e
f
g
h
g
common3
common2
common1

Output:

a
b
c
d
e
f
g
h

spibou@gmail.com

2006-07-03, 7:22 pm


Chris F.A. Johnson wrote:

> On 2006-07-02, Logan Shaw wrote:


>
> Or an awk one-liner:
>
> x[$0]++ == 0 { print }
>
>
> Which some people would foolishly shorten to:
>
> !x[$0]++


Why is !x[$0]++ foolish? It seems to be working just fine.
Also, couldn't the first version be shortened to x[$0]++ == 0?

Spiros Bousbouras

Chris F.A. Johnson

2006-07-03, 7:22 pm

On 2006-07-03, spibou@gmail.com wrote:
>
> Chris F.A. Johnson wrote:
>
>
>
> Why is !x[$0]++ foolish? It seems to be working just fine.


It doesn't work in the original awk. Not that that usually counts
for much with me. The main reason appears below.

> Also, couldn't the first version be shortened to x[$0]++ == 0?


It could, but I prefer to write understandable and maintainable
code. "Golf" programming has only novelty value.


--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
spibou@gmail.com

2006-07-03, 7:22 pm


Chris F.A. Johnson wrote:

> On 2006-07-03, spibou@gmail.com wrote:
>
> It doesn't work in the original awk. Not that that usually counts
> for much with me. The main reason appears below.
>
>
> It could, but I prefer to write understandable and maintainable
> code. "Golf" programming has only novelty value.


For understandable code it would then make more sense to
write {print $0}. If someone knows that print without an argument
implies print $0, then the chances are that they will also know that
an empty rule implies print $0.

What is "golf" programming?

Spiros Bousbouras

Chris F.A. Johnson

2006-07-03, 7:22 pm

On 2006-07-03, spibou@gmail.com wrote:
>
> Chris F.A. Johnson wrote:
>
>
> For understandable code it would then make more sense to
> write {print $0}. If someone knows that print without an argument
> implies print $0, then the chances are that they will also know that
> an empty rule implies print $0.


You have a point.

> What is "golf" programming?


Writing a program using the fewest possible characters.

--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
Logan Shaw

2006-07-04, 1:29 am

spibou@gmail.com wrote:
> For understandable code it would then make more sense to
> write {print $0}. If someone knows that print without an argument
> implies print $0, then the chances are that they will also know that
> an empty rule implies print $0.


I've been using Unix for over 15 years and had no idea that an empty
action was even allowed. I did, however, know that "{ print }" prints
the whole line; that seems obvious enough once you've seen it. (What
else would it do?)

- Logan
William James

2006-07-04, 1:29 am

Chris F.A. Johnson wrote:
> On 2006-07-03, spibou@gmail.com wrote:

This is somewhat cryptic. If I understand the post-increment operator
correctly, the above is equivalent to

x[$0] == 0 { print }
{ x[$0] += 1}

The straightforward way to see whether a key is in an array is
to use the "in" operator.

!($0 in a) { a[$0] = 0 ; print }

Of course, merely accessing a missing key's value adds the key
to the array:

!($0 in a) { a[$0] ; print }

However, this version may be harder to understand.
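
As a runnable sketch of the "in" variant (the function name is made up for illustration, and input is read from the named files or from standard input):

```shell
# dedupe_with_in [FILE...]
# Print each distinct line once, testing membership with the "in" operator.
dedupe_with_in() {
    # Referencing a[$0] as a statement creates the key without assigning a value
    awk '!($0 in a) { a[$0]; print }' "$@"
}
```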

William James

2006-07-04, 1:29 am

Logan Shaw wrote:
> spibou@gmail.com wrote:
>
> I've been using Unix for over 15 years and had no idea that an empty
> action was even allowed.


You may have used Unix for 15 years, but you have not studied awk
for 15 minutes. Aside from function definitions, an awk program
consists of pairs like this:

<boolean expression> { <actions> }

If the boolean expression is omitted, the default is naturally 1
(TRUE). (It couldn't default to 0, now could it?)

# print the 2nd field of every record
{ print $2 }

If the actions are omitted, the default is naturally { print $0 }.
(What else would it be?)

# print every record that contains "foo bar"
/foo bar/

> I did, however, know that "{ print }" prints
> the whole line; that seems obvious enough once you've seen it. (What
> else would it do?)


Emit a newline, obviously. That's the way "puts" works in Ruby:

# print "foo" followed by a linefeed
puts 'foo'
# print a linefeed
puts

In newLISP:

# print "foo" followed by a linefeed
(println "foo")
# print a linefeed
(println)

But to do the same in awk you have to print an empty string:

print ""
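
The three cases above can be tried from the shell. A minimal sketch, with hypothetical function names, each reading standard input:

```shell
# No pattern: the action runs for every record.
second_field() {
    awk '{ print $2 }'
}

# No action: every record matching the pattern is printed whole.
matching_records() {
    awk '/foo bar/'
}

# print with no argument prints the record; print "" emits an empty line.
double_space() {
    awk '{ print; print "" }'
}
```

For instance, `printf 'x foo bar\ny baz qux\n' | matching_records` passes through only the first line.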

>
> - Logan


spibou@gmail.com

2006-07-04, 1:29 am


William James wrote:

> Logan Shaw wrote:
>
> If the actions are omitted, the default is naturally { print $0 }.
> (What else would it be?)
>
> # print every record that contains "foo bar"
> /foo bar/


Yeah, examples like this are pretty common in awk
tutorials. Here of course it's not just the rule which is
missing but also what it is you're comparing against
the regex.

By the way I didn't express myself accurately in my quote
above. I said "empty rule" when I should have said "absence
of a rule". An empty rule, i.e. {}, means no action.

>
> Emit a newline, obviously. That's the way "puts" works in Ruby:


And that's also the way print works in BASIC. I was rather
confused the first time I used print without an argument in
awk and it didn't print a newline. It took me a while before
I thought of trying print "".

Spiros Bousbouras

Logan Shaw

2006-07-04, 7:24 am

William James wrote:
> Logan Shaw wrote:
>
> You may have used Unix for 15 years, but you have not studied awk
> for 15 minutes.


On the contrary. I've studied it for maybe an hour. Not in depth,
but just enough to learn how to do stuff like

awk -F: '{ t += $2 } END { print t }'

or

awk '$2 ~ /[0-9]/ { print }'

or

awk '$1 > 0 && $1 < 100 { print }'
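
A sketch of these three commands wrapped as functions that read standard input, so they can be tried on sample data; the function names and the sample inputs in the usage note are assumptions for illustration.

```shell
# Sum the second colon-separated field of every record.
sum_second_field() {
    awk -F: '{ t += $2 } END { print t }'
}

# Print records whose second field contains a digit.
second_has_digit() {
    awk '$2 ~ /[0-9]/ { print }'
}

# Print records whose first field is strictly between 0 and 100.
first_in_range() {
    awk '$1 > 0 && $1 < 100 { print }'
}
```

For example, `printf 'a:1\nb:2\nc:3\n' | sum_second_field` adds up 1, 2 and 3.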

> Aside from function definitions, an awk program
> consists of pairs like this:
>
> <boolean expression> { <actions> }
>
> If the boolean expression is omitted, the default is naturally 1
> (TRUE). (It couldn't default to 0, now could it?)


It could, but that would be less useful.

> # print the 2nd field of every record
> { print $2 }
>
> If the actions are omitted, the default is naturally { print $0 }.
> (What else would it be?)


A syntax error. At least, that's what I had assumed it was.

For what it's worth, I learned "awk" by reading through the manual
page. No wait, by skimming the manual page, looking for just enough
knowledge to solve the problem at hand. So even if it does say in
the manual page that either can be omitted, I didn't notice.

- Logan

Copyright 2003 - 2008 webservertalk.com