Unix Shell - print one occurrence for duplicate entries

Harry

2005-11-28, 6:04 pm

I have a text file with a lot of repetitive entries like the ones below.
For lines with the same Image Orientation values, I want to print
only the 1st entry (including the file name on the preceding line).

Any help appreciated.
TIA

[Input File]
cat -n xx.txt
1 ./2/Image/w4028320/view0228.dcm
2 000005AA: (0020,0037) [DS/60]
"0.996306\0.00783452\0.0855103\0.0082795
5\0.982424\-0.186478 " ; Image Orientation (Patient)
3 ./2/Image/w4028320/view0229.dcm
4 000005AA: (0020,0037) [DS/60]
"0.996306\0.00783452\0.0855103\0.0082795
5\0.982424\-0.186478 " ; Image Orientation (Patient)
5 ./2/Image/w4028320/view0230.dcm
6 000005AA: (0020,0037) [DS/60]
"0.996306\0.00783452\0.0855103\0.0082795
5\0.982424\-0.186478 " ; Image Orientation (Patient)
7 ./2/Image/w4089355/view0007.dcm
8 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 ";
Image
Orientation (Patient)
9 ./2/Image/w4089355/view0008.dcm
10 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 ";
Image
Orientation (Patient)
11 ./2/Image/w4089355/view0009.dcm
12 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 ";
Image
Orientation (Patient)

[Expected Output]
1 ./2/Image/w4028320/view0228.dcm
2 000005AA: (0020,0037) [DS/60]
"0.996306\0.00783452\0.0855103\0.0082795
5\0.982424\-0.186478 " ; Image Orientation (Patient)
3 ./2/Image/w4089355/view0007.dcm
4 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 ";
Image
Orientation (Patient)

Ed Morton

2005-11-28, 8:49 pm

Harry wrote:
> I have a text file with a lot of repetitive entries like the ones below.
> For lines with the same Image Orientation values, I want to print
> only the 1st entry (including the file name on the preceding line).
>
> Any help appreciated.
> TIA
>


This looks like it might be what you're trying to do:

awk 'NR%2{t=$0;next}!($3 in f){f[$3];print t"\n"$0}' file

but it's hard to say without more details and because of the
line-wrapping in your example input.
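
For readers newer to awk, the one-liner above can be spelled out as a commented script. This is only a sketch of the same logic under the same assumption (file names on odd lines, orientation records on even lines). The file name sample.txt and the variable names are made up for illustration, and the sample records are shown unwrapped, one per line, which is a guess at how the real file looks.

```shell
# Hypothetical sample: file names on odd lines, orientation records on even lines.
cat > sample.txt <<'EOF'
./2/Image/w4028320/view0228.dcm
000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.00827955\0.982424\-0.186478 " ; Image Orientation (Patient)
./2/Image/w4028320/view0229.dcm
000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.00827955\0.982424\-0.186478 " ; Image Orientation (Patient)
./2/Image/w4089355/view0007.dcm
0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 "; Image Orientation (Patient)
EOF

out=$(awk '
NR % 2 {            # odd line: remember the file name, print nothing yet
    name = $0
    next
}
!($3 in seen) {     # even line: act only if field 3 has not been seen before
    seen[$3]        # referencing the element is enough to create it
    print name "\n" $0
}' sample.txt)
printf '%s\n' "$out"
```

With this sample, only the view0228 and view0007 pairs are printed, because the view0229 record has the same third field as the view0228 record.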


Regards,

Ed.
Harry

2005-11-29, 2:50 am

Ed Morton wrote...

>This looks like it might be what you're trying to do:
>
>awk 'NR%2{t=$0;next}!($3 in f){f[$3];print t"\n"$0}' file


Thanks Ed for your quick response.

This apparently removed too many entry pairs that should have stayed.
The text file is about 40000 lines; it should not be assumed that
odd line #'s have the filename and even line #'s have the ImageOrientation.

N.B. when you wrote NR%2, were you making the odd/even assumption
above? I'm not good at awk, so please excuse me if I guessed
wrong.

I am trying the following pseudocode; it gets parse errors on
the if statement. Maybe someone can see what I intended to
do and give me a fix.

#!/usr/bin/gawk -f
BEGIN {
getline
last_filename = $0
getline
last_Orient = $0
}
/dcm/ {
this_filename = $0
getline
this_Orient = $0
if this_Orient != last_Orient then {
print last_filename
print last_Orient
last_filename = this filename
last_Orient = this_Orient
}
}

Thanks

Ed Morton

2005-11-29, 2:50 am

Harry wrote:
> Thanks Ed for your quick response.
>
> This apparently removed too many entry pairs that should have stayed.


It's using the 3rd field to identify duplicates. Based on your posted
script below it looks like you actually want to use the whole record
($0). That's very easily fixed (see below).

> The text file is about 40000 lines; it should not be assumed that
> odd line #'s have the filename and even line #'s have the ImageOrientation.
>
> N.B. when you wrote NR%2, were you making the odd/even assumption
> above? I'm not good at awk, so please excuse me if I guessed
> wrong.


That's correct.

> I am trying the following pseudocode; it gets parse errors on
> the if statement. Maybe someone can see what I intended to
> do and give me a fix.
>
> #!/usr/bin/gawk -f
> BEGIN {
> getline
> last_filename = $0
> getline
> last_Orient = $0
> }
> /dcm/ {
> this_filename = $0
> getline
> this_Orient = $0
> if this_Orient != last_Orient then {
> print last_filename
> print last_Orient
> last_filename = this filename
> last_Orient = this_Orient
> }
> }
>
> Thanks
>


This will fix the syntax errors:

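#!/usr/bin/gawk -f
BEGIN {
    # Prime the "last" pair with the first two lines of input.
    getline last_filename
    getline last_Orient
}
/dcm/ {
    this_filename = $0
    getline this_Orient
    # When the orientation changes, print the pair that just ended.
    # Note: the final group is never printed, since no change follows it.
    if (this_Orient != last_Orient) {
        print last_filename
        print last_Orient
        last_filename = this_filename
        last_Orient = this_Orient
    }
}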

but almost all solutions that use getline are bad awk style at best or,
more commonly, dangerously buggy or just plain wrong. Try this instead:

awk '/dcm/{t=$0;next}!($0 in f){f[$0];print t"\n"$0}' file

If that doesn't work, you'd be much better off trying to fix that (e.g.
by providing more information here) than trying to make your posted
script above work as it's starting way off on the wrong track.
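
One concrete reason getline is easy to get wrong: it returns 1 when a line was read, 0 at end of input and -1 on error, and a script that ignores that value can silently carry on with stale data. If getline is used at all, a guarded form is safer. The following is only a sketch with made-up file and variable names (pairs.txt, name, rec), not a recommendation to build the solution on getline.

```shell
# Hypothetical input with an odd number of lines: the last .dcm line has no partner.
printf '%s\n' 'a.dcm' 'rec A' 'b.dcm' > pairs.txt

out=$(awk '
/dcm$/ {
    name = $0
    # Test the return value: > 0 means a line was actually read into rec.
    if ((getline rec) > 0)
        print name " -> " rec
    else
        print name " -> (no record follows)"
}' pairs.txt)
printf '%s\n' "$out"
```

Without the test, the unguarded form would report b.dcm together with the record that belonged to a.dcm.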

Regards,

Ed.
Harry

2005-11-29, 2:50 am

Ed Morton wrote...

>but almost all solutions that use getline are bad awk style at best or,
>more commonly, dangerously buggy or just plain wrong. Try this instead:
>
>awk '/dcm/{t=$0;next}!($0 in f){f[$0];print t"\n"$0}' file
>
>If that doesn't work, you'd be much better off trying to fix that (e.g.
>by providing more information here) than trying to make your posted
>script above work as it's starting way off on the wrong track.


Thanks Ed. It's getting better.

See below a portion of input and corresponding output.

In the output, lines 8 and 10 are still duplicates (in terms of
the 6 values within the quotes), and so are lines 12 and 14.
The only difference between lines 8 and 10 is their 1st field,
which should be ignored.

-- output (cat -n output) -- begin --
1 ./2/ImageDatabase/w4028320/view0114.dcm
2 000005A6: (0020,0037) [DS/60] "0.994367\-7.77195e-09\0.105995\0.
0229776\0.976221\-0.215558 " ;
3 ./2/ImageDatabase/w4028320/view0152.dcm
4 000005A0: (0020,0037) [DS/58] "-0.440021\0.89763\0.0253318\-
0.0137912\0.0214511\-0.999675" ;
5 ./2/ImageDatabase/w4028320/view0170.dcm
6 000005AA: (0020,0037) [DS/62] "0.999905\-1.11616e-10\-0.0137944\0.
000295905\0.99977\0.021449 " ;
7 ./2/ImageDatabase/w4028320/view0210.dcm
8 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
9 ./2/ImageDatabase/w4028320/view0221.dcm
10 000005A8: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
11 ./2/ImageDatabase/w4089355/view0007.dcm
12 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 "
;
13 ./2/ImageDatabase/w4089355/view0010.dcm
14 0000062E: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 "
;
;
-- output (cat -n output) -- end --

-- portion of input (cat -n input) -- begin --
1 ./2/ImageDatabase/w4028320/view0114.dcm
2 000005A6: (0020,0037) [DS/60] "0.994367\-7.77195e-09\0.105995\0.
0229776\0.976221\-0.215558 " ;
3 ./2/ImageDatabase/w4028320/view0152.dcm
4 000005A0: (0020,0037) [DS/58] "-0.440021\0.89763\0.0253318\-
0.0137912\0.0214511\-0.999675" ;
5 ./2/ImageDatabase/w4028320/view0170.dcm
6 000005AA: (0020,0037) [DS/62] "0.999905\-1.11616e-10\-0.0137944\0.
000295905\0.99977\0.021449 " ;
7 ./2/ImageDatabase/w4028320/view0210.dcm
8 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
9 ./2/ImageDatabase/w4028320/view0211.dcm
10 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
11 ./2/ImageDatabase/w4028320/view0212.dcm
12 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
13 ./2/ImageDatabase/w4028320/view0213.dcm
14 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
15 ./2/ImageDatabase/w4028320/view0214.dcm
16 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
17 ./2/ImageDatabase/w4028320/view0215.dcm
18 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
19 ./2/ImageDatabase/w4028320/view0216.dcm
20 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
21 ./2/ImageDatabase/w4028320/view0217.dcm
22 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
23 ./2/ImageDatabase/w4028320/view0218.dcm
24 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
25 ./2/ImageDatabase/w4028320/view0219.dcm
26 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
27 ./2/ImageDatabase/w4028320/view0220.dcm
28 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
29 ./2/ImageDatabase/w4028320/view0221.dcm
30 000005A8: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
31 ./2/ImageDatabase/w4028320/view0222.dcm
32 000005A8: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
33 ./2/ImageDatabase/w4028320/view0223.dcm
34 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
35 ./2/ImageDatabase/w4028320/view0224.dcm
36 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
37 ./2/ImageDatabase/w4028320/view0225.dcm
38 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
39 ./2/ImageDatabase/w4028320/view0226.dcm
40 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
41 ./2/ImageDatabase/w4028320/view0227.dcm
42 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
43 ./2/ImageDatabase/w4028320/view0228.dcm
44 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
45 ./2/ImageDatabase/w4028320/view0229.dcm
46 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
47 ./2/ImageDatabase/w4028320/view0230.dcm
48 000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.
00827955\0.982424\-0.186478 " ;
49 ./2/ImageDatabase/w4089355/view0007.dcm
50 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 "
;
51 ./2/ImageDatabase/w4089355/view0008.dcm
52 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 "
;
53 ./2/ImageDatabase/w4089355/view0009.dcm
54 0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 "
;
55 ./2/ImageDatabase/w4089355/view0010.dcm
56 0000062E: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 "
; ;
-- portion of input -- end --
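
Since only the six values between the double quotes matter, one option is to split each record on the double-quote character and use the quoted text as the key, which avoids counting whitespace-separated fields. This is a sketch, assuming each record sits on a single line in the real file (the wrapping above being an artifact of posting) and that file-name lines end in .dcm. The name dedup.txt is made up.

```shell
cat > dedup.txt <<'EOF'
./2/ImageDatabase/w4028320/view0210.dcm
000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.00827955\0.982424\-0.186478 " ;
./2/ImageDatabase/w4028320/view0221.dcm
000005A8: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.00827955\0.982424\-0.186478 " ;
./2/ImageDatabase/w4089355/view0007.dcm
0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 " ;
EOF

out=$(awk -F'"' '
/\.dcm$/ { name = $0; next }      # remember the most recent file name
!($2 in seen) {                   # $2 is the text between the first pair of quotes
    seen[$2]
    print name "\n" $0
}' dedup.txt)
printf '%s\n' "$out"
```

Here the view0221 record is dropped even though its first field (000005A8:) differs from the view0210 record, because the quoted values are identical.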


Harry

2005-11-29, 2:50 am

Ed Morton wrote...

>but almost all solutions that use getline are bad awk style at best or,
>more commonly, dangerously buggy or just plain wrong. Try this instead:
>
>awk '/dcm/{t=$0;next}!($0 in f){f[$0];print t"\n"$0}' file



I shouldn't compare the whole line. You are right, when I combined your
two solutions together, it became this.

awk '/dcm/{t=$0;next}!($3 in f){f[$3];print t"\n"$0}' file

And it worked great when I examined a small portion of input vs
output. As I said, the input is 40000 lines long. I'll check more
lines visually.

Thank you very much.


Harry

2005-11-29, 2:50 am

Harry wrote...
>I shouldn't compare the whole line. You are right, when I combined your
>two solutions together, it became this.
>
> awk '/dcm/{t=$0;next}!($3 in f){f[$3];print t"\n"$0}' file


BTW, it should be the 4th field (the one with the 6 values), not the 3rd field.

awk '/dcm/{t=$0;next}!($4 in f){f[$4];print t"\n"$0}' file
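
To see why the 4th field is the right one, it can help to print how awk numbers the whitespace-separated fields of a record. A small sketch, using one record from the sample above joined onto a single line (an assumption about the real file):

```shell
rec='000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.00827955\0.982424\-0.186478 " ;'

# Print every field with its number, one per line.
out=$(printf '%s\n' "$rec" | awk '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }')
printf '%s\n' "$out"
```

Field 3 is [DS/60], which several different orientations share, while field 4 is the quoted run of six values. Because of the space before the closing quote, that quote ends up alone in field 5.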

Thanks again.
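
Rather than checking 40000 lines by eye, the result could be sanity-checked by counting: the number of distinct 4th-field values in the input should equal the number of record lines kept in the output, and no 4th-field value should repeat there. A sketch under the thread's assumptions (file-name lines contain dcm, the orientation value is the 4th whitespace-separated field). The names in.txt and out.txt are placeholders.

```shell
# Placeholder sample; in practice in.txt would be the real 40000-line file.
cat > in.txt <<'EOF'
./2/ImageDatabase/w4028320/view0220.dcm
000005AA: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.00827955\0.982424\-0.186478 " ;
./2/ImageDatabase/w4028320/view0221.dcm
000005A8: (0020,0037) [DS/60] "0.996306\0.00783452\0.0855103\0.00827955\0.982424\-0.186478 " ;
./2/ImageDatabase/w4089355/view0009.dcm
0000062A: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 " ;
./2/ImageDatabase/w4089355/view0010.dcm
0000062E: (0020,0037) [DS/30] "1\-0\-0\0\-0.052336\-0.998629 " ;
EOF

# The filter from the thread, keyed on field 4.
awk '/dcm/{t=$0;next}!($4 in f){f[$4];print t"\n"$0}' in.txt > out.txt

# Distinct field-4 values among the record lines of the input.
distinct=$(awk '!/dcm/ && !($4 in f) { f[$4]; n++ } END { print n + 0 }' in.txt)
# Record lines kept in the output, and how many field-4 values repeat there.
kept=$(awk '!/dcm/ { n++ } END { print n + 0 }' out.txt)
repeats=$(awk '!/dcm/ && c[$4]++ { r++ } END { print r + 0 }' out.txt)
echo "distinct=$distinct kept=$kept repeats=$repeats"
```

For this sample the counts are distinct=2, kept=2 and repeats=0. A mismatch on the real file would point to records whose layout differs from the assumed one.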


Copyright 2003 - 2008 webservertalk.com