Unix Shell - Doing thing efficiently


xyz

2004-03-27, 1:36 pm

Suppose I have some big list of filenames (for example, coming from the
output of the "find" command), and I have to do some kind of processing
for each of them and then output the result (the result is still a long
list, but each line is longer, since the results of the processing have
been concatenated with the file name).

If the processing is a single command (e.g., running "rm" or "touch" on
each file), this is easily solved with something like

find .... | xargs <command> | etc. etc.

Processes running: find, xargs, <command>, etc. Each one runs once (and
they probably run concurrently).

If, on the other hand, I want to build my output list by running more
than one command on each file (e.g., I want the output lines to include
the "wc -l" for the file concatenated with the "md5sum" for the same
file), things seem to get harder.
The obvious solution (to me at least) is to pipe the list through some
other function or script that, for each input line, runs the commands
and echoes the results, something like (bash)

my_fun()
{
    # read -r and the quotes keep names with spaces or backslashes intact
    while IFS= read -r f; do
        part1=$(wc -l "$f")
        part2=$(md5sum "$f" | cut -d ' ' -f 1)
        echo "$part1 $part2"
    done
}

find .... | my_fun | etc.etc.

But this means that, if the original list has 50000 filenames, the
commands "wc", "md5sum" and "cut" run 50000 times each. This seems
quite inefficient to me. Is there some other (more efficient) way to
run the pipeline? I've not been able to find such a solution, but I'm
no shell scripting guru, so I'm asking on this list to make sure the
answer is actually "no". If I'm asking something that really doesn't
make sense (or I'm missing something obvious), please let me know.

Thanks.
Alan Connor

2004-03-27, 2:35 pm

On 27 Mar 2004 10:30:42 -0800, xyz <persson@katamail.com> wrote:
>
>
> Suppose I have some big list of filenames (for example coming from the
> output of the "find" command), and I have to do some kind of processing
> for each of them, and then output the result (the result is still a long
> list, but each line is longer since the results of the processing have
> been concatenated with the file name).
>
> If the processing to do is a single command (ie, running "rm" or
> "touch" on each file), this is easily solved with something like
>
> find .... | xargs <command> | etc. etc.
>


How about :

find ......... -exec <command> {} \;

??

But the difference in time between the two is substantial:

with xargs -

real 0m0.189s
user 0m0.050s
sys 0m0.000s

with -exec -

real 0m4.317s
user 0m2.300s
sys 0m1.760s

That surprises me...

AC




Alexis Huxley

2004-03-27, 3:37 pm

> If, on the other side, i want to build my output list running more
> than one command on each file (ie, I want the output lines to include
> the "wc -l" for


Not sure if this is exactly what you were after, but you can get
xargs to do multiple commands on the same file by using a combination
of 'sh -c ...' to get more than one command into the single command
xargs is willing to execute for you, and xargs's -i option to embed the
files in the middle of the command instead of leaving xargs to stick
them on the end, e.g.:

find .... | xargs -iTHING sh -c "echo 'THING'; echo 'THING'"

I'm running the same stupid command twice in this example, but you're
free to do whatever you like of course :-)

The quotes around the last two THINGs (single or double quotes but
different to what the two echo commands are enclosed in) are only
needed if your find command will generate file names with spaces
in them.

This was using GNU xargs, so take care with other OSs.
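
A hedged variant of the same idea, as a minimal sketch: rather than letting xargs paste the file name into the script text, pass it to the inner shell as a positional parameter, so the name is never re-parsed as shell code. This assumes an xargs that supports -I (the spelling GNU xargs recommends over -i); the scratch directory and the file name are made up for the demonstration. Note that xargs itself still treats quotes and backslashes in its input specially.

```shell
#!/bin/sh
# Scratch directory with a file whose name contains a space (sample data).
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/two words.txt"

# -I{} runs the command once per input line and replaces {} with that line.
# The name reaches the inner shell as "$1"; the extra word "sh" after the
# script only fills $0.
out=$(find "$dir" -type f |
    xargs -I{} sh -c 'echo "name: $1"; wc -l < "$1"' sh {})

printf '%s\n' "$out"
rm -rf "$dir"
```

Like the -i version, this still starts one shell (and one wc) per file; it only changes how the name is handed over.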

Alexis

http://dione.no-ip.org/~alexis/
xyz

2004-03-28, 8:35 am

Alexis Huxley <ahuxley@gmx.net> wrote in message news:<slrnc6bnei.gs0.ahuxley@dione.no-ip.org>...

> Not sure if this is exactly what you were after, but you can get
> xargs to do multiple commands on the same file by using a combination
> of 'sh -c ...' to get more than one command into the single command
> xargs is willing to execute for you, and xargs's -i option to embed the
> files in the middle of the command instead of leaving xargs to stick
> them on the end, e.g.:
>
> find .... | xargs -iTHING sh -c "echo 'THING'; echo 'THING'"
>
> I'm running the same stupid command twice in this example, but you're
> free to do whatever you like of course :-)


In short, I was looking for a way to run more than one command on each
filename in a long list and combine the output of all the commands
(possibly with the original filename, if it is not already included in
the output of one of the commands), but without having to run each
command "n" times ("n" being the number of filenames in the initial
list).

I hadn't thought of the approach you suggest. However, it seems to me
that "sh" runs only once, but each of the commands I put after the -c
still runs n times, so it's not what I was after.

The difficulty in my problem (and note, I don't know if this is the
right approach at all) is that I want to use, in a pipeline, programs
that don't read their file list from stdin but expect it as arguments
on the command line, like md5sum or wc -l: I don't want the checksum or
line count of the file list itself; I want this info for each file.
So I was wondering if there was a way to do some sort of "multi-xargs"
that feeds more than one command at once (sorry for the poor
definition) and somehow recombines all the output afterwards. The goal
is to have each of the "target" commands (the commands xargs passes
the arguments to) run only once instead of n times, and of course
without making more than one pass over the initial data.

As I said before, it's possible that I'm missing something obvious or
that I'm taking the wrong approach to the problem.

Thanks for taking the time to answer my previous post, and thanks for
the suggestion.
xyz

2004-03-28, 8:35 am

Alan Connor <zzzzzz@xxx.yyy> wrote in message news:<NLk9c.3071$Dv2.550@newsread2.news.pas.earthlink.net>...

> How about :
>
> find ......... -exec <command> {} \;


This is good, but the output of the above is still a list that can't
easily be piped to another command that uses only one part of each
line (e.g., if <command> is md5sum, the output is something like:

d41d8cd98f00b204e9800998ecf8427e file1
84bc0f7c8cd201c66c4632dd58d8ddae file2
e1169cd14d235dc96eedbaaa9c00ae8d file3
etc.

and this is not easily reused in a pipeline to calculate wc -l for
each file).

I could put there two or more -exec commands, but:
- each <command> will output on a separate line; and
- each <command> is still executed n times (n being the number of
files that find finds).

Many thanks for the reply. As I said before, it's entirely possible
(and maybe likely) that I'm taking the wrong approach to the problem
(ie, I don't have to use pipelines at all, or something else).
Alexis Huxley

2004-03-28, 11:35 am

> In short, I was looking for a way to run more than one command on each
> filename in a long list, combine the output of all the commands
> (possibly with the original filename, if not already included in the
> output of some of the commands), but without having to run each
> command "n" times ("n" being the number of filenames in the initial
> list).


Ahh ... okay. I think I understand now. Then how about :

{ echo 1; echo 2; echo 3; } | xargs sh -c 'echo "$@"; echo "$@"' DUMMY

The '{ .... }' is like your 'find' in that it generates one item per
line. But xargs passes more than one line's worth of items at a time to
the command given as its (xargs's) parameter. The 'sh -c ..... DUMMY'
is a way to wrap multiple commands in a single command that still
takes command-line parameters: DUMMY just fills $0, so the items
arrive as "$@".

If the 'echo's were being passed only one parameter at a time then
you would see:

1
1
2
2
3
3

and if they were being passed multiple file names at a time then you
would see:

1 2 3
1 2 3

which is what you do see. So in your case, your md5sum or whatever is
invoked fewer times, which I think is what you want right? And your
other command gets the same chunk of file names immediately after.
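
Applied to the wc/md5sum case from the original question, the same pattern might look like the sketch below. The scratch directory and its three files are made-up sample data, and the names are assumed to contain no whitespace or quotes (xargs would split or reinterpret them otherwise).

```shell
#!/bin/sh
# Sample data: three small files in a scratch directory.
dir=$(mktemp -d)
printf 'x\ny\n' > "$dir/a"
printf 'z\n'    > "$dir/b"
: > "$dir/c"

# xargs hands each batch of names to one inner shell; inside it, wc and
# md5sum each run once for the whole batch rather than once per file.
out=$(find "$dir" -type f | sort |
    xargs sh -c 'wc -l "$@"; md5sum "$@"' DUMMY)

printf '%s\n' "$out"
rm -rf "$dir"
```

The output is all the wc lines (plus a "total" line, because wc was given several files) followed by all the md5sum lines, so the two halves still have to be recombined per file.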

You might also try something like:

.... | xargs sh -c 'echo "$@" &
echo "$@" &
wait' DUMMY

so you could run stuff in parallel, but then the output *could* be
a bit messed up and difficult to recombine. Maybe you redirect
the echo's with '>> /tmp/first-one' and '>> /tmp/second-one' and
then use 'join' to recombine them.
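
A minimal sketch of that last suggestion, under some assumptions: the intermediate files live in a scratch directory instead of /tmp directly, awk puts the file name first on each line, and no name contains whitespace. The sample files are invented for the demonstration.

```shell
#!/bin/sh
# Sample data: two small files in a scratch directory.
dir=$(mktemp -d)
tmp=$(mktemp -d)            # holds the intermediate lists
export tmp                  # so the inner shell can see it
printf 'x\ny\n' > "$dir/a"
printf 'z\n'    > "$dir/b"

find "$dir" -type f | xargs sh -c '
    # name and line count; wc adds a total line per batch, so drop it
    wc -l "$@" | awk "\$2 != \"total\" { print \$2, \$1 }" >> "$tmp/first-one" &
    # name and checksum
    md5sum "$@" | awk "{ print \$2, \$1 }" >> "$tmp/second-one" &
    wait' DUMMY

# join needs both inputs sorted on the join field (field 1, the name).
LC_ALL=C sort -k 1,1 "$tmp/first-one"  > "$tmp/first.sorted"
LC_ALL=C sort -k 1,1 "$tmp/second-one" > "$tmp/second.sorted"
out=$(LC_ALL=C join "$tmp/first.sorted" "$tmp/second.sorted")

printf '%s\n' "$out"        # one "name count checksum" line per file
rm -rf "$dir" "$tmp"
```

Because each background job appends to its own file, the two outputs cannot interleave with each other, and the sort before join makes the arrival order irrelevant.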

Alexis
Chris F.A. Johnson

2004-03-28, 5:35 pm

On Sat, 27 Mar 2004 at 18:30 GMT, xyz wrote:

> Suppose I have some big list of filenames (for example coming from
> the output of the "find" command), and I have to do some kind of
> processing for each of them, and then output the result (the result
> is still a long list, but each line is longer since the results of
> the processing have been concatenated with the file name).


> If the processing to do is a single command (ie, running "rm" or
> "touch" on each file), this is easily solved with something like


>
> find .... | xargs <command> | etc. etc.


> Processes running: find, xargs, <command>, etc. Each one runs once
> (and they probably run concurrently).


> If, on the other side, i want to build my output list running more
> than one command on each file (ie, I want the output lines to
> include the "wc -l" for the file concatenated with the "md5sum" for
> the same file), things seem to get harder.


> The obvious solution (to me at least) is to pipe the list through
> some other function or script that, for each input line, runs the
> commands and echoes the results, something like (bash)


> my_fun()
> {
> while read f; do
> part1=`wc -l $f`
> part2=`md5sum $f | cut -d ' ' -f 1`
> echo "$part1 $part2"
> done
> }
>
> find .... | my_fun | etc.etc.
>


> But this means that, if the original list has 50000 filenames, the
> commands "wc", "md5sum" and "cut" run 50000 times each. This seems
> quite inefficient to me. Is there some other (more efficient) way to
> run the pipeline? I've not been able to find such a solution, but
> I'm no shell scripting guru, so I'm asking on this list to make sure
> the answer is actually "no". If I'm asking something that really
> doesn't make sense (or I'm missing something obvious), please let me
> know.


Using a shell with arrays (e.g., bash or ksh), store the list of
files in an array:

IFS='
'
TAB=$(printf '\t') ## a literal TAB
LIST=( `find ....` ) ## not ksh88


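Then process the list with each command, storing the result in a
file sorted by file name (join needs sorted input, and wc adds a
"total" line that has to be dropped):

wc -l "${LIST[@]}" | awk '$2 != "total" {print $1 "\t" $2 }' | sort -t "$TAB" -k 2 > wc-list
md5sum "${LIST[@]}" | tr -s ' ' '\t' | sort -t "$TAB" -k 2 > md5-list

Then use join to combine them:

join -t "$TAB" -j 2 md5-list wc-list | cut -f2-

join prints the file name first and the cut drops it, so leave the
cut off if you want the name in the output. All of this assumes
there is no whitespace in the file names.

If you have 50,000 files, this isn't going to work as is (the
command line will be too long), but you can use the same principle
(with xargs, for example).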
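
One way the xargs variant could look, as a sketch rather than a definitive version: the list is saved to a file once and fed to each command through xargs, which splits it over as many command lines as needed. The scratch directories and sample files are invented, file names are assumed to contain no whitespace or quotes, and the filter would also drop a file literally named "total".

```shell
#!/bin/sh
TAB=$(printf '\t')
dir=$(mktemp -d)             # sample data
work=$(mktemp -d)            # intermediate lists
printf '1\n2\n3\n' > "$dir/f1"
printf '1\n'       > "$dir/f2"

# Save the list once so that both commands see exactly the same names.
find "$dir" -type f > "$work/list"

# Each command runs once per batch of names, not once per file.
xargs wc -l < "$work/list" |
    awk '$2 != "total" { print $1 "\t" $2 }' |
    LC_ALL=C sort -t "$TAB" -k 2,2 > "$work/wc-list"
xargs md5sum < "$work/list" |
    awk '{ print $1 "\t" $2 }' |
    LC_ALL=C sort -t "$TAB" -k 2,2 > "$work/md5-list"

# Join on field 2 (the name); join prints the name first, then the
# checksum from md5-list, then the count from wc-list.
out=$(LC_ALL=C join -t "$TAB" -1 2 -2 2 "$work/md5-list" "$work/wc-list")

printf '%s\n' "$out"
rm -rf "$dir" "$work"
```

The price is a second pass over the saved list, but wc and md5sum are each started only a handful of times even for tens of thousands of files.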

--
Chris F.A. Johnson http://cfaj.freeshell.org/shell
===================================================================
My code (if any) in this post is copyright 2004, Chris F.A. Johnson
and may be copied under the terms of the GNU General Public License
John DuBois

2004-03-28, 6:34 pm

In article <b972aef1.0403271030.5ef63421@posting.google.com>,
xyz <persson@katamail.com> wrote:
....
>If, on the other side, i want to build my output list running more than one
>command on each file (ie, I want the output lines to include the "wc -l" for
>the file concatenated with the "md5sum" for the same file), things seem to get
>harder. The obvious solution (to me at least) is to pipe the list through some
>other function or script that, for each input line, runs the commands and
>echoes the results, something like (bash)
>
>my_fun()
>{
> while read f; do
> part1=`wc -l $f`
> part2=`md5sum $f | cut -d ' ' -f 1`
> echo "$part1 $part2"
> done
>}
>
>find .... | my_fun | etc.etc.
>
>But this means that, if the original list has 50000 filenames, the
>commands "wc", "md5sum" and "cut" run 50000 times each. This seems quite
>inefficient to me. Is there some other (more efficient) way to run the pipeline?


Your main problem will be synchronization. If any of the commands you run
doesn't produce exactly one output line per file (for example, 0 if it has a
problem with a file), its output will cease to be synchronized with the others.
I don't do this unless I can either ensure such synchronization, or ensure that
synchronization failure will be detected (e.g. by having each command prefix
each output line with the filename, and then in the collating phase comparing
the filenames). But in any case, here's one approach, using ksh coprocesses
(this can be generalized using multiple coprocesses):

find . ! -type d | xargs -e -i ksh -c '
    md5sum {} | cut -d " " -f 1 |&
    wc -l {} | while read wcout; do
        read -p md5out
        print -r -- "$wcout $md5out"
    done'
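
The synchronization check mentioned above (each command prefixes its output with the filename, and the collating phase compares the names) could be sketched as follows. The function name collate, the tab-separated "name value" layout and the sample data are all assumptions made for illustration.

```shell
#!/bin/sh
# collate FILE1 FILE2: each file holds "name<TAB>value" lines, one per
# input file, in the same order.  Prints name, value1, value2 and stops
# with a non-zero status at the first line where the names disagree.
collate() {
    paste "$1" "$2" | awk -F '\t' '
        $1 != $3 {
            printf "out of sync at line %d: %s vs %s\n", NR, $1, $3 > "/dev/stderr"
            exit 1
        }
        { print $1 "\t" $2 "\t" $4 }'
}

work=$(mktemp -d)
printf 'a\t2\nb\t1\n'           > "$work/wc"
printf 'a\thash-a\nb\thash-b\n' > "$work/md5"
printf 'a\thash-a\n'            > "$work/short"   # a command that skipped b

good=$(collate "$work/wc" "$work/md5")
if collate "$work/wc" "$work/short" > /dev/null 2>&1; then bad=0; else bad=1; fi

printf '%s\n' "$good"
echo "mismatch detected: $bad"
rm -rf "$work"
```

When one list is shorter, paste pads the missing side with empty fields, so the name comparison fails on the first unmatched line instead of silently pairing the wrong values.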

John
--
John DuBois spcecdt@armory.com KC6QKZ/AE http://www.armory.com/~spcecdt/
Stachu 'Dozzie' K.

2004-03-29, 7:39 am

On 2004-03-27, Alan Connor wrote:
> But the difference in time between the two is substantial:
>
> with xargs -
>
> real 0m0.189s
> user 0m0.050s
> sys 0m0.000s
>
> with -exec -
>
> real 0m4.317s
> user 0m2.300s
> sys 0m1.760s
>
> That surprises me...

GNU xargs runs the command only once with all of the input (as long as
it fits on one command line), while find's -exec ... \; runs the command
once per file. That is why `ls -1 | xargs echo' produces one line rather
than several.
Note that xargs splits lines containing spaces into several arguments
unless you give the -i option (see man xargs).
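
A small sketch that makes the difference visible by counting output lines, using echo as the command. The scratch directory and its three empty files are invented for the demonstration; the last form, -exec ... {} +, is the batching variant that POSIX find provides itself.

```shell
#!/bin/sh
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

# One echo process per file: three output lines.
per_file=$(find "$dir" -type f -exec echo {} \; | wc -l)

# xargs packs the names onto one command line: one echo, one line.
batched=$(find "$dir" -type f | xargs echo | wc -l)

# find batches by itself when -exec ends in + instead of \;
plus=$(find "$dir" -type f -exec echo {} + | wc -l)

echo "per file:" $per_file "xargs:" $batched "-exec +:" $plus
rm -rf "$dir"
```

Fewer lines means fewer processes started, which is where the timing difference reported earlier in the thread comes from.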

--
Stanislaw Klekot