Home > Archive > Unix Shell > January 2006 > Reuse stdout
I frequently tar a big directory (5-15 GB) and then split it into smaller
pieces. Finally I md5 both the original tar file and the resulting
segments. Something like this, if I do it one command at a time:
% tar cf dir.tar dir
% split -b 100m dir.tar dir.tar.
% ls
dir.tar
dir.tar.aa
dir.tar.ab
dir.tar.ac
% md5 dir.tar
% md5 dir.tar.*
This process is a little 'long-winded'. Besides, it generates a lot of
unnecessary disk I/O which takes a lot of time since I do this on a
laptop with a relatively slow drive. I have written the following
'one-liner':
% tar cf - dir | tee >(md5 >&2) | split -b 10m - dir.tar.; md5 dir.tar.*
which improves the situation but the last step still requires disk IO.
What I would like to do is to pipe the result from tar into _both_ md5
and split and that then from split both pipe into md5 and save to disk,
in other words, reuse stdout from split. Is this doable?
Just to clarify: after the process I want md5-sums printed for all
files, both the original tar (so that the receiver can verify his tar
when he joins the files) and the segments, to the terminal. Furthermore,
I want the segments stored in pwd (or whatever directory I decide)
I would very much appreciate a more elegant 'one-liner' :-)
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Stephane CHAZELAS 2006-01-24, 6:23 pm
2006-01-24, 20:25(+01), david:
[...]
> % tar cf - dir | tee >(md5 >&2) | split -b 10m - dir.tar.; md5 dir.tar.*
>
> which improves the situation but the last step still requires disk IO.
> What I would like to do is to pipe the result from tar into _both_ md5
> and split and that then from split both pipe into md5 and save to disk,
> in other words, reuse stdout from split. Is this doable?
>
> Just to clarify: after the process I want md5-sums printed for all
> files, both the original tar (so that the receiver can verify his tar
> when he joins the files) and the segments, to the terminal. Furthermore,
> I want the segments stored in pwd (or whatever directory I decide)
>
>
> I would very much appreciate a more elegant 'one-liner' :-)
[...]
You can't do it with split, as split chooses the names of the
files and opens them by itself. You could use a loop that calls
dd to read 10 MB at a time and redirects its output to tee, but
you'd need to have dd read one byte at a time because it
reads from a pipe.
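tar cf - dir | tee >(md5 >&2) | {
  c=1
  while :; do
    # ibs=1: every read returns exactly one byte, so count is a byte count
    dd ibs=1 obs=4096 count=10485760 | tee "dir.tar.$c" | md5 >&2
    # an empty segment means the input is exhausted: remove it and stop
    # (done here because in bash $c is not visible after the pipeline)
    [ -s "dir.tar.$c" ] || { rm -f "dir.tar.$c"; break; }
    c=$(($c + 1))
  done
}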
--
Stéphane
Bill Marcum 2006-01-24, 6:23 pm
On Tue, 24 Jan 2006 20:25:50 +0100, david
<messages.from.usenetREMOVETHIS@gmail.com> wrote:
>
> % tar cf - dir | tee >(md5 >&2) | split -b 10m - dir.tar.; md5 dir.tar.*
>
> which improves the situation but the last step still requires disk IO.
> What I would like to do is to pipe the result from tar into _both_ md5
> and split and that then from split both pipe into md5 and save to disk,
> in other words, reuse stdout from split. Is this doable?
>
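The output of split is not stdout; it is a bunch of files. You might
write a loop using dd, something like this (untested):
# md5 always exits 0, so the loop ends on the first empty segment instead
{ n=1; while dd bs=1M count=10 | tee "dir.tar.$n" | md5; [ -s "dir.tar.$n" ]
do n=$((n+1)); done; rm -f "dir.tar.$n"; }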
--
Never ask the barber if you need a haircut.
Michael Paoli 2006-01-25, 2:58 am
david wrote:
> I frequently tar a big directory (5-15 GB) and then split it into smaller
> pieces. Finally I md5 both the original tar file and the resulting
> segments. Something like this, if I do it one command at a time:
>
> % tar cf dir.tar dir
> % split -b 100m dir.tar dir.tar.
> % ls
> dir.tar
> dir.tar.aa
> dir.tar.ab
> dir.tar.ac
>
> % md5 dir.tar
> % md5 dir.tar.*
>
> This process is a little 'long-winded'. Besides it generates a lot of
> unnecessary disk I/O which takes a lot of time since I do this on a
> laptop with a relatively slow drive. I have written the following
> 'one-liner':
>
> % tar cf - dir | tee >(md5 >&2) | split -b 10m - dir.tar.; md5 dir.tar.*
>
> which improves the situation but the last step still requires disk IO.
> What I would like to do is to pipe the result from tar into _both_ md5
> and split and that then from split both pipe into md5 and save to disk,
> in other words, reuse stdout from split. Is this doable?
>
> Just to clarify: after the process I want md5-sums printed for all
> files, both the original tar (so that the receiver can verify his tar
> when he joins the files) and the segments, to the terminal. Furthermore,
> I want the segments stored in pwd (or whatever directory I decide)
>
> I would very much appreciate a more elegant 'one-liner' :-)
Well, it might be more readable if it's more than one line ;-)
First, some useful hints:
named pipes
tee
Due to split's handling of opening output files, it might not be the
optimal tool for doing an I/O-efficient read and md5sum operation. A
"poor man's" quick implementation of something similar to split might
fill the bill fairly well. Anyway, here is a rough, quick example (it can
certainly be improved, e.g. by removing the empty final file
that its "split" operation creates):
$ cat foo
#!/bin/sh
mknod p p &&
mknod p2 p &&
{
<p >.tar.md5 md5sum &
(cd /bin && tar -cf - .) |
tee p |
(
n=1
while :
do
<p2 >.tar.$n.md5 md5sum &
dd bs=4096 count=10 |
tee p2 >tar.$n
[ -s tar.$n ] || break
n=`expr $n + 1`
done
)
}
$ ./foo
....
$ echo $?
0
$ cat .tar.md5
9a7cde66b3ac90db9c19bf1f52760024 -
$ cat tar.[1-9] tar.[1-9][0-9] tar.[1-9][0-9][0-9] | md5sum
9a7cde66b3ac90db9c19bf1f52760024 -
$ cat tar.[1-9] tar.[1-9][0-9] tar.[1-9][0-9][0-9] |
no errors detected
$ cat tar.[1-9] tar.[1-9][0-9] tar.[1-9][0-9][0-9] | tar -tf - | wc -l
112
$ for tmp in tar.[1-9] tar.[1-9][0-9] tar.[1-9][0-9][0-9]
> do md5sum <"$tmp" | cmp - ."$tmp".md5; done; unset tmp
$
Anyway, I fairly commonly use tee and named pipes to avoid redundant
disk I/O operations (e.g. write a large file, and compute md5sum and
sha1 hashes of the data that should be written in the file, all in a
single pass).
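The single-pass idea described above can be sketched roughly as follows. This is only an illustration, assuming `mkfifo`, `md5sum` and `sha1sum` are available (as in GNU coreutils); the file names `big.out`, `big.md5`, `big.sha1` and the two FIFO names are made up for the example, and a `printf` stands in for whatever produces the real data.

```shell
#!/bin/sh
# Write a file and compute two hashes of the same bytes in one pass.
# Assumes md5sum and sha1sum exist; all file names here are hypothetical.
work=$(mktemp -d) || exit 1
cd "$work" || exit 1
mkfifo md5.fifo sha1.fifo

# Start the readers first; each blocks until tee opens its FIFO for writing.
md5sum  <md5.fifo  >big.md5  &
sha1sum <sha1.fifo >big.sha1 &

# tee copies its input to both FIFOs and to stdout, which becomes the file.
printf 'hello world\n' | tee md5.fifo sha1.fifo >big.out

wait                       # let both hashers finish writing their results
rm -f md5.fifo sha1.fifo
cat big.md5 big.sha1
```

The data is read from its producer once; the only disk writes are the output file and the two small hash files.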
Stephane CHAZELAS 2006-01-25, 8:37 am
2006-01-24, 23:03(-08), Michael Paoli:
[...]
> <p >.tar.md5 md5sum &
> (cd /bin && tar -cf - .) |
> tee p |
> (
> n=1
> while :
> do
> <p2 >.tar.$n.md5 md5sum &
> dd bs=4096 count=10 |
dd will do 10 read(2)s from the pipe. If the pipe holds fewer than
4096 bytes at the time of a read, that read returns short, and dd
will read less than 40k in total. The only way to prevent that is to
have ibs=1, or to be sure that tee writes a multiple of 4096 bytes at a
time (but less than PIPE_BUF) to its standard output.
As I believe tee will write as much data as it reads, that
means that tar must write multiples of 4096 bytes.
On this Linux box, GNU tar doesn't:
$ strace -s0 -e write tar cf - . | cat > /dev/null
[...]
write(1, ""..., 10240) = 10240
write(1, ""..., 10240) = 10240
write(1, ""..., 10240) = 10240
write(1, ""..., 10240) = 10240
but outputs more than PIPE_BUF, and more than tee's read size:
$ nice -n 20 tar cf - . | strace -s0 -e read,write tee | cat > /dev/null
[...]
read(0, ""..., 8192) = 8192
write(1, ""..., 8192) = 8192
read(0, ""..., 8192) = 2048
write(1, ""..., 2048) = 2048
read(0, ""..., 8192) = 8192
write(1, ""..., 8192) = 8192
read(0, ""..., 8192) = 2048
$ nice -n 20 tar cf - . | nice -n 10 tee |
strace -s0 -e read,write dd bs=4096 count=10 > /dev/null
[...]
read(0, ""..., 4096) = 4096
write(1, ""..., 4096) = 4096
read(0, ""..., 4096) = 4096
write(1, ""..., 4096) = 4096
read(0, ""..., 4096) = 2048
write(1, ""..., 2048) = 2048
read(0, ""..., 4096) = 4096
write(1, ""..., 4096) = 4096
read(0, ""..., 4096) = 4096
write(1, ""..., 4096) = 4096
read(0, ""..., 4096) = 2048
write(1, ""..., 2048) = 2048
read(0, ""..., 4096) = 4096
write(1, ""..., 4096) = 4096
read(0, ""..., 4096) = 4096
write(1, ""..., 4096) = 4096
read(0, ""..., 4096) = 2048
write(1, ""..., 2048) = 2048
read(0, ""..., 4096) = 4096
write(1, ""..., 4096) = 4096
As you can see, 34 kB were written instead of 40.
It looks like it could be OK when using ibs=2048 (the greatest
common divisor of 8192, 10240 and PIPE_BUF) on that
particular system. But I would rather stick with dd's default of
512 bytes, and remember that only ibs=1 guarantees the number of
bytes read from a pipe by dd.
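The short-read effect can be reproduced on a small scale. The sketch below is only an illustration: `slow_writer` is a made-up helper that delivers 6 bytes in two bursts one second apart, standing in for a tar whose writes do not line up with dd's block size. The first result depends on timing (it will typically be 3), while the ibs=1 result should always be 6.

```shell
#!/bin/sh
# Hypothetical writer: 6 bytes in two bursts, like an uneven producer on a pipe.
slow_writer() { printf 'aaa'; sleep 1; printf 'bbb'; }

# One 6-byte read(2): dd gets whatever is in the pipe at that moment,
# counts it as its single (partial) block, and stops.
short=$(slow_writer | dd bs=6 count=1 2>/dev/null | wc -c)

# Six 1-byte reads: a read of one byte cannot return short,
# so count=6 really means 6 bytes.
exact=$(slow_writer | dd ibs=1 count=6 2>/dev/null | wc -c)

echo "bs=6 count=1  -> $short bytes"
echo "ibs=1 count=6 -> $exact bytes"
```

The price of ibs=1 is one read(2) system call per byte, which is why a larger obs is used to batch the writes.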
--
Stéphane