Unix Shell - bash: processing files a chunk at a time / detecting stdin end of file


Author bash: processing files a chunk at a time / detecting stdin end of file
sillyhat@yahoo.com

2006-12-02, 7:30 am

Hello, can someone please help.

I think I need a way of detecting stdin end of file with bash...

I have a large file, big.txt, that I want to process in chunks
*preferably* using bash. I want to take each chunk, process it and
write the output to a file.

The following works:-

dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
dd if=big.txt ibs=98765 skip=1 count=1 | afilter > out.1.txt
dd if=big.txt ibs=98765 skip=2 count=1 | afilter > out.2.txt
dd if=big.txt ibs=98765 skip=3 count=1 | afilter > out.3.txt
dd if=big.txt ibs=98765 skip=4 count=1 | afilter > out.4.txt
dd if=big.txt ibs=98765 skip=5 count=1 | afilter > out.5.txt
dd if=big.txt ibs=98765 skip=6 count=1 | afilter > out.6.txt
dd if=big.txt ibs=98765 skip=7 count=1 | afilter > out.7.txt
dd if=big.txt ibs=98765 skip=8 count=1 | afilter > out.8.txt
dd if=big.txt ibs=98765 skip=9 count=1 | afilter > out.9.txt

(n.b.'afilter' is a program which works fine)
I could hack it into some sort of loop, something like:

for i in `seq 1 10`;
do
dd if=big.txt ibs=98765 skip=$i count=1 | afilter > out.$i.txt
done

However I would prefer that it scanned big.txt once rather than 10
times and I would prefer that it was more generic, detecting the end of
the file, perhaps something like this:

set x=0

cat big.txt | \
while [ not eof ] ; do
dd ibs=98765 count=1 | afilter > out.$x.txt ;
$x++;
done

I don't think the 'while [ not eof ]' will work.
Is it possible to massage the above to work? Perhaps something other
than dd is more appropriate?

Thanks in advance.
Hal

Stephane CHAZELAS

2006-12-02, 7:30 am

2006-12-2, 03:55(-08), sillyhat@yahoo.com:
[...]
> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
> dd if=big.txt ibs=98765 skip=1 count=1 | afilter > out.1.txt
> dd if=big.txt ibs=98765 skip=2 count=1 | afilter > out.2.txt
> dd if=big.txt ibs=98765 skip=3 count=1 | afilter > out.3.txt
> dd if=big.txt ibs=98765 skip=4 count=1 | afilter > out.4.txt
> dd if=big.txt ibs=98765 skip=5 count=1 | afilter > out.5.txt
> dd if=big.txt ibs=98765 skip=6 count=1 | afilter > out.6.txt
> dd if=big.txt ibs=98765 skip=7 count=1 | afilter > out.7.txt
> dd if=big.txt ibs=98765 skip=8 count=1 | afilter > out.8.txt
> dd if=big.txt ibs=98765 skip=9 count=1 | afilter > out.9.txt
>
> (n.b.'afilter' is a program which works fine)
> I could hack it into some sort of loop, something like:
>
> for i in `seq 1 10`;
> do
> dd if=big.txt ibs=98765 skip=$i count=1 | afilter > out.$i.txt
> done


No need to open and skip every time.


{
repeat 10 dd ibs=98765 count=1 | afilter > out.$((i++)).txt
} < big.txt

(zsh syntax above)


>
> However I would prefer that it scanned big.txt once rather than 10
> times and I would prefer that it was more generic, detecting the end of
> the file, perhaps something like this:
>
> set x=0
>
> cat big.txt | \
> while [ not eof ] ; do
> dd ibs=98765 count=1 | afilter > out.$x.txt ;
> $x++;
> done


To detect end-of-file, you need to check dd stderr and verify
that the first 4 characters are "0+0 ".

If the file is a text file, you can use bash's read -n or zsh's
read -k (zsh can cope with binary files as well though).

while read -u3 -k98765; do
print -rn -- $REPLY | afilter > out.$((x++)).txt
done 3< big.txt
if (($#REPLY)); then
# deal with extra characters if necessary
fi

POSIXly you'd do:

TAB=`printf '\t'`
read_chunk() {
{
LC_ALL=C dd ibs="$1" count=1 2>&1 >&3 3>&- | {
IFS=" $TAB+" read -r a b rest || return 3
if [ "$a" -eq 0 ]; then
if [ "$b" -eq 0 ]; then
ret=1 # end-of-file
else
ret=2 # fewer than $1 bytes returned
fi
else
ret=0
fi
cat > /dev/null
return "$ret"
}
} 3>&1
}

You'd do

ret=$(
exec 4>&1
{
read_chunk 98765; echo "$?" >&4
} | afilter > "out.$x.txt" 4>&-
)

$ret would indicate if eof was reached, but afilter would have
been called in that case as well.


--
Stéphane
Dave Gibson

2006-12-02, 1:16 pm

sillyhat@yahoo.com <sillyhat@yahoo.com> wrote:
> Hello, can someone please help.
>
> I think I need a way of detecting stdin end of file with bash...
>
> I have a large file, big.txt, that I want to process in chunks
> *preferably* using bash. I want to take each chunk, process it and
> write the output to a file.
>
> The following works:-
>
> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
> dd if=big.txt ibs=98765 skip=1 count=1 | afilter > out.1.txt

[...]
> dd if=big.txt ibs=98765 skip=8 count=1 | afilter > out.8.txt
> dd if=big.txt ibs=98765 skip=9 count=1 | afilter > out.9.txt


split -b 98765 -d big.txt out.
for f in out* ; do afilter < "$f" > "${f}.txt" && rm "$f" ; done
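
Where GNU split is available, its --filter option may avoid the intermediate chunk files: each chunk is piped straight into a command instead of being written to disk. The following is only a sketch under that assumption. The wrapper name split_and_filter is hypothetical, and tr stands in for afilter, because the filter command is run by a separate shell and must be a real program rather than a shell function.

```shell
# Hypothetical wrapper: split_and_filter FILE SIZE PREFIX
# Assumes GNU split; --filter and -d are GNU extensions.
# tr is a stand-in for the thread's afilter program.
split_and_filter() {
    # split runs the filter once per chunk through the shell, with $FILE set
    # to the name the chunk would have had (PREFIX00, PREFIX01, ...).
    split -b "$2" -d \
        --filter='tr "[:lower:]" "[:upper:]" > "$FILE.txt"' \
        "$1" "$3"
}
```

Called as split_and_filter big.txt 98765 out. this should leave only the filtered files (out.00.txt, out.01.txt, ...) behind.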
sillyhat@yahoo.com

2006-12-02, 1:16 pm

That's very helpful :O)

Testing dd's stderr does seem to do the trick.
Now I need to cater for binary files as well.

Using some of your suggestions, here's what I have arrived at in bash -
which I have to use for now:-

i=0
rc=1

{
while true
do
dd ibs=98765 count=1 2> emsg.txt | afilter > out.$i.bin
grep "0+0 " emsg.txt >/dev/null
if [ $? = 0 ]
then
rm emsg.txt out.$i.bin
break
fi
((i++))
done
} < big.bin


It would be nice to be able to avoid using the emsg.txt file but I
couldn't work out how to redirect stderr to grep (and then get stdout
into afilter!).

Hal

sillyhat@yahoo.com

2006-12-02, 1:16 pm

The split command certainly works well if you have lots of space and
was the way I was working initially.

However, I am actually restricted on space on my main machine and want
to literally process the big file a chunk at a time, shipping that
chunk off to another machine as I go along.

My initial query should have mentioned this.

Apologies.

Hal

Stephane CHAZELAS

2006-12-02, 1:16 pm

2006-12-2, 07:37(-08), sillyhat@yahoo.com:
> That's very helpful :O)
>
> Testing dd's sterr does seem to do the trick.
> Now I need to cater for binary files as well.
>
> Using some of your suggestions, here's what I have arrived at in bash -
> which I have to use for now:-
>
> i=0
> rc=1
>
> {
> while true
> do
> dd ibs=98765 count=1 2> emsg.txt | afilter > out.$i.bin
> grep "0+0 " emsg.txt >/dev/null
> if [ $? = 0 ]
> then
> rm emsg.txt out.$i.bin
> break
> fi
> ((i++))
> done
> } < big.bin
>
>
> It would be nice to be able to avoid using the emsg.txt file but I
> couldn't work out how to redirect stderr to grep (and then get stdout
> into afilter!).

[...]

The read_chunk() function I posted did just that.

You can do:

dd_stderr=$(
{
LC_ALL=C dd ibs=98765 count=1 2>&3 | afilter > "out.$i.bin" 3>&-
} 3>&1
)
case $dd_stderr in
"0+0 "*)
break;;
esac

To be POSIX compliant, you should also take into account dd
implementations that would output " 0 + 0\t" instead of "0+0 ", for
instance, hence my use of IFS=" $TAB+" read -r a b rest
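
Placed inside a loop, the stderr-capturing fragment might look like the sketch below. It is a minimal illustration, not a tested production script. The function name chunk_file is hypothetical, tr stands in for afilter, and it assumes a dd whose first stderr line has the plain "N+M records in" form (it does not handle the whitespace variants mentioned above). Anything other than a "1+0" or "0+1" report ends the loop, so an unexpected dd error cannot make it spin forever.

```shell
# Hypothetical stand-in for the thread's afilter program.
afilter() { tr '[:lower:]' '[:upper:]'; }

# chunk_file FILE SIZE PREFIX (hypothetical helper name)
chunk_file() {
    i=0
    while :; do
        # fd 3 carries dd's stderr into the command substitution, while
        # dd's stdout still flows through the pipe into afilter.
        dd_stderr=$(
            {
                LC_ALL=C dd ibs="$2" count=1 2>&3 |
                    afilter > "$3.$i.bin" 3>&-
            } 3>&1
        )
        case $dd_stderr in
            "1+0 "* | "0+1 "*)
                i=$((i + 1)) ;;       # a full or short chunk was read
            *)
                rm -f "$3.$i.bin"     # "0+0 " (end of file) or unexpected output
                break ;;
        esac
    done < "$1"
}
```

For example, chunk_file big.bin 98765 out would write out.0.bin, out.1.bin and so on, reading big.bin only once.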

--
Stéphane
Stephane CHAZELAS

2006-12-02, 1:16 pm

2006-12-2, 07:37(-08), sillyhat@yahoo.com:
[...]
> {
> while true
> do
> dd ibs=98765 count=1 2> emsg.txt | afilter > out.$i.bin
> grep "0+0 " emsg.txt >/dev/null
> if [ $? = 0 ]
> then
> rm emsg.txt out.$i.bin
> break
> fi
> ((i++))
> done
> } < big.bin
>
>
> It would be nice to be able to avoid using the emsg.txt file but I
> couldn't work out how to redirect stderr to grep (and then get stdout
> into afilter!).

[...]

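It may prove easier to use perl (the example below uses 3-byte records,
$/=\3; use \98765 for the chunk size discussed in this thread):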

perl -pe 'BEGIN{$/=\3}{open STDOUT, "|afilter > out.$..txt"}
' < big.bin


--
Stéphane
sillyhat@yahoo.com

2006-12-02, 1:16 pm

OK, my i/o redirection and bash skills in general are being stretched a
bit!

Posix and my afilter command aside, is the following correct/safe?

i=0
rc=1

{
while true
do
{
dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
} 3>&1 | grep "0+0 " >/dev/null
if [ $? = 0 ]
then
rm out.$i.bin
break
fi
((i++))
done
} < big.bin

Is it possible to get rid of the if command with something like:
...
{
dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
} 3>&1 | grep "0+0 " >/dev/null && (
rm out.$i.bin
break
)
...

I need to do some swotting!

Stephane CHAZELAS

2006-12-02, 1:16 pm

2006-12-2, 10:07(-08), sillyhat@yahoo.com:
> OK, my i/o redirection and bash skills in general are being stretched a
> bit!
>
> Posix and my afilter command aside, is the following correct/safe?
>
> i=0
> rc=1
>
> {
> while true
> do
> {
> dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
> } 3>&1 | grep "0+0 " >/dev/null


dd ibs=98765 count=1 2>&1 > "out.$i.bin" | grep -q '0+0 '

would have been enough. But with afilter:

{
dd ibs=98765 count=1 2>&3 | afilter > "out.$i.bin" 3>&-
} 3>&1 | grep -q '0+0 '


> if [ $? = 0 ]


[ $? = 0 ] (or more correctly [ "$?" -eq 0 ]) is a no-op.

The "if" construct in shells uses the command's exit status.

if dd ibs=98765 count=1 2>&1 > "out.$i.bin" | grep -q '0+0 '
then rm ...
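
Spelled out as a complete loop (without afilter), testing the pipeline's exit status directly could look like this sketch. The function name split_raw is hypothetical, and the '^0+0 ' pattern assumes dd reports "0+0 records in" at the start of a line when it reads nothing.

```shell
# split_raw FILE SIZE PREFIX (hypothetical helper name)
split_raw() {
    i=0
    while :; do
        # 2>&1 first points stderr at the pipe; > then sends the data to the file.
        if LC_ALL=C dd ibs="$2" count=1 2>&1 > "$3.$i.bin" | grep -q '^0+0 '
        then
            rm -f "$3.$i.bin"    # dd read nothing: end of file
            break
        fi
        i=$((i + 1))
    done < "$1"
}
```

The exit status of the pipeline is that of grep, so no separate test of $? is needed.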

> then
> rm out.$i.bin
> break
> fi
> ((i++))


i=$(($i + 1))

is the portable equivalent.

> done
> } < big.bin
>
> Is it possible to get rid of the if command with something like:
> ...
> {
> dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
> } 3>&1 | grep "0+0 " >/dev/null && (
> rm out.$i.bin
> break
> )


A break inside a ( ... ) subshell cannot leave the enclosing loop, but
you can use a brace group, which runs in the current shell:

dd ibs=98765 count=1 2>&1 > "out.$i.bin" |
grep -q '0+0 ' && {
rm "out.$i.bin"
break
}

> ...
>
> I need to do some swotting!
>


Please note that variable expansions must be quoted: "$var" instead of
$var.

--
Stéphane
William Park

2006-12-03, 7:20 pm

sillyhat@yahoo.com wrote:
> Hello, can someone please help.
>
> I think I need a way of detecting stdin end of file with bash...
>
> I have a large file, big.txt, that I want to process in chunks
> *preferably* using bash. I want to take each chunk, process it and
> write the output to a file.
>
> The following works:-
>
> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
> ...


In general, it's difficult to process arbitrary binary files in shell.
I'm not sure whether lseek(2) and read(2) are available as user commands.

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Stephane CHAZELAS

2006-12-03, 7:20 pm

2006-12-03, 15:52(-05), William Park:
> sillyhat@yahoo.com wrote:
>
> In general, it's difficult to process arbitrary binary file in shell.
> I'm not sure whether lseek(2) and read(2) are available as user command.


Yes, that's dd. Some dd implementations even provide
ftruncate(2).

exec 3<> some-file

dd count=0 bs=1 skip=1234 <&3

But only zsh can cope with binary files as all the other shells
can't cope with the NUL character (you may be able to work
around it using intermediary text formats such as uuencode's
though).
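
As an illustration of the intermediary-text-format idea, the sketch below keeps each chunk in a shell variable as a hex string, so NUL bytes never reach the shell. It swaps in od (a POSIX utility) for uuencode. The function names next_chunk_hex and dump_chunks are hypothetical. An empty hex string is taken to mean end of file.

```shell
# next_chunk_hex SIZE: read up to SIZE bytes from stdin and print them as
# one hex string (two digits per byte), so NUL bytes survive.
next_chunk_hex() {
    LC_ALL=C dd ibs="$1" count=1 2>/dev/null |
        od -An -v -tx1 | tr -d '[:space:]'
}

# dump_chunks FILE SIZE: print one hex line per chunk of FILE.
dump_chunks() {
    # An empty string means dd read nothing, i.e. end of file.
    while hex=$(next_chunk_hex "$2"); [ -n "$hex" ]; do
        printf '%s\n' "$hex"
    done < "$1"
}
```

Turning the hex string back into bytes is left out here. This only shows that the chunk boundaries and NUL bytes can be tracked in plain shell.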

zsh has a sysread builtin, and the mapfile associative array to
access the content of a file as you would of a variable (uses
mmap internally).

--
Stéphane
Stephane CHAZELAS

2006-12-03, 7:20 pm

2006-12-3, 21:07(+00), Stephane CHAZELAS:
> 2006-12-03, 15:52(-05), William Park:
>
> Yes, that's dd. Some dd implementations even provide with
> ftruncate(2).
>
> exec 3<> some-file
>
> dd count=0 bs=1 skip=1234 <&3

[...]

But dd doesn't provide any way to seek backward. You need to
reopen the file to start from 0.

See skip/iseek to seek on input (and read), and seek to seek on
output and write (and trunc or not depending on whether
conv=notrunc is given or not).
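
A small sketch of the seek-then-read idea: two dd invocations share one open file, the first only moving the offset and the second reading from there. The helper name read_at is hypothetical, and the sketch assumes a dd that honours skip= even when count=0, as described above.

```shell
# read_at FILE OFFSET LENGTH (hypothetical helper): print up to LENGTH bytes
# starting at byte OFFSET of FILE.
read_at() {
    {
        # count=0 copies nothing; skip= only advances the shared file offset.
        dd bs=1 skip="$2" count=0 2>/dev/null
        # This dd inherits the same open file, so it starts reading at OFFSET.
        dd ibs="$3" count=1 2>/dev/null
    } < "$1"
}
```

For instance, read_at big.bin 1234 100 would print the 100 bytes that follow the first 1234.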

--
Stéphane
William Park

2006-12-03, 7:20 pm

Stephane CHAZELAS <this.address@is.invalid> wrote:
> zsh has a sysread builtin, and the mapfile associative array to
> access the content of a file as you would of a variable (uses
> mmap internally).


Does the file grow and shrink, as you manipulate the variable? That is,
var=abc
var=qwerty
what happens to the file?

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Stephane CHAZELAS

2006-12-03, 7:20 pm

2006-12-03, 17:41(-05), William Park:
> Stephane CHAZELAS <this.address@is.invalid> wrote:
>
> Does the file grow and shrink, as you manipulate the variable? That is,
> var=abc
> var=qwerty
> what happens to the file?


What you expect should happen:

~$ ls -ld a
ls: a: No such file or directory
~$ zmodload zsh/mapfile
~$ mapfile[a]=foo
~$ ls -ld a
-rw-r--r-- 1 chazelas chazelas 3 Dec 3 22:53 a
~$ mapfile[a]=foobar
~$ ls -ld a
-rw-r--r-- 1 chazelas chazelas 6 Dec 3 22:53 a
~$ mapfile[a]=$'bar\n'
~$ ls -ld a
-rw-r--r-- 1 chazelas chazelas 4 Dec 3 22:54 a

Though you can get the 12th to 23rd bytes of "a" with print -rn
-- ${mapfile[a][12,23]}, I'm not sure how to assign something to
a byte range. That would be a flaw in zsh if you couldn't as you
can do:

scalar[12,23]=text

but can't seem to be able to do:

associative_array[key][12,23]=text

nor

array[3][12,23]=text

I'll ask on the zsh mailing list.

--
Stéphane
William Park

2006-12-03, 7:20 pm

Stephane CHAZELAS <this.address@is.invalid> wrote:
> 2006-12-03, 17:41(-05), William Park:
>
> What you expect should happen:
>
> ~$ ls -ld a
> ls: a: No such file or directory
> ~$ zmodload zsh/mapfile
> ~$ mapfile[a]=foo
> ~$ ls -ld a
> -rw-r--r-- 1 chazelas chazelas 3 Dec 3 22:53 a
> ~$ mapfile[a]=foobar
> ~$ ls -ld a
> -rw-r--r-- 1 chazelas chazelas 6 Dec 3 22:53 a
> ~$ mapfile[a]=$'bar\n'
> ~$ ls -ld a
> -rw-r--r-- 1 chazelas chazelas 4 Dec 3 22:54 a


Interesting. I recently added a 'vfile' command to read/write a file of the
same name as a variable (code not yet submitted).
http://home.eol.ca/~parkw/index.html#vfile
Not as automatic as Zsh's mapfile, but still can survive logout/reboot.

Thanks for the reference. I'll look into what Zsh does.

The motivation is to manipulate a table of data, where each field is a
file, and each row is a directory. So, to get the fields in row 1,
cd row1
vfile -r a b c ...
or
vfile -r -d row1 a b c ...

My main problem is array variable. At the moment, it's handled like
vfile -[rw] a b 'c[0]' 'c[1]'
Generating such element list is another painful scripting exercise,
which should be eliminated.

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Kaz Kylheku

2006-12-04, 1:33 am

On Dec 2, 7:56 am, silly...@yahoo.com wrote:
> The split command certainly works well if you have lots of space and
> was the way I was working initially.
>
> However, I am actually restricted on space on my main machine and want
> to literally process the big file a chunk at a time, shipping that
> chunk off to another machine as I go along.


If you had mentioned that at the outset, it would have been obvious
that what you are asking for is silly.

Since you don't actually need to retain the pieces, the pieces can
actually be stored temporarily in memory only.

Q: What is a piece of memory called which stores successive pieces of a
file as it is being processed?
A: A buffer.

That is to say, what is the difference between buffered I/O and reading
chunks of a file, processing them, and shipping them off?

It sounds like the ``afilter'' program in the example that you gave:

> for i in `seq 1 10`;
> do
> dd if=big.txt ibs=98765 skip=$i count=1 | afilter > out.$i.txt
> done


is broken. Otherwise you could just do this:

afilter < big.txt | ship-off

where ship-off is some command that stores the results on the other
machine, for instance, using Secure Shell:

afilter < big.txt | ssh me@other-machine cat \> out-big.txt

Fix the problem in afilter, if you can.

Kaz Kylheku

2006-12-04, 1:33 am

On Dec 2, 3:55 am, silly...@yahoo.com wrote:
> (n.b.'afilter' is a program which works fine)


ROFL.

Copyright 2003 - 2008 webservertalk.com