Unix Shell - sed performance

flub

2007-08-21, 7:22 am

Hi

Suppose you need to do something like:

sed -e 's/^some_string/some_other_string/'

Where, say, 1 in 3 lines in the file match the pattern. Is it worth
doing this instead?

sed -e '/^some_/s/^some_string/some_other_string/'

You can assume that none of the unaffected lines will match "^some_"
here. But I just can't decide which to use: the first seems slightly
cleaner and easier to read, while the second is a little more
specific. Will there be any performance gain from using the second?
Are there any other arguments for one of the forms?

Cheers
Floris

Stachu 'Dozzie' K.

2007-08-21, 7:22 am

On 21.08.2007, flub <floris.bruynooghe@gmail.com> wrote:
> Suppose you need to do something like:
>
> sed -e 's/^some_string/some_other_string/'
>
> Where say 1 in 3 lines in the file match the pattern. Is it worth to
> do this?
>
> sed -e '/^some_/s/^some_string/some_other_string/'
>
> You can suppose that none of the lines that are not affected will
> match "^some_" here. But I just can't decide which to use, the first
> seems slightly cleaner and easier to read while the other is a little
> more specific. Will there be any performance gain using the second?


How big is the file you want to operate on? If it's 3 lines long, then
you won't notice any performance gain. It's even possible that the
second case will be slower (by a few CPU cycles), since the /^some_/
address will also have to be compiled into a finite automaton. I think
the first is fast enough, simpler and more convenient.

> Any other arguments for one of the forms?


--
Secunia non olet.
Stanislaw Klekot

Floris Bruynooghe

2007-08-21, 7:22 am

On Aug 21, 10:56 am, "Stachu 'Dozzie' K."
<doz...@dynamit.im.pwr.wroc.pl.nospam> wrote:
> On 21.08.2007, flub <floris.bruynoo...@gmail.com> wrote:
>
> How big is the file you want to operate on? If it's 3 lines long, then
> you won't notice any performance gain. It's even possible that the
> second case will be slower (by few CPU cycles) since the /^some_/ will
> have to be converted to finite automata. I think the first is fast
> enough, simpler and more convenient.


Well, I was thinking of a few thousand lines. It seems predictable
that the second will be slower for only a few lines. I tend to go for
the first too, from a readability point of view.

Regards
Floris

Stachu 'Dozzie' K.

2007-08-21, 7:22 am

On 21.08.2007, Floris Bruynooghe <floris.bruynooghe@gmail.com> wrote:
> On Aug 21, 10:56 am, "Stachu 'Dozzie' K."
> <doz...@dynamit.im.pwr.wroc.pl.nospam> wrote:
>
> Well, I was thinking of a few thousand lines.


That is, a file of ~4kB. No performance gain at all. Any difference
will be swallowed by the cost of the fork()+exec() system calls needed
to start sed (calling external utilities is quite expensive).

> Seems predictable that
> the second will be slower for only a few lines. I tend to go for the
> first too, from a readability point of view.


--
Secunia non olet.
Stanislaw Klekot

Barry Margolin

2007-08-22, 1:22 am

In article <1187689476.154544.149790@k79g2000hse.googlegroups.com>,
flub <floris.bruynooghe@gmail.com> wrote:

> Hi
>
> Suppose you need to do something like:
>
> sed -e 's/^some_string/some_other_string/'
>
> Where say 1 in 3 lines in the file match the pattern. Is it worth to
> do this?
>
> sed -e '/^some_/s/^some_string/some_other_string/'
>
> You can suppose that none of the lines that are not affected will
> match "^some_" here. But I just can't decide which to use, the first
> seems slightly cleaner and easier to read while the other is a little
> more specific. Will there be any performance gain using the second?
> Any other arguments for one of the forms?


Why guess -- time the two methods and see if there's any difference.
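
A minimal sketch of such a timing run, not a rigorous benchmark. The
sample file (3000 lines, every third one matching), the function names
and the run count are assumptions made up for illustration, and
date +%s only resolves whole seconds, so raise the run count if both
variants report 0s.

```shell
#!/bin/sh
# Hypothetical test data: 3000 lines, every third line starts with "some_string".
make_sample() {
    awk 'BEGIN {
        for (i = 1; i <= 3000; i++) {
            prefix = (i % 3 == 0) ? "some_string" : "other_text"
            print prefix " " i
        }
    }' > "$1"
}

# The two variants from the original question.
plain()    { sed -e 's/^some_string/some_other_string/' "$1"; }
filtered() { sed -e '/^some_/s/^some_string/some_other_string/' "$1"; }

# Run one variant repeatedly and report the elapsed wall-clock seconds.
bench() {
    variant=$1
    file=$2
    runs=$3
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$variant" "$file" > /dev/null
        i=$((i + 1))
    done
    end=$(date +%s)
    echo "$variant: $((end - start))s for $runs runs"
}

sample=$(mktemp)
trap 'rm -f "$sample"' EXIT   # remove the temporary file when the script exits
make_sample "$sample"

bench plain "$sample" 100
bench filtered "$sample" 100
```

Each run starts a new sed process, so on a file this small the totals
mostly measure process startup rather than regex matching.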

--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***

bsh

2007-08-23, 7:21 am

"Floris 'flub' Bruynooghe" <floris.bruynoo...@gmail.com> wrote:
> sed -e 's/^some_string/some_other_string/'
> or:
> sed -e '/^some_/s/^some_string/some_other_string/'
> ...


In this particular example the answer is determinate: the first
example will always be faster (and more efficient) than the second.
Since the address RE of the second (a plain string) is a proper prefix
of its substitution RE, the address rejects no line that the
substitution would not reject anyway, so there can never be any reason
to prefer the second version to the first.

The first example can be thought of as being transformed into the
construct:

/^some_string/s//some_other_string/

where the empty RE of the substitution stands for the last RE used. If
a non-empty "s" RE is specified (even if identical), it is just an
additional construct to parse and execute. The only reason one would
use the address-plus-substitution form is to pre-filter the input
buffer before applying a distinctly _different_ matching RE to it.
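
A small sketch showing that the three forms discussed in this thread
produce the same output. The three input lines are made-up sample data;
the empty-RE behaviour ("reuse the last RE") is standard sed.

```shell
#!/bin/sh
# Hypothetical sample input: two matching lines and one that does not match.
input='some_string one
another line
some_string two'

# 1. Plain substitution.
a=$(printf '%s\n' "$input" | sed -e 's/^some_string/some_other_string/')
# 2. Address pre-filter followed by the full substitution.
b=$(printf '%s\n' "$input" | sed -e '/^some_/s/^some_string/some_other_string/')
# 3. Address with an empty substitution RE, which reuses the address RE.
c=$(printf '%s\n' "$input" | sed -e '/^some_string/s//some_other_string/')

if [ "$a" = "$b" ] && [ "$b" = "$c" ]; then
    echo "all three forms agree"
fi
printf '%s\n' "$c"
```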

BTW, sed(1) is (well, I'll say it) _awesomely_ fast considering the
complexity of REs -- and considering that sed(1)'s implementation of
conventional NFAs has an innate susceptibility to some problematic
corner cases. Greg Ubben wrote a faithful emulator of dc(1) (an
arbitrary-precision RPN calculator) in distribution sed(1) which
outperforms early releases of GNU sed(1). Amazing!

http://sed.sf.net/local/scripts/dc.sed

As with n/awk(1), the pre-execution overhead of parsing the script and
constructing the NFAs can be nontrivial, while the execution of the
actual sed(1) commands can be negligible; therefore, for relatively
small datasets (<<1 MB?) the difference in total time may be trivial.
I have seen this behavior in sed(1) scripts I have written that are
many thousands of lines of code.

=Brian, author of the first sed(1) debugger ever written, that _nobody_
has ever indicated having used:
http://sed.sourceforge.net/local/debug/sd.sh.txt


Copyright 2003 - 2008 webservertalk.com