A more efficient approach to sed
| null_device@ssl-mail.com 2004-03-26, 11:47 am |
| Hi all,
I am running qmail/ezmlm on RH9 with many mailing lists.
I have this script <appendage>, that I knocked up to apply some
exclusion logic to some mail lists, no problem.
Now, I have just had a few large-o-rama lists delivered to my desk
to be incorporated into this system.
Problem is, my script takes *too* long, to go through these lists.
For now, I am just running this script on various boxes in the background.
However, I want to discover a more elegant way of doing this.
Could anyone make any suggestions as to a more efficient way to use
sed/grep?
I have tried adding in the $match string at the start for example:
sed -e "/$match/s/^$match//"
but this only seems to make a nominal amount of difference.
Even without the grep part of the script there's not much difference.
What about Perl?
Are my ambitions realistic, or is this about the best performance possible?
Thanks in advance.
Nick
##############################################
#!/bin/bash
#exclusion=
#list=
echo; ls
echo
echo -en "What is the file to modify? "; read list
echo
echo -en "What is the exclusion file? "; read exclusion
echo
[ -f replace_match.log ] && mv -f replace_match.log replace_match.log.old
for match in `cat $exclusion`; do
grep "$match" "$list" 1>/dev/null && echo -e "MATCH ---------> $match"
cat $list | sed -e "/$match/s/^$match//" > $list.mod
mv $list.mod $list
echo "$match"
done
####################################################
<Quote>
Just look at Apple's success with its iTunes/ iPod platform to see
how damaging Gates' digital rights management strategy is to his
credibility. Instead of protecting users, he's protecting applications,
intellectual property and business models. </Quote>
| |
| Chris F.A. Johnson 2004-03-26, 11:47 am |
| On Fri, 26 Mar 2004 at 05:12 GMT, null_device@ssl-mail.com wrote:
> Hi all,
>
> I am running qmail/ezmlm on RH9 with many mailing lists.
> I have this script <appendage>, that I knocked up to apply some
> exclusion logic to some mail lists, no problem.
>
> Now, I have just had a few large-o-rama lists delivered to my desk
> to be incorporated into this system.
>
> Problem is, my script takes *too* long, to go through these lists.
> For now, I am just running this script on various boxes in the background.
>
> However, I want to discover a more elegant way of doing this.
>
> Could anyone make any suggestions as to a more efficient way to use
> sed/grep?
>
> I have tried adding in the $match string at the start for example:
> sed -e "/$match/s/^$match//"
>
> but this only seems to make a nominal amount of difference.
>
> Even without the grep part of the script there's not much difference.
>
> What about Perl?
>
> Are my ambitions realistic, or is this about the best performance possible.
Without knowing exactly what you are doing (i.e., what is the
format of the files?), it's hard to give a specific answer.
However, the reason your script is so slow (even without the two
unnecessary cats) is that it is processing the file over and over
again, once for each word in the exclusion file.
> ##############################################
> #!/bin/bash
>
> #exclusion=
> #list=
>
> echo; ls
>
> echo
> echo -en "What is the file to modify? "; read list
> echo
> echo -en "What is the exclusion file? "; read exclusion
> echo
>
> [ -f replace_match.log ] && mv -f replace_match.log replace_match.log.old
>
>
> for match in `cat $exclusion`; do
This is rarely the best way; use "while read; do ...;done < $exclusion".
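A minimal sketch of that loop form, assuming one pattern per line. The temporary directory, file name and sample addresses are made up for illustration. Unlike the backtick form, each line arrives intact: it is not split on spaces and not glob-expanded.

```shell
#!/bin/bash
# Hypothetical sample data, for illustration only.
workdir=$(mktemp -d)
exclusion="$workdir/exclusion.txt"
printf '%s\n' 'bob@example.com' 'spam *@example.net' > "$exclusion"

# Read the exclusion file line by line. IFS= and -r keep each line
# exactly as written (no trimming, no backslash processing).
count=0
while IFS= read -r match; do
    count=$((count + 1))
    printf 'pattern %d: %s\n' "$count" "$match"
done < "$exclusion"
```

This only changes how the patterns are read. Running sed on the whole list inside the loop still reprocesses the list once per pattern.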
> grep "$match" "$list" 1>/dev/null && echo -e "MATCH ---------> $match" \
>
> cat $list | sed -e "/$match/s/^$match//" > $list.mod
There is no need for cat:
sed -e "/$match/s/^$match//" "$list" > "$list.mod"
> mv $list.mod $list
>
> echo "$match"
> done
>
>
> ####################################################
The best-case scenario would be if there is a line-to-line
correspondence between the files; then you could replace the loop
with:
grep -v -f "$exclusion" "$list" > "$list.mod"
If you want to list the deleted lines (comm needs both files sorted):
comm -23 "$list" "$list.mod"
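A sketch of those two commands on made-up data. It assumes the intent is to drop the matching lines (hence grep -v), and it sorts the list first because comm expects sorted input. The file names and addresses are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
list="$workdir/list.txt"
exclusion="$workdir/exclusion.txt"

# Hypothetical sample data: one address per line.
printf '%s\n' 'carol@example.com' 'alice@example.com' 'bob@example.com' > "$list"
printf '%s\n' 'bob@example.com' > "$exclusion"

# comm needs sorted input, so sort the list once up front.
LC_ALL=C sort -o "$list" "$list"

# Keep only the lines that match none of the exclusion patterns.
grep -v -f "$exclusion" "$list" > "$list.mod"

# Lines present in the original list but missing from the filtered copy.
removed=$(LC_ALL=C comm -23 "$list" "$list.mod")
printf 'removed: %s\n' "$removed"
```

Both files are read once here, however many patterns the exclusion file holds.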
--
Chris F.A. Johnson http://cfaj.freeshell.org/shell
===================================================================
My code (if any) in this post is copyright 2004, Chris F.A. Johnson
and may be copied under the terms of the GNU General Public License
| |
| Michael Tosch 2004-03-26, 11:47 am |
| In article <c40o1u$2cl3iq$1@ID-210011.news.uni-berlin.de>, "Chris F.A. Johnson" <c.fa.johnson@rogers.com> writes:
> On Fri, 26 Mar 2004 at 05:12 GMT, null_device@ssl-mail.com wrote:
>
> Without knowing exactly what you are doing (i.e., what is the
> format of the files?), it's hard to give a specific answer.
>
> However, the reason your script is so slow (even without the two
> unnecessary cats) is that it is processing the file over and over
> again, once for each word in the exclusion file.
>
>
> This is rarely the best way; use "while read; do ...;done < $exclusion".
>
>
> There is no need for cat:
>
> sed -e "/$match/s/^$match//" "$list" > "$list.mod"
.... and no need for /$match/ because /^$match/ matches a subset:
sed "s/^$match//" "$list" > "$list.mod"
>
>
> The best-case scenario would be if there is a line-to-line
> correspondence between the files; then you could replace the loop
> with:
>
> grep -v -f "$exclusion" "$list" > "$list.mod"
>
> If you want to list the deleted lines:
>
> comm -23 "$list" "$list.mod"
>
>
> --
> Chris F.A. Johnson http://cfaj.freeshell.org/shell
> ===================================================================
> My code (if any) in this post is copyright 2004, Chris F.A. Johnson
> and may be copied under the terms of the GNU General Public License
--
Michael Tosch
IT Specialist
HP Managed Services Germany
Phone +49 2407 575 313
Mail: michael.tosch@hp.com
| |
| Walter Briscoe 2004-03-26, 11:47 am |
| In message <tae760l5l9ojpkc7j4cili6os6h3fvjpb1@4ax.com> of Fri, 26 Mar
2004 16:12:41 in comp.unix.shell, null_device@ssl-mail.com writes
>Hi all,
>
>I am running qmail/ezmlm on RH9 with many mailing lists.
>I have this script <appendage>, that I knocked up to apply some
>exclusion logic to some mail lists, no problem.
>
>Now, I have just had a few large-o-rama lists delivered to my desk
>to be incorporated into this system.
>
>Problem is, my script takes *too* long, to go through these lists.
>For now, I am just running this script on various boxes in the background.
>
>However, I want to discover a more elegant way of doing this.
>
>Could anyone make any suggestions as to a more efficient way to use
>sed/grep?
[snip]
>cat $list | sed -e "/$match/s/^$match//" > $list.mod
This is a UUOC - Unnecessary Use Of Cat. The following is equivalent
but uses fewer resources - the difference is aesthetic and any difference
in execution time is likely to be impossible to measure.
< $list sed -e "/$match/s/^$match//" > $list.mod
You can also do:
sed -e "/$match/s/^$match//" $list > $list.mod
Your problem is that you are reading the data many times and starting
many processes in doing so. You might try:
grep -v -f $exclusion $list > $list.mod
If that does more than you need, have your script use sed on $list with
a script created by using sed on $exclusion. i.e. something like:
sed -e 's:/:\\/:g;s:.*:/^&/s/^&//:' $exclusion > exclusion.sed
sed -f exclusion.sed $list > $list.mod
It is possible to combine the effects of both sed calls into a single
line. I think the increase in complexity is not worthwhile.
Note that I allow the possibility of slash (/) in each match.
Your sed may not be able to handle more than a fixed number of commands.
You might also try: sed -e "/^$match/s///" $list > $list.mod
An empty regular expression reuses the last regular expression applied
(POSIX specifies this), so it is equivalent
to: sed -e "/^$match/s/^$match//" $list > $list.mod
This has been seen to be slower: sed -e "s/^$match//" $list > $list.mod
Please report your experience. Ideally, quote elapsed times for the same
task with different approaches.
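One way such a comparison could be set up, as a rough sketch only. The data is generated (2000 made-up addresses, every tenth one excluded), the function names are hypothetical, and the single-pass variant uses the plain s/^pattern// form rather than the addressed form above.

```shell
#!/bin/bash
workdir=$(mktemp -d)
list="$workdir/list.txt"
exclusion="$workdir/exclusion.txt"

# Hypothetical test data: 2000 addresses, every tenth one excluded.
for i in $(seq 1 2000); do echo "user$i@example.com"; done > "$list"
for i in $(seq 10 10 2000); do echo "user$i@example.com"; done > "$exclusion"

# Approach 1: one sed process per pattern, as in the original loop.
loop_version() {
    cp "$list" "$workdir/loop.out"
    while IFS= read -r match; do
        sed -e "s/^$match//" "$workdir/loop.out" > "$workdir/loop.tmp"
        mv "$workdir/loop.tmp" "$workdir/loop.out"
    done < "$exclusion"
}

# Approach 2: generate one sed script, then make a single pass.
single_pass_version() {
    sed -e 's:/:\\/:g;s:.*:s/^&//:' "$exclusion" > "$workdir/exclusion.sed"
    sed -f "$workdir/exclusion.sed" "$list" > "$workdir/single.out"
}

# bash's time keyword reports elapsed time on stderr.
time loop_version
time single_pass_version

# The timings only mean something if both approaches agree.
cmp -s "$workdir/loop.out" "$workdir/single.out" && echo "outputs identical"
```

The absolute numbers will depend on the machine and the sed in use; the point is only to compare the two on the same input.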
--
Walter Briscoe
| |
| rakesh sharma 2004-03-26, 11:47 am |
| null_device@ssl-mail.com wrote:
>
> I am running qmail/ezmlm on RH9 with many mailing lists.
> I have this script <appendage>, that I knocked up to apply some
> exclusion logic to some mail lists, no problem.
>
> Now, I have just had a few large-o-rama lists delivered to my desk
> to be incorporated into this system.
>
> Problem is, my script takes *too* long, to go through these lists.
> For now, I am just running this script on various boxes in the background.
>
> However, I want to discover a more elegant way of doing this.
>
> Could anyone make any suggestions as to a more efficient way to use
> sed/grep?
>
> I have tried adding in the $match string at the start for example:
> sed -e "/$match/s/^$match//"
>
> but this only seems to make a nominal amount of difference.
>
> Even without the grep part of the script there's not much difference.
>
> ##############################################
> #!/bin/bash
>
> #exclusion=
> #list=
>
> echo; ls
>
> echo
> echo -en "What is the file to modify? "; read list
> echo
> echo -en "What is the exclusion file? "; read exclusion
> echo
>
> [ -f replace_match.log ] && mv -f replace_match.log replace_match.log.old
>
>
> for match in `cat $exclusion`; do
>
> grep "$match" "$list" 1>/dev/null && echo -e "MATCH ---------> $match" \
>
> cat $list | sed -e "/$match/s/^$match//" > $list.mod
> mv $list.mod $list
>
> echo "$match"
> done
>
Without looking at the kinds of inputs (files, exclusion lists, etc.),
it's hard to say what would work the best; but a start can be:
i) from the exclusion file, create a sed commands file on the fly.
ii) operate the sed commands file on the list file.
# create a commands file (\: opens an address with a custom delimiter)
sed -e 's/.*/\\:^&:s:::;t/' "$exclusion" > commands_file
# use the commands file generated above to run on list file
sed -f commands_file "$list" > "${list}.mod"
mv -f -- "${list}.mod" "$list"
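A variation on the commands-file idea, sketched under one extra assumption: the addresses should match literally, so regex metacharacters such as the dot are backslash-escaped before the commands are generated. The sample data and paths are hypothetical. Each generated line has the form \:^ADDR:s:::;t, where \: opens an address with ":" as the delimiter.

```shell
#!/bin/bash
workdir=$(mktemp -d)
list="$workdir/list.txt"
exclusion="$workdir/exclusion.txt"
commands_file="$workdir/commands.sed"

# Hypothetical sample data.
printf '%s\n' 'bob@example.com' > "$exclusion"
printf '%s\n' 'bob@example.com' 'bob@exampleXcom' > "$list"

# First expression: backslash-escape characters that are special in a
# basic regex, plus the ':' delimiter, so each address matches only itself.
# Second expression: wrap each escaped address as  \:^ADDR:s:::;t
sed -e 's|[][\.*^$:]|\\&|g' -e 's/.*/\\:^&:s:::;t/' "$exclusion" > "$commands_file"

# One pass over the list; 't' skips the remaining commands after a hit.
sed -f "$commands_file" "$list" > "$list.mod"
cat "$list.mod"
```

Without the escaping step, the unescaped dot would let the pattern also strip bob@exampleXcom.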
> What about Perl?
>
> Are my ambitions realistic, or is this about the best performance possible
Perl definitely would be the way to go for such
large-file manipulation scenarios.
| |
| Ed Morton 2004-03-26, 11:47 am |
|
null_device@ssl-mail.com wrote:
> Hi all,
>
> I am running qmail/ezmlm on RH9 with many mailing lists.
> I have this script <appendage>, that I knocked up to apply some
> exclusion logic to some mail lists, no problem.
You already got a bunch of good answers, but in case it helps, here's
one trivial way you could do it in awk IF the pattern you're matching on
is actually the first space-separated field of the "list" file and the
exclusions are 1-per-line:
awk 'NR == FNR { excl[$1] = ""; next }
$1 in excl { print "MATCH " $1 > "replace_match.log"; next }
{ print }' "$exclusion" "$list"
If the fields aren't space-separated, just use -F to specify the separator.
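For instance, a sketch with a comma as the separator. The comma-separated layout, the logfile variable and the sample data are all assumptions made for illustration; the log path is passed in as an awk variable instead of being hard-coded.

```shell
#!/bin/bash
workdir=$(mktemp -d)
list="$workdir/list.txt"
exclusion="$workdir/exclusion.txt"

# Hypothetical sample data: address first, then a comma and a name.
printf '%s\n' 'alice@example.com,Alice' 'bob@example.com,Bob' > "$list"
printf '%s\n' 'bob@example.com' > "$exclusion"

# First file: remember every excluded address.
# Second file: log and drop lines whose first field is excluded.
awk -F, 'NR == FNR { excl[$1]; next }
         $1 in excl { print "MATCH", $1 > logfile; next }
         { print }' logfile="$workdir/replace_match.log" "$exclusion" "$list" > "$list.mod"

cat "$list.mod"
```

Note that this drops the whole matching line, and the lookup is an exact string comparison on the first field, not a regex match.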
Regards,
Ed.
| |
| John W. Krahn 2004-03-26, 8:35 pm |
| null_device@ssl-mail.com wrote:
>
> I am running qmail/ezmlm on RH9 with many mailing lists.
> I have this script <appendage>, that I knocked up to apply some
> exclusion logic to some mail lists, no problem.
>
> Now, I have just had a few large-o-rama lists delivered to my desk
> to be incorporated into this system.
>
> Problem is, my script takes *too* long, to go through these lists.
> For now, I am just running this script on various boxes in the background.
>
> However, I want to discover a more elegant way of doing this.
>
> Could anyone make any suggestions as to a more efficient way to use
> sed/grep?
>
> I have tried adding in the $match string at the start for example:
> sed -e "/$match/s/^$match//"
>
> but this only seems to make a nominal amount of difference.
>
> Even without the grep part of the script there's not much difference.
>
> What about Perl?
Yes, you could do it in perl:
#!/usr/bin/perl
my $log = 'replace_match.log';
rename $log, "$log.old" or warn "Cannot rename $log to $log.old: $!";
open LOG, ">>$log" or die "Cannot open $log: $!";
print "\n";
system 'ls';
print "\nWhat is the file to modify? ";
chomp( my $list = <STDIN> );
# set in-place edit variable to modify
# the files in @ARGV in-place
( $^I, @ARGV ) = ( '', $list );
print "\nWhat is the exclusion file? ";
chomp( my $exclusion = <STDIN> );
open IN, "<$exclusion" or die "Cannot open $exclusion: $!";
chomp( my @exclusion = <IN> );
# build one alternation of all the patterns, longest first
my $regex = join '|', sort { length $b <=> length $a } @exclusion;
while ( <> ) {
    # strip a matching prefix and log what was removed
    if ( s/^($regex)// ) {
        print LOG "MATCH ---------> $1\n";
    }
    # with $^I set, this writes back to the file being edited
    print;
}
__END__
John
--
use Perl;
program
fulfillment
| |
| null_device@ssl-mail.com 2004-03-28, 8:35 pm |
| [snip]
>
>Please report your experience. Ideally, quote elapsed times for the same
>task with different approaches.
Will do, but it's not going to be overnight.
Looks like I got what I asked for, hmmm. This is going to take some
referencing. Maybe I haven't looked in the right spot, but *finding*
good advanced sed documentation seems to be harder than learning it.
Anyway, this stuff is priceless. Thanks so much guys.
Nick Sinclair
_________________________________________________________________________
....You've operated behind-the-scenes to suborn the trust of a man who has
stamped you with his imprimatur of class and elegance and stature. I've seen
all kinds and degrees of deception in my time, but this man has been on
the receiving end of machinations so Machiavellian that it has rarely been my
experience to encounter. And yet he has combatted them stoically, and
selflessly, without revealing my identity. Had he violated the vow of secrecy
he took, his task would have been far easier...
_________________________________________________________________________
| |
| null_device@ssl-mail.com 2004-03-28, 8:36 pm |
| [snip]
> Without knowing exactly what you are doing (i.e., what is the
> format of the files?), it's hard to give a specific answer.
Sorry, the files are just simply 'one email address per line' ascii text.
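Given that format, one option is to treat each exclusion entry as a fixed string that must equal a whole line. A minimal sketch, assuming whole addresses are to be dropped rather than stripped as a prefix; the sample data is made up.

```shell
#!/bin/bash
workdir=$(mktemp -d)
list="$workdir/list.txt"
exclusion="$workdir/exclusion.txt"

# Hypothetical sample data: one address per line.
printf '%s\n' 'bob@example.com' > "$exclusion"
printf '%s\n' 'bob@example.com' 'bob@example.com.au' 'bob@exampleXcom' > "$list"

# -F: patterns are fixed strings, so '.' is not a wildcard
# -x: a pattern must match the entire line
# -v: print the lines that do NOT match
grep -v -x -F -f "$exclusion" "$list" > "$list.mod"
cat "$list.mod"
```

Only the exact address is removed; the two look-alike lines stay in the list.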
N
<Quote>
Just look at Apple's success with its iTunes/ iPod platform to see
how damaging Gates' digital rights management strategy is to his
credibility. Instead of protecting users, he's protecting applications,
intellectual property and business models. </Quote>