Unix questions - Grep last matching string in huge file


Author Grep last matching string in huge file
sergei.sheinin@db.com

2006-09-06, 7:30 am

Hello, All!

I need to grep for a value that is contained within a string that
matches a certain pattern. The string looks something like this:

***cookie*** cookie_id_name=XXX

so, i have a pattern consisting of string "***cookie*** cookie_id_name"
-that's the whole string, no problem defining that. What I need is the
XXX value, which I later derive from the string.

The command I use is this:

grep '\*\*\*cookie\*\*\* cookie_id_name' log_file_name.log | tail -1


The problem is those log files are huge, and there are about 20 of them
in all. When I run the above grep command, it uses an unacceptable
amount of system resources.


Could anyone pls make a suggestion on how to make this easier?


Note: this type of string is not the only thing in the log files, which
means that I never know how many lines away from the end of the file
it may be found (maybe 10, maybe 10000). Using a command like
tail -10000 log_file_name.log | grep '\*\*\*cookie\*\*\* cookie_id_name' | tail -1
is the best I can think of, but it's not good enough in my view.




Suggestions are appreciated!
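
A minimal sketch of getting from the matching line to the XXX value, assuming the log line looks exactly like the sample above and the value runs to the end of the line. The function name last_cookie_value is hypothetical. Each asterisk is escaped because an unescaped * is a repetition operator in grep and sed patterns.

```shell
# last_cookie_value FILE
# Print the value from the last "***cookie*** cookie_id_name=..." line in FILE.
# Hypothetical helper; assumes the value runs to the end of the line.
last_cookie_value() {
    # \* matches a literal asterisk; tail -n 1 is the POSIX spelling of tail -1.
    grep '\*\*\*cookie\*\*\* cookie_id_name=' "$1" |
        tail -n 1 |
        sed 's/.*\*\*\*cookie\*\*\* cookie_id_name=//'
}
```

This still reads the whole file, so it only addresses the extraction, not the cost.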

Rainer Temme

2006-09-06, 7:30 am

Sergei,

to me, it seems that you need to solve two problems ...

firstly ... the logfiles are huge ... you don't want to
read the whole file, you only want to read the smallest
possible part of it.

secondly ... you look for a pattern.

The solution would be to read the lines in the file in
reverse order (back to front) rather than in normal
order...
AND
stop reading the file at the first occurrence
of the pattern you search for.

This might require a 'roll your own'
solution. It shouldn't be too hard to write
a program that reads the file from back to
front and outputs complete lines.
Pattern matching could be done with the
functions from regex.h.

Another solution might be to keep the logfiles shorter.

Rainer

Ed Morton

2006-09-06, 1:37 pm

sergei.sheinin@db.com wrote:

> Hello, All!
>
> I need to grep for a value that is contained within a string that
> matches a certain pattern. The string looks something like this:
>
> ***cookie*** cookie_id_name=XXX
>
> so, i have a pattern consisting of string "***cookie*** cookie_id_name"
> -that's the whole string, no problem defining that. What I need is the
> XXX value, which I later derive from the string.
>
> The command I use is this:
>
> grep '\*\*\*cookie\*\*\* cookie_id_name' log_file_name.log | tail -1
>
>
> The problem is those log files are huge, and there are about 20 of them
> in all. When I run the above grep command, it uses an unacceptable
> amount of system resources.
>
>
> Could anyone pls make a suggestion on how to make this easier?
>
>
> Note: this type of string is not the only thing in the log files, which
> means that I never know how many lines away from the end of the file
> it may be found (maybe 10, maybe 10000). Using a command like
> tail -10000 log_file_name.log | grep '\*\*\*cookie\*\*\* cookie_id_name' | tail -1
> is the best I can think of, but it's not good enough in my view.
>


tac file | grep -m 1 <pattern>

Ed.
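
A sketch of how that one-liner behaves, assuming GNU tac and a grep that supports -m are installed (neither is required by POSIX). The wrapper name last_match_gnu is hypothetical.

```shell
# last_match_gnu PATTERN FILE
# Print the last line of FILE that matches PATTERN.
# Assumes GNU tac and a grep with the -m option.
last_match_gnu() {
    # tac emits the lines last-to-first; grep -m 1 exits after its first
    # match, so the first hit in reversed order is the last hit in the file.
    tac "$2" | grep -m 1 -e "$1"
}
```

Usage would be along the lines of: last_match_gnu 'cookie_id_name=' log_file_name.log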
Rainer Temme

2006-09-06, 1:37 pm

Ed Morton wrote:
> tac file | grep -m 1 <pattern>


Bingo ...

tac ... cat ... nice little pun ...
when I thought about the problem I somehow
felt 'there should be a program for this already' ...
but didn't find anything, and reversing cat didn't come
to my mind.

Learned something new today ;-)

Rainer
sergei.sheinin@db.com

2006-09-06, 1:37 pm


>
> tac file | grep -m 1 <pattern>
>
> Ed.



sounds sweet, but what's "tac"? i don't have it as a recognized
command...

sergei.sheinin@db.com

2006-09-06, 1:37 pm


Rainer Temme wrote:
> Ed Morton wrote:
>
> Bingo ...
>
> tac ... cat ... nice little pun ...
> when I thought about the problem I somehow
> felt 'there should be a program for this already' ...
> but didn't find anything, and reversing cat didn't come
> to my mind.
>
> Learned something new today ;-)
>
> Rainer


"pun" as in "fun"? how do I get this to work? also, grep doesn't have
the "-m" option.

Chris F.A. Johnson

2006-09-06, 1:37 pm

On 2006-09-06, sergei.sheinin@db.com wrote:
> Ed Morton wrote:
>
> sounds sweet, but what's "tac"? i don't have it as a recognized
> command...


Neither tac (concatenate and print files in reverse) nor the -m
option to grep is standard. They are part of the GNU utilities.

--
Chris F.A. Johnson, author | <http://cfaj.freeshell.org>
Shell Scripting Recipes: | My code in this post, if any,
A Problem-Solution Approach | is released under the
2005, Apress | GNU General Public Licence
Rainer Temme

2006-09-06, 1:37 pm

sergei.sheinin@db.com wrote:
> Rainer Temme wrote:
> "pun" as in "fun"?


spell 'cat' reversed ... tac .... aha ...
a pun can be interpreted as a 'play with words'
and I found the 'tac ... cat' thing a nice one.

> how do I get this to work? also, grep doesn't have
> the "-m" option.


If your environment doesn't have a tac and a
grep with -m option ... get them from the GNU sources
(coreutils for tac, and GNU grep).
You should be able to compile them for your platform.

Rainer
sergei.sheinin@db.com

2006-09-06, 1:37 pm


> If your environment doesn't have a tac and a
> grep with -m option ... get them from the GNU sources
> (coreutils for tac, and GNU grep).
> You should be able to compile them for your platform.
>
> Rainer



sorry, guys, but it's probably not an option under the circumstances. i
work in an environment where all env changes are looked down upon with
a frown (that's for a reason, btw). so i need to make do with what's
available on solaris 5.8.


Sergei.

Rainer Temme

2006-09-06, 1:37 pm

sergei.sheinin@db.com wrote:
> sorry, guys, but it's probably not an option under the circumstances. i
> work in an environment where all env changes are looked down upon with
> a frown (that's for a reason, btw). so i need to make do with what's
> available on solaris 5.8.


A pity ... are you only allowed to use what's already there, or
can you at least introduce small selfwritten progs?

If you have to 'eat what's on the table' and your approach
with tail isn't good enough, you might try this:

- define a blocksize (say 8K)

- get the length of the logfile in bytes (ls -l) and
calculate its size in blocks (expr)

- use dd to read N blocks,
from EOF_minus_N_blocks to EOF,
and pipe the output to your grep.

- if the line is found you're done.

- If nothing is found, dd from
EOF_minus_(2N-1)_blocks to EOF_minus_(N-1)_blocks
(note: yes, there's an overlap of one block, to handle the
rare case that the line crosses a block border)

- repeat this procedure until either a line is found, or
you're at the beginning of the file.

(use dd's count=xxx bs=xxx skip=xxx options to position
in the file)

Rainer
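
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested production script: it is written for a POSIX shell, the function name last_match_dd and its optional third argument are hypothetical, every line is assumed to be shorter than one 8K block, and the window is assumed to be at least 2 blocks so the one-block overlap makes progress.

```shell
# last_match_dd PATTERN FILE [BLOCKS_PER_WINDOW]
# Scan FILE backwards in windows of blocks; print the last matching line.
# Assumes lines shorter than one block and BLOCKS_PER_WINDOW >= 2.
last_match_dd() {
    pattern=$1 file=$2 step=${3:-16}
    bs=8192
    size=$(ls -l "$file" | awk '{print $5}')     # file length in bytes
    end=$(( (size + bs - 1) / bs ))              # block count, rounded up
    while [ "$end" -gt 0 ]; do
        start=$(( end - step ))
        first=2          # skip line 1 of the window: it may be cut in half
        if [ "$start" -le 0 ]; then
            start=0
            first=1      # window begins at the start of the file: keep line 1
        fi
        # Read blocks start..end-1 and keep the last match in this window.
        hit=$(dd if="$file" bs="$bs" skip="$start" count=$(( end - start )) 2>/dev/null |
              sed -n "${first},\$p" | grep -e "$pattern" | tail -n 1)
        if [ -n "$hit" ]; then
            printf '%s\n' "$hit"
            return 0
        fi
        if [ "$start" -eq 0 ]; then
            break
        fi
        # Step back, overlapping by one block so that a line cut at the
        # window start is seen whole on the next pass.
        end=$(( start + 1 ))
    done
    return 1
}
```

The first line of each window is discarded because it may be the tail end of a line that began in an earlier block; the one-block overlap is what brings that line back, whole, on the next pass.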
Ed Morton

2006-09-06, 1:37 pm

sergei.sheinin@db.com wrote:
>
>
>
> sorry, guys, but it's probably not an option under the circumstances. i
> work in an environment where all env changes are looked down upon with
> a frown (that's for a reason, btw). so i need to make do with what's
> available on solaris 5.8.


I don't know if it'll be any faster, but you could do something like
this (untested) to step back "delta" lines at a time:

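size=`wc -l < file`
delta=1000
end="$size"
while (( end > 0 ))
do
    start=$(( end - delta + 1 ))
    (( start < 1 )) && start=1
    # last match within lines start..end, if any
    match=`sed -n -e "${start},${end}p" -e "${end}q" file | grep pattern | tail -1`
    [ -n "$match" ] && { echo "$match"; break; }
    end=$(( start - 1 ))
done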

Check the logic.

You could also try using awk or sed instead of grep to find your pattern
to see if they're any faster.

Ed.
sergei.sheinin@db.com

2006-09-06, 1:37 pm

Ed Morton wrote:
> sergei.sheinin@db.com wrote:
>
> I don't know if it'll be any faster, but you could do something like
> this (untested) to step back "delta" lines at a time:
>
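> size=`wc -l < file`
> delta=1000
> end="$size"
> while (( end > 0 ))
> do
>     start=$(( end - delta + 1 ))
>     (( start < 1 )) && start=1
>     # last match within lines start..end, if any
>     match=`sed -n -e "${start},${end}p" -e "${end}q" file | grep pattern | tail -1`
>     [ -n "$match" ] && { echo "$match"; break; }
>     end=$(( start - 1 ))
> done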
>
> Check the logic.
>
> You could also try using awk or sed instead of grep to find your pattern
> to see if they're any faster.
>
> Ed.



something like that. problem is, using wc -l on that file is also not a
good idea, as it takes (i just checked) about 10 seconds and 2% cpu.

what i'll probably do is this

for (my $i = 1; $i < 4; $i++)
{
    my $lines = 1000 * $i;
    # i tested this command, the performance is great!
    my $cmd = "tail -$lines logfile | grep 'pattern' | tail -1";
    my @result = `$cmd`;
    if (@result > 0)
    {
        # blablabla...
        last;
    }
}


this loop should work after the first iteration in 90+% of the cases.
if after three iterations it still doesn't get what it's looking for,
then a warning message will be emailed. it's acceptable under the
circumstances ;)


Sergei.
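
The same retry idea, sketched in plain shell for comparison. The function name last_match_tail is hypothetical, the window sizes mirror the 1000/2000/3000 lines of the loop above, and the message on stderr stands in for the emailed warning. tail -n N is the POSIX spelling of tail -N.

```shell
# last_match_tail PATTERN FILE
# Look for the last match in the final 1000, then 2000, then 3000 lines.
last_match_tail() {
    for lines in 1000 2000 3000; do
        hit=$(tail -n "$lines" "$2" | grep -e "$1" | tail -n 1)
        if [ -n "$hit" ]; then
            printf '%s\n' "$hit"
            return 0
        fi
    done
    # Placeholder for the warning mentioned above.
    echo "no match in the last 3000 lines of $2" >&2
    return 1
}
```

Each retry rescans the lines already checked, which is the price of not knowing the line count up front.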

Stephane Chazelas

2006-09-08, 1:30 am

On 6 Sep 2006 06:33:58 -0700, sergei.sheinin@db.com wrote:
>
>
>
> sorry, guys, but it's probably not an option under the circumstances. i
> work in an environment where all env changes are looked down upon with
> a frown (that's for a reason, btw). so i need to make do with what's
> available on solaris 5.8.

[...]

The Solaris equivalent to the GNU specific tac and grep -m is:

tail -r file | grep whatever | head -1

tail -r is not standard either.

--
Stephane
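
Since neither tac nor tail -r is guaranteed to exist, a script that has to run on more than one system could pick whichever is present. A minimal sketch for a POSIX shell; the function names reverse_lines and last_match_rev are hypothetical.

```shell
# reverse_lines FILE
# Print FILE with its lines in reverse order, using whichever tool exists.
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac "$1"          # GNU systems
    else
        tail -r "$1"      # systems whose tail has -r, e.g. Solaris
    fi
}

# last_match_rev PATTERN FILE
# Print the last line of FILE that matches PATTERN.
last_match_rev() {
    # First match in reversed order is the last match in the file.
    reverse_lines "$2" | grep -e "$1" | head -n 1
}
```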
Rainer Temme

2006-09-08, 7:43 am

Stephane Chazelas wrote:
> The Solaris equivalent to the GNU specific tac and grep -m is:
>
> tail -r file | grep whatever | head -1
>
> tail -r is not standard either.


Albeit this looks good, it's not as good as tac|grep -m

grep -m stops execution after the first match ... this stops
the running tac-command as well (because its pipe at stdout
is closed at the read-end by the terminated grep).

With your command, "head -1" might have to wait for a while
(because grep might use buffered IO) until the first line
becomes available. So head will stop the execution later
than grep -m would.

But nevertheless, if both commands are available to the OP
it's certainly worth a try.

Rainer
sergei.sheinin@db.com

2006-09-08, 7:43 am


Stephane Chazelas wrote:

>
> The Solaris equivalent to the GNU specific tac and grep -m is:
>
> tail -r file | grep whatever | head -1
>
> tail -r is not standard either.
>
> --
> Stephane


awesome! this seems to work!! ))


Sergei.

Stephane Chazelas

2006-09-09, 7:26 am

On Fri, 08 Sep 2006 09:34:46 +0200, Rainer Temme wrote:
> Stephane Chazelas wrote:
>
> Albeit this looks good, it's not as good as tac|grep -m
>
> grep -m stops execution after the first match ... this stops
> the running tac-command as well (because its pipe at stdout
> is closed at the read-end by the terminated grep).
>
> With your command, "head -1" might have to wait for a while
> (because grep might use buffered IO) until the first line
> becomes available. So head will stop the execution later
> than grep -m would.

[...]

You're right.

tail -r file | sed '/whatever/!d;q'

That might prove faster.

--
Stephane
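
A short sketch of what that sed script does, wrapped in a hypothetical function first_match_sed. It assumes the pattern contains no "/" character, since "/" is used as the delimiter here.

```shell
# first_match_sed PATTERN
# Print the first line of standard input matching PATTERN, then stop reading.
first_match_sed() {
    # /PATTERN/!d  delete each line that does not match and go to the next one
    # q            reached only by a matching line: print it and quit
    sed -e "/$1/!d" -e q
}
```

Fed from the reversed file, for example with tail -r file | first_match_sed whatever, sed quits on the first hit, so no separate head is needed.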

Copyright 2003 - 2008 webservertalk.com