Unix Shell - How to extract certain part of an regexp?


Author How to extract certain part of an regexp?
byang

2007-12-17, 7:34 am

Hi,
I know I can "grep -o " to find the whole regexp pattern, but how
can I extract the part of the regexp pattern? For example:
There is a string:

The report number id is <12345> and index is 23;

I want to extract the number 12345 only. Thank you very much!

Regards!
Bo

Janis

2007-12-17, 7:34 am

On 17 Dec., 09:40, byang <techr...@eyou.com> wrote:
> Hi,
> I know I can "grep -o " to find the whole regexp pattern, but how


$ grep -o
grep: Unknown option -o

> can I extract the part of the regexp pattern? For example:
> There is a string:
>
> The report number id is <12345> and index is 23;
>
> I want to extract the number 12345 only. Thank you very much!


What is the pattern criterion for the extraction?
Assuming it's the angle brackets...

awk 'match($0,/<.*>/){print substr($0,RSTART+1,RLENGTH-2)}'
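One caveat on the pattern, as a hedged sketch rather than a correction: ".*" is greedy, so on a line with several bracketed fields /<.*>/ spans from the first "<" to the last ">". Assuming the wanted text never contains ">", /<[^>]*>/ stays on the first pair. The function name first_bracketed is made up for illustration.

```shell
# first_bracketed is a hypothetical helper name, not from the thread.
# match() sets RSTART/RLENGTH for the leftmost <...> pair; substr()
# then drops the opening "<" and the closing ">".
first_bracketed() {
    awk 'match($0, /<[^>]*>/) { print substr($0, RSTART + 1, RLENGTH - 2) }'
}

echo 'The report number id is <12345> and index is 23;' | first_bracketed
echo '<1234> <54321>' | first_bracketed
```

Lines without any <...> pair produce no output, because the match() call doubles as the awk pattern.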


Janis

>
> Regards!
> Bo


Stephane Chazelas

2007-12-17, 7:34 am

On Mon, 17 Dec 2007 16:40:14 +0800, byang wrote:
> Hi,
> I know I can "grep -o " to find the whole regexp pattern, but how


Note that -o is a GNU extension.
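For readers whose grep does have -o: it prints the whole match, delimiters included, so one way to get just the digits is to strip the angle brackets afterwards. A minimal sketch, assuming a grep that supports -o (GNU grep does, others may not); the name grep_then_trim is invented here.

```shell
# grep_then_trim is a hypothetical helper name.
# grep -o prints each match on its own line (here "<12345>"),
# then tr deletes the surrounding angle brackets.
grep_then_trim() {
    grep -o '<[0-9][0-9]*>' | tr -d '<>'
}

echo 'The report number id is <12345> and index is 23;' | grep_then_trim
```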

> can I extract the part of the regexp pattern? For example:
> There is a string:
>
> The report number id is <12345> and index is 23;
>
> I want to extract the number 12345 only. Thank you very much!

[...]

perl -lne 'print for /id is <(\d+)>/g'

--
Stephane

Bo Yang

2007-12-17, 1:29 pm

Janis
> On 17 Dec., 09:40, byang <techr...@eyou.com> wrote:
>
> $ grep -o
> grep: Unknown option -o
>
>
> What is the pattern criterion for the extraction?
> Assuming it's the angle brackets...

Yes!
>
> awk 'match($0,/<.*>/){print substr($0,RSTART+1,RLENGTH-2)}'


Thank you very much, it works well. But is there an awk solution like
the sed one? I mean, in sed, when I replace a string, I can use () to
group part of the regexp and then substitute it later. Is there a
similar way in awk, but for displaying the grouped part?

Regards!
Bo

Bo Yang

2007-12-17, 1:29 pm

Stephane Chazelas :
> On Mon, 17 Dec 2007 16:40:14 +0800, byang wrote:
>
> Note that -o is a GNU extension.
>
> [...]
>
> perl -lne 'print for /id is <(\d+)>/g'
>

After some Googling, I found no explanation for the /id. Could you
please explain it to me in more detail? Thank you very much!

Regards!
Bo

Janis

2007-12-17, 1:29 pm

On 17 Dec., 14:46, Bo Yang <struggl...@gmail.com> wrote:
> Janis
>
>
>
>
>
>
>
>
> Yes!
>
>
> Thank you very much, it works well. But is there an awk solution like
> the sed one?


Which one? You didn't quote any. I suppose something like...

sed '/<.*>/s/.*<\(.*\)>.*/\1/'

I'm not quite sure whether backreferences are supported by all awks.
The awk statement similar to the sed pattern is...

sub(/.*<(.*)>.*/,"\1")

> I mean, in sed, when I replace a string, I can use () to
> group part of the regexp and then substitute it later. Is there a
> similar way in awk, but for displaying the grouped part?


Personally I find the match/substr construct much clearer.
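For comparison, a hedged sketch of the sed route. As written above, the sed command still prints lines that contain no <...> at all, because sed echoes every line by default. Assuming only the captured text is wanted, -n together with the p flag suppresses the rest, and [^<]* / [^>]* keep the match on the first pair. The name sed_capture is hypothetical.

```shell
# sed_capture is a hypothetical helper name.
# -n turns off automatic printing; the trailing p prints a line only
# when the substitution succeeded. \( \) marks the group that \1 recalls.
sed_capture() {
    sed -n 's/[^<]*<\([^>]*\)>.*/\1/p'
}

echo 'The report number id is <12345> and index is 23;' | sed_capture
printf 'plain line\nid <7>\n' | sed_capture
```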

Janis

>
> Regards!
> Bo


Janis

2007-12-17, 1:29 pm

On 17 Dec., 14:59, Bo Yang <struggl...@gmail.com> wrote:
> Stephane Chazelas :
>
>
>
>
>
>
>
> After some Googling, I found no explanation for the /id. Could you
> please explain it to me in more detail? Thank you very much!


perl -lne 'print for /............./g'

Just a pattern "id is <(\d+)>"

Janis

> Regards!
> Bo


Ed Morton

2007-12-17, 1:29 pm



On 12/17/2007 8:22 AM, Janis wrote:
> On 17 Dec., 14:46, Bo Yang <struggl...@gmail.com> wrote:
>
>
>
> Which one? You didn't quote any. I suppose something like...
>
> sed '/<.*>/s/.*<\(.*\)>.*/\1/'
>
> I'm not quite sure whether backreferences are supported by all awks.
> The awk statement similar to the sed pattern is...
>
> sub(/.*<(.*)>.*/,"\1")


AFAIK, no awks support backreferences in sub() or gsub(). GNU awk supports that
in gensub() only:

print gensub(/.*<(.*)>.*/,"\\1","")
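To make the gensub() call concrete, a small sketch. Assumptions: GNU awk is installed under the name gawk (gensub() is a GNU extension, so other awks will reject it), and the wrapper name capture_gensub is invented here. The third argument is written as 1 (replace the first match) instead of the empty string.

```shell
# capture_gensub is a hypothetical name. It calls gawk explicitly and
# reports an error when gawk is not installed.
capture_gensub() {
    if ! command -v gawk >/dev/null 2>&1; then
        echo 'gawk not found' >&2
        return 127
    fi
    # The print statement has to sit inside an action block { ... }.
    # "\\1" reaches gensub() as \1, i.e. the first parenthesised group.
    gawk '{ print gensub(/.*<(.*)>.*/, "\\1", 1) }'
}
```

With gawk available, echo 'a<b>c' | capture_gensub prints b. A line without a <...> pair is printed unchanged, since gensub() returns the original string when nothing matches.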

Regards,

Ed.

>
>
>
> Personally I find the match/substr construct much clearer.
>
> Janis
>
>
>
>


Janis

2007-12-17, 1:29 pm

On 17 Dec., 15:41, Ed Morton <mor...@lsupcaemnt.com> wrote:
> On 12/17/2007 8:22 AM, Janis wrote:
>
>
>
> AFAIK, no awks support backreferences in sub() or gsub().


$ echo "id is <12345> and index" | awk 'sub(/.*<(.*)>.*/,"\1")'
12345

....this is the awk shipped with MKS.

Janis

> GNU awk supports that in gensub() only:
>
> print gensub(/.*<(.*)>.*/,"\\1","")
>
> Regards,
>
> Ed.

Stephane Chazelas

2007-12-17, 1:29 pm

On Mon, 17 Dec 2007 07:27:15 -0800 (PST), Janis wrote:
> On 17 Dec., 15:41, Ed Morton <mor...@lsupcaemnt.com> wrote:
>
> $ echo "id is <12345> and index" | awk 'sub(/.*<(.*)>.*/,"\1")'
> 12345
>
> ...this is the awk shipped with MKS.

[...]

So how do you substitute something with the ^A character with
that awk? Do you have to include 0s: sub(/.../, "\01")?

--
Stephane

Ed Morton

2007-12-17, 1:29 pm



On 12/17/2007 9:27 AM, Janis wrote:
> On 17 Dec., 15:41, Ed Morton <mor...@lsupcaemnt.com> wrote:
>
>
>
> $ echo "id is <12345> and index" | awk 'sub(/.*<(.*)>.*/,"\1")'
> 12345
>
> ...this is the awk shipped with MKS.
>


Interesting. Is there any particular name for that awk (like gawk or nawk or
tawk or mawk or...)? How do you get it to behave like other awks when you want
to replace the matched RE with a "\1" (control-A) char:

$ echo "a<b>c" | awk 'sub(/<(.*)>/,"\1")'
a?c

Regards,

Ed.

Janis

2007-12-17, 1:29 pm

On 17 Dec., 16:41, Ed Morton <mor...@lsupcaemnt.com> wrote:
> On 12/17/2007 9:27 AM, Janis wrote:
>
>
>
>
>
>
> Interesting. Is there any particular name for that awk (like gawk or nawk or
> tawk or mawk or...)?


It's just an "awk.exe", in the MKS/mksnt installation directory.
Inspecting the raw exe code just provides source strings like...
//centi.mks.com/rd/src/awk/rcs/awk0.c 1.37 1999/05/14
I can't seem to find any version information.
The man page says (WRT portability)...
POSIX.2. x/OPEN Portability Guide 4.0. All UNIX systems.
Windows 95/98/Millennium. Windows NT 4.0. Windows 2000.

> How do you get it to behave like other awks when you want
> to replace the matched RE with a "\1" (control-A) char:


That was, actually, an inaccuracy of mine. While the MKS awk indeed
accepts...

awk 'sub(/<(.*)>/,"\1")'

in that context it would have been better to write...

awk 'sub(/<(.*)>/,"\\1")'

(With awk I haven't yet played with <ctrl> sequences.)

Janis

> $ echo "a<b>c" | awk 'sub(/<(.*)>/,"\1")'
> a?c
>
> Regards,
>
> Ed.


byang

2007-12-18, 1:41 am

Ed Morton wrote:
>
> On 12/17/2007 8:22 AM, Janis wrote:
>
> AFAIK, no awks support backreferences in sub() or gsub(). GNU awk supports that
> in gensub() only:
>
> print gensub(/.*<(.*)>.*/,"\\1","")


I ran echo "a<b>c" | awk 'print gensub(/.*<(.*)>.*/,"\\1","")' and got
the following errors:

awk: cmd. line:1: print gensub(/.*<(.*)>.*/,"\\1","")
awk: cmd. line:1: ^ syntax error

Could you please tell what is the error? Thanks!

Regards!
Bo


byang

2007-12-18, 1:41 am

Janis wrote:
> On 17 Dec., 14:59, Bo Yang <struggl...@gmail.com> wrote:
>
> perl -lne 'print for /............./g'
>
> Just a pattern "id is <(\d+)>"


Ah-oh, how silly I am! Thank you very much!

Regards!
Bo



byang

2007-12-18, 1:41 am

Janis wrote:
> On 17 Dec., 14:59, Bo Yang <struggl...@gmail.com> wrote:
>
> perl -lne 'print for /............./g'
>
> Just a pattern "id is <(\d+)>"


And one more question, if I have a string

<1234> <54321>, I want to print only the first one "1234". How can I
modify the above perl command to achieve this?

Thanks!


Bill Marcum

2007-12-18, 1:41 am

On 2007-12-18, byang <techrazy@eyou.com> wrote:
>
> I ran echo "a<b>c" | awk 'print gensub(/.*<(.*)>.*/,"\\1","")' and got
> the following errors:
>
> awk: cmd. line:1: print gensub(/.*<(.*)>.*/,"\\1","")
> awk: cmd. line:1: ^ syntax error
>
> Could you please tell what is the error? Thanks!
>

It should be awk '{print gensub(/.*<(.*)>.*/,"\\1","")}'

Stephane Chazelas

2007-12-18, 7:33 am

On Tue, 18 Dec 2007 12:50:31 +0800, byang wrote:
> Janis wrote:
>
> And one more question, if I have a string
>
> <1234> <54321>, I want to print only the first one "1234". How can I
> modify the above perl command to achieve this?

[...]

Take off the "g".

perl -lne 'print for /<(\d+)>/'

Or to print the 3rd one:

perl -lne 'print for (/<(\d+)>/g)[2]'

(indices start at 0 in perl).

--
Stephane

Bo Yang

2007-12-18, 7:33 am

Stephane Chazelas :
> On Tue, 18 Dec 2007 12:50:31 +0800, byang wrote:
> [...]
>
> Take off the "g".
>
> perl -lne 'print for /<(\d+)>/'
>
> Or to print the 3rd one:
>
> perl -lne 'print for (/<(\d+)>/g)[2]'


Thank you very much!

Regards!
Bo

Bo Yang

2007-12-18, 7:33 am

Bill Marcum :
> On 2007-12-18, byang <techrazy@eyou.com> wrote:
> It should be awk '{print gensub(/.*<(.*)>.*/,"\\1","")}'


Thank you very much!

Regards!
Bo

Bo Yang

2007-12-18, 7:33 am

Stephane Chazelas :
> On Tue, 18 Dec 2007 12:50:31 +0800, byang wrote:
> [...]
>
> Take off the "g".
>
> perl -lne 'print for /<(\d+)>/'
>
> Or to print the 3rd one:
>
> perl -lne 'print for (/<(\d+)>/g)[2]'


I read some more documentation about Perl an hour ago, and I think the

print for (/<(\d+)>/g)[2] can be replaced with
print (/<(\d+)>/g)[2]

because the [2] requires a list context, so the /<(\d+)>/g will create
one for it and then return the third element in the list. But it failed,
and I am wondering why. Could you please help some more? Thanks very much!

Regards!
Bo

Stephane Chazelas

2007-12-18, 1:32 pm

On Tue, 18 Dec 2007 21:07:01 +0800, Bo Yang wrote:
>
> I read some more documentation about Perl an hour ago, and I think the
>
> print for (/<(\d+)>/g)[2] can be replaced with
> print (/<(\d+)>/g)[2]
>
> because the [2] requires a list context, so the /<(\d+)>/g will create
> one for it and then return the third element in the list. But it failed,
> and I am wondering why. Could you please help some more? Thanks very much!

[...]

Try perl -lne 'print((/<(\d+)>/g)[2])'

I use "print for" in "perl -le" by habit.

As an alternative to

perl -lne 'print for /.../g'

you can also do:

perl -ne 'BEGIN{$\=$,="\n"}print/.../g'

--
Stephane

John W. Krahn

2007-12-18, 1:32 pm

Bo Yang wrote:
>
> Stephane Chazelas :
>
> I read some more documentation about Perl an hour ago, and I think the
>
> print for (/<(\d+)>/g)[2] can be replaced with
> print (/<(\d+)>/g)[2]
>
> because the [2] requires a list context, so the /<(\d+)>/g will create
> one for it and then return the third element in the list. But it failed,
> and I am wondering why.


It failed because you changed it from using the print operator (no
parentheses) to using the print() function (with parentheses.)

perldoc perlop

DESCRIPTION
Terms and List Operators (Leftward)

A TERM has the highest precedence in Perl. They include
variables, quote and quote-like operators, any expression
in parentheses, and any function whose arguments are
parenthesized. Actually, there aren't really functions in
this sense, just list operators and unary operators
behaving as functions because you put parentheses around
the arguments. These are all documented in the perlfunc
manpage.

If any list operator (print(), etc.) or any unary operator
(chdir(), etc.) is followed by a left parenthesis as the
next token, the operator and arguments within parentheses
are taken to be of highest precedence, just like a normal
function call.

In the absence of parentheses, the precedence of list
operators such as `print', `sort', or `chmod' is either
very high or very low depending on whether you are looking
at the left side or the right side of the operator. For
example, in

@ary = (1, 3, sort 4, 2);
print @ary; # prints 1324

the commas on the right of the sort are evaluated before
the sort, but the commas on the left are evaluated after.
In other words, list operators tend to gobble up all
arguments that follow, and then act like a simple TERM
with regard to the preceding expression. Be careful with
parentheses:

# These evaluate exit before doing the print:
print($foo, exit); # Obviously not what you want.
print $foo, exit; # Nor is this.

# These do the print before evaluating exit:
(print $foo), exit; # This is what you want.
print($foo), exit; # Or this.
print ($foo), exit; # Or even this.

Also note that

print ($foo & 255) + 1, "\n";

probably doesn't do what you expect at first glance. See
the Named Unary Operators entry elsewhere in this document
for more discussion of this.


So in your example:

print (/<(\d+)>/g)[2]

You have the print function:

print(/<(\d+)>/g)

Followed by:

[2]

Which is a syntax error.



John
--
use Perl;
program
fulfillment

mik3l3374@gmail.com

2007-12-20, 7:35 am

On Dec 17, 4:40 pm, byang <techr...@eyou.com> wrote:
> Hi,
> I know I can "grep -o " to find the whole regexp pattern, but how
> can I extract the part of the regexp pattern? For example:
> There is a string:
>
> The report number id is <12345> and index is 23;
>
> I want to extract the number 12345 only. Thank you very much!
>
> Regards!
> Bo


in bash

$ s="The report number id is <12345> and index is 23;"
$ IFS="<>"
$ set -- $s
$ echo "$2"
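Note that the snippet above changes IFS for the rest of the session and overwrites the positional parameters. A hedged alternative that avoids both, using only POSIX parameter expansion; the function name between_brackets and the variable rest are made up for this sketch.

```shell
# between_brackets is a hypothetical helper; IFS and the caller's
# positional parameters stay untouched. Note that "rest" is a plain
# (global) shell variable, since POSIX sh has no "local".
between_brackets() {
    case $1 in
        *'<'*'>'*) ;;            # proceed only if a <...> pair is present
        *) return 1 ;;
    esac
    rest=${1#*<}                 # drop everything up to the first "<"
    printf '%s\n' "${rest%%>*}"  # drop everything from the next ">" on
}

between_brackets 'The report number id is <12345> and index is 23;'
```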


Copyright 2003 - 2008 webservertalk.com