Unix Shell - Extract a substring of n digits from a string


Author Extract a substring of n digits from a string
Giacomo

2005-10-26, 8:51 pm

I need to extract a substring of n adjacent digits from every single
line of a file. The position of the n digits is different from line to
line.

For example:

asdasd 123 asd 191991 1234
lijoioi 4567 asdi 67567 iojoii

For n=4 the results for the two lines must be 1234 and 4567.

Thanks in advance,
Giacomo.
Janis Papanagnou

2005-10-26, 8:51 pm

Giacomo wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.


What type of shell or programs do you have to use?

What have you tried to program thus far?

General outline, for example...
Depending on whether the shell/tool/program supports extended regular
expressions or not you have to either define a regexp like [0-9]{n} or
construct one from n sequences of [0-9]. This regexp must be embedded
within white space [ \t] or non-numerical patterns [^0-9] depending on
your requirements. Take care of the line boundaries, so you'll likely
have to consider start of line ^ for the left and end of line $ for
the right boundary. Finally extract the substring from the matching
part. Consider adding spaces to the front and rear of the input line
to simplify the matching and extraction of the substring pattern.
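
As a rough illustration of the outline above, here is a minimal bash sketch. It assumes a bash that provides the [[ =~ ]] operator and BASH_REMATCH (bash 3.0 or later) and a regex library that accepts the {n} interval. The function name first_n_digits is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first run of exactly n digits on each input line.
first_n_digits() {
    local n=$1 line re
    # Exactly n digits, bounded on each side by a non-digit or a line boundary.
    re="(^|[^0-9])([0-9]{$n})([^0-9]|\$)"
    while IFS= read -r line || [ -n "$line" ]; do
        # The pattern is kept in a variable and left unquoted on the right of =~.
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[2]}"
        fi
    done
}

printf '%s\n' 'asdasd 123 asd 191991 1234' 'lijoioi 4567 asdi 67567 iojoii' |
    first_n_digits 4
```

On the two sample lines this prints 1234 and 4567. Because the boundaries are spelled out with ^ and $ alternatives, no padding of the input line is needed here.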

> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for the two lines must be 1234 and 4567.


Janis
William James

2005-10-26, 8:51 pm


Giacomo wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for the two lines must be 1234 and 4567.
>
> Thanks in advance,
> Giacomo.


ruby -ne 'puts $1 if /(?:^|\D)(\d{4})(?!\d)/'

William Park

2005-10-27, 2:48 am

Giacomo <a@b.cde> wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for the two lines must be 1234 and 4567.


a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
RE='\<[0-9]{4}\>'
echo "${a|+$RE}"

Ref:
http://home.eol.ca/~parkw/index.htm...meter_expansion

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Ed Morton

2005-10-27, 2:48 am

Giacomo wrote:

> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for the two lines must be 1234 and 4567.
>
> Thanks in advance,
> Giacomo.


Using a POSIX awk:

awk '{for (i=1;i<=NF;i++) if ($i ~ /^[0-9]{4}$/) print $i}'

To get GNU awk (gawk) to behave like that, use
awk --posix ... or awk --re-interval ....

There are cuter ways to get the same result in awk, but this is the
simplest and most obvious.
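
For an awk without interval support, the same field test can be built from n copies of [0-9] instead of {4}. The following is only a sketch of that idea; the wrapper name n_digit_fields is hypothetical, and like the one-liner above it only finds digit runs that form a whole whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical wrapper: print every whitespace-separated field that is exactly n digits.
n_digit_fields() {
    awk -v n="$1" '
        BEGIN {
            # Build ^[0-9][0-9]...$ from n character classes, so no {n} interval is needed.
            pat = "^"
            for (i = 1; i <= n; i++) pat = pat "[0-9]"
            pat = pat "$"
        }
        { for (i = 1; i <= NF; i++) if ($i ~ pat) print $i }
    '
}

printf '%s\n' 'asdasd 123 asd 191991 1234' 'lijoioi 4567 asdi 67567 iojoii' |
    n_digit_fields 4
```

On the sample lines this prints 1234 and 4567.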

Regards,

Ed.
Giacomo

2005-10-28, 4:53 pm

Janis Papanagnou wrote:

> What type of shell or programs do you have to use?


GNU bash, version 3.00.16(1)-release (i386-pc-linux-gnu)


> What have you tried to program thus far?


I tried "expr", but I think it can't work.

Giacomo.
Stephane Chazelas

2005-10-28, 4:53 pm

On Thu, 27 Oct 2005 01:09:37 +0200, Giacomo wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for the two lines must be 1234 and 4567.

[...]

n=4
sed -n "s/.*/+&+/
s/.*[^0-9]\([0-9]\{$n\}\)[^0-9].*/\1/p" < file

That returns the right-most sequence of 4 digits.
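
A small sketch of the right-most behaviour, wrapping the sed command above in a function so that n can be passed as an argument. The name rightmost_n_digits and the test lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed command above; $1 is the run length n.
rightmost_n_digits() {
    # First pad the line with "+" on both sides so every digit run has a
    # non-digit neighbour; the greedy leading .* then selects the last run.
    sed -n "s/.*/+&+/
s/.*[^0-9]\([0-9]\{$1\}\)[^0-9].*/\1/p"
}

printf '%s\n' 'ab 1111 cd 2222 ef' 'x 12345 y' | rightmost_n_digits 4
```

This prints only 2222: the first line has two 4-digit runs and the later one wins, while the second line has no run of exactly 4 digits and so prints nothing.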

--
Stephane
Dave Gibson

2005-10-28, 4:53 pm

Giacomo <a@b.cde> wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for the two lines must be 1234 and 4567.
>
> Thanks in advance,
> Giacomo.


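#! /usr/bin/awk -f

BEGIN {
    # A candidate run must appear as a substring of this ascending sequence.
    sequence = "0123456789"
    reqlen = count > 0 ? count : 4
    # Build a pattern of reqlen consecutive [0-9] classes.
    for (i = 1; i <= reqlen; i++)
        pat = pat "[0-9]"
}

$0 ~ pat {
    for (i = 1; i <= NF; i++) {
        s = $i
        # Slide along the field one character at a time.
        while (s ~ pat) {
            p = substr(s, 1, reqlen)
            if (index(sequence, p)) {
                # printf "%s, line %d (field %d): %s\n", FILENAME, FNR, i, $i
                print p
                next
            }
            s = substr(s, 2)
        }
    }
}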

Copy that into a file and use it like this (replace "script.awk" with
whatever you name it):

awk -v count=4 -f script.awk your_data.files

If you don't ask for a specific sequence length (with -v count=N) the
script assumes 4.
Dan Mercer

2005-10-28, 4:53 pm

"William Park" <opengeometry@yahoo.ca> wrote in message news:f2abd$43603ea4$d8fe9d17$6594@PRIMUS.CA...
: Giacomo <a@b.cde> wrote:
: > I need to extract a substring of n adjacent digits from every single
: > line of a file. The position of the n digits is different from line to
: > line.
: >
: > For example:
: >
: > asdasd 123 asd 191991 1234
: > lijoioi 4567 asdi 67567 iojoii
: >
: > For n=4 the results for the two lines must be 1234 and 4567.
:
: a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
: RE='\<[0-9]{4}\>'
: echo "${a|+$RE}"

This can be done in the shell, assuming the lines contain only lowercase letters and
numbers:

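ifs=$IFS
letters=abcdefghijklmnopqrstuvwxyz
while read line
do
    # Split on letters as well as whitespace; runs of letters leave empty fields.
    IFS="${ifs}${letters}"
    set -- $line
    # Re-split with the original IFS to drop the empty fields.
    IFS=$ifs
    set -- $*
    for i
    do
        ((${#i} == 4)) && echo "$i"
    done
done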

Now, this isn't the efficient way to do it. That would probably be
best done in perl.

Dan Mercer



William Park

2005-10-28, 4:53 pm

Dan Mercer <dmercer@mn.rr.com> wrote:
> "William Park" <opengeometry@yahoo.ca> wrote in message news:f2abd$43603ea4$d8fe9d17$6594@PRIMUS.CA...
> : Giacomo <a@b.cde> wrote:
> : > asdasd 123 asd 191991 1234
> : > lijoioi 4567 asdi 67567 iojoii
> : >
> : > For n=4 the results for the two lines must be 1234 and 4567.
> :
> : a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
> : RE='\<[0-9]{4}\>'
> : echo "${a|+$RE}"
>
> This can be done in the shell, assuming the lines contain only lowercase letters and
> numbers:
>
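> ifs=$IFS
> letters=abcdefghijklmnopqrstuvwxyz
> while read line
> do
>     # Split on letters as well as whitespace; runs of letters leave empty fields.
>     IFS="${ifs}${letters}"
>     set -- $line
>     # Re-split with the original IFS to drop the empty fields.
>     IFS=$ifs
>     set -- $*
>     for i
>     do
>         ((${#i} == 4)) && echo "$i"
>     done
> done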
>
> Now, this isn't the efficient way to do it. That would probably be
> best done in perl.


Interesting approach. I would probably do it as
    for i in `tr -c '0-9' ' ' < file`; do
        [ ${#i} -eq 4 ] && echo $i
    done
or if there is lots of repetition,
    func()
    {
        [ ${#1} -eq 4 ]
    }
    set -- `tr -c '0-9' ' ' < file`
    echo ${@|?func}

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Dan Mercer

2005-10-28, 8:48 pm

"William Park" <opengeometry@yahoo.ca> wrote in message news:5202b$43617e07$d8fea14c$3893@PRIMUS.CA...
: Dan Mercer <dmercer@mn.rr.com> wrote:
: > "William Park" <opengeometry@yahoo.ca> wrote in message news:f2abd$43603ea4$d8fe9d17$6594@PRIMUS.CA...
: > : Giacomo <a@b.cde> wrote:
: > : > asdasd 123 asd 191991 1234
: > : > lijoioi 4567 asdi 67567 iojoii
: > : >
: > : > For n=4 the results for the two lines must be 1234 and 4567.
: > :
: > : a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
: > : RE='\<[0-9]{4}\>'
: > : echo "${a|+$RE}"
: >
: > This can be done in the shell, assuming the lines contain only lowercase letters and
: > numbers:
: >
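: > ifs=$IFS
: > letters=abcdefghijklmnopqrstuvwxyz
: > while read line
: > do
: >     # Split on letters as well as whitespace; runs of letters leave empty fields.
: >     IFS="${ifs}${letters}"
: >     set -- $line
: >     # Re-split with the original IFS to drop the empty fields.
: >     IFS=$ifs
: >     set -- $*
: >     for i
: >     do
: >         ((${#i} == 4)) && echo "$i"
: >     done
: > done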
: >
: > Now, this isn't the efficient way to do it. That would probably be
: > best done in perl.
:
: Interesting approach. I would probably do it as
: for i in `tr -c '0-9' ' ' < file`; do
: [ ${#i} -eq 4 ] && echo $i
: done
: or if there is lots of repetition,
: func()
: {
: [ ${#1} -eq 4 ]
: }
: set -- `tr -c '0-9' ' ' < file`
: echo ${@|?func}

But the challenge is NOT to invoke external programs (;-)

Dan Mercer
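
One way to meet the "no external programs" constraint, sketched here with bash builtins only. It assumes bash's ${var//pattern/replacement} expansion; the function name digit_runs is hypothetical. It mirrors the tr idea quoted above, but does the replacement inside the shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every run of exactly n digits, using bash builtins only.
digit_runs() {
    local n=$1 line word
    local IFS=$' \t\n'   # make sure word splitting uses the default separators
    while IFS= read -r line || [ -n "$line" ]; do
        # Turn every non-digit into a space, then let word splitting isolate the runs.
        for word in ${line//[^0-9]/ }; do
            if [ "${#word}" -eq "$n" ]; then
                printf '%s\n' "$word"
            fi
        done
    done
}

printf '%s\n' 'asdasd 123 asd 191991 1234' 'lijoioi 4567 asdi 67567 iojoii' |
    digit_runs 4
```

On the sample lines this prints 1234 and 4567. Unlike the letter-based IFS trick, it does not depend on the non-digit characters being lowercase letters.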


