Extract a substring of n digits from a string
| Giacomo 2005-10-26, 8:51 pm |
| I need to extract a substring of n adjacent digits from every single
line of a file. The position of the n digits is different from line to
line.
For example:
asdasd 123 asd 191991 1234
lijoioi 4567 asdi 67567 iojoii
For n=4 the results for these two lines must be 1234 and 4567.
Thanks in advance,
Giacomo.
| |
| Janis Papanagnou 2005-10-26, 8:51 pm |
| Giacomo wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
What type of shell or programs do you have to use?
What have you tried to program thus far?
General outline, for example...
Depending on whether the shell/tool/program supports extended regular
expressions or not you have to either define a regexp like [0-9]{n} or
construct one from n sequences of [0-9]. This regexp must be embedded
within white space [ \t] or non-numerical patterns [^0-9] depending on
your requirements. Take care of the line boundaries, so you'll likely
have to consider start of line ^ for the left and end of line $ for
the right boundary. Finally extract the substring from the matching
part. Consider adding spaces to the front and rear of the input line
to simplify the matching and extraction of the substring pattern.
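As a rough illustration of the outline above, here is a minimal bash sketch (it needs bash 3.0 or later for the `=~` operator). The function name extract_ndigits is made up for this example. It pads each line with a space on both sides so that the [^0-9] delimiters also match at the line boundaries, and it prints the leftmost run of exactly n digits, if any.

```shell
#!/usr/bin/env bash
# extract_ndigits N: read lines from stdin and print, for each line,
# the leftmost run of exactly N digits (hypothetical helper name).
extract_ndigits() {
    local n=$1 line re
    # N digits, delimited by one non-digit on each side
    re="[^0-9]([0-9]{$n})[^0-9]"
    while IFS= read -r line; do
        # pad with spaces so runs at the start or end of the line match too
        if [[ " $line " =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
        fi
    done
}
```

Usage would be something like `extract_ndigits 4 < file`. Lines without a matching run produce no output.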
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for these two lines must be 1234 and 4567.
Janis
| |
| William James 2005-10-26, 8:51 pm |
|
Giacomo wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for these two lines must be 1234 and 4567.
>
> Thanks in advance,
> Giacomo.
ruby -ne 'puts $1 if /(?:^|\D)(\d{4})(?!\d)/'
| |
| William Park 2005-10-27, 2:48 am |
| Giacomo <a@b.cde> wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for these two lines must be 1234 and 4567.
a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
RE='\<[0-9]{4}\>'
echo "${a|+$RE}"
Ref:
http://home.eol.ca/~parkw/index.htm...meter_expansion
--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
| |
| Ed Morton 2005-10-27, 2:48 am |
| Giacomo wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for these two lines must be 1234 and 4567.
>
> Thanks in advance,
> Giacomo.
Using a POSIX awk:
awk '{for (i=1;i<=NF;i++) if ($i ~ /^[0-9]{4}$/) print $i}'
To get GNU awk (gawk) to behave like that, use
awk --posix ... or awk --re-interval ....
There are cuter ways to get the same result in awk, but this is the
simplest and most obvious.
Regards,
Ed.
| |
| Giacomo 2005-10-28, 4:53 pm |
| Janis Papanagnou wrote:
> What type of shell or programs do you have to use?
GNU bash, version 3.00.16(1)-release (i386-pc-linux-gnu)
> What have you tried to program thus far?
I tried "expr", but I think it can't work.
Giacomo.
| |
| Stephane Chazelas 2005-10-28, 4:53 pm |
| On Thu, 27 Oct 2005 01:09:37 +0200, Giacomo wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for these two lines must be 1234 and 4567.
[...]
n=4
sed -n "s/.*/+&+/
s/.*[^0-9]\([0-9]\{$n\}\)[^0-9].*/\1/p" < file
That returns the right-most sequence of 4 digits.
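To make the "right-most" behaviour concrete, here is a small sketch that wraps the same two sed commands in a bash function (the name rightmost_run is invented for this example) and feeds it a line containing two 4-digit runs. Because the leading .* is greedy, the last qualifying run is the one that survives.

```shell
#!/usr/bin/env bash
# rightmost_run N: print the right-most run of exactly N digits on each
# input line (hypothetical wrapper around the sed commands quoted above).
rightmost_run() {
    local n=$1
    # 1st command: wrap the line in "+" so runs at either end have a
    #              non-digit neighbour
    # 2nd command: greedy .* pushes the match as far right as possible
    sed -n "s/.*/+&+/
        s/.*[^0-9]\([0-9]\{$n\}\)[^0-9].*/\1/p"
}

# Two candidate runs on one line: prints 2222
printf '%s\n' 'ab 1111 cd 2222 ef' | rightmost_run 4
```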
--
Stephane
| |
| Dave Gibson 2005-10-28, 4:53 pm |
| Giacomo <a@b.cde> wrote:
> I need to extract a substring of n adjacent digits from every single
> line of a file. The position of the n digits is different from line to
> line.
>
> For example:
>
> asdasd 123 asd 191991 1234
> lijoioi 4567 asdi 67567 iojoii
>
> For n=4 the results for these two lines must be 1234 and 4567.
>
> Thanks in advance,
> Giacomo.
#! /usr/bin/awk -f
BEGIN {
    sequence = "0123456789"
    reqlen = count > 0 ? count : 4
    for (i = 1; i <= reqlen; i++)
        pat = pat "[0-9]"
}
$0 ~ pat {
    for (i = 1; i <= NF; i++) {
        s = $i
        while (s ~ pat) {
            p = substr(s, 1, reqlen)
            if (index(sequence, p)) {
                # printf "%s, line %d (field %d): %s\n", FILENAME, FNR, i, $i
                print p
                next
            }
            s = substr(s, 2)
        }
    }
}
Copy that into a file and use it like this (replace "script.awk" with
whatever you name it):
awk -v count=4 -f script.awk your_data.files
If you don't ask for a specific sequence length (with -v count=N) the
script assumes 4.
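Note that the index(sequence, p) test in the script only accepts a candidate that occurs inside "0123456789", so it appears to read "adjacent digits" as consecutive ascending digits (1234, 4567), not as any run of n digits. A minimal bash sketch of that same check, with a made-up function name, might look like this:

```shell
#!/usr/bin/env bash
# is_ascending_run STR: succeed if STR is a non-empty substring of
# "0123456789", i.e. consecutive ascending digits (hypothetical helper
# mirroring the index(sequence, p) test in the awk script).
is_ascending_run() {
    [ -n "$1" ] || return 1
    case 0123456789 in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

With this check, 1234 and 4567 pass while 1919 and 6756 do not, which is consistent with the expected results for the two sample lines.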
| |
| Dan Mercer 2005-10-28, 4:53 pm |
| "William Park" <opengeometry@yahoo.ca> wrote in message news:f2abd$43603ea4$d8fe9d17$6594@PRIMUS.CA...
: Giacomo <a@b.cde> wrote:
: > I need to extract a substring of n adjacent digits from every single
: > line of a file. The position of the n digits is different from line to
: > line.
: >
: > For example:
: >
: > asdasd 123 asd 191991 1234
: > lijoioi 4567 asdi 67567 iojoii
: >
: > For n=4 the results for these two lines must be 1234 and 4567.
:
: a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
: RE='\<[0-9]{4}\>'
: echo "${a|+$RE}"
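This can be done in the shell. Assuming lines containing only lower case letters and
numbers:
ifs=$IFS
while read line
do
    # split on whitespace and letters, leaving the digit runs (plus empty fields)
    IFS="${ifs}abcdefghijklmnopqrstuvwxyz"
    set -- $line
    # re-split with the default IFS to drop the empty fields
    IFS=$ifs
    set -- $*
    for i
    do
        ((${#i} == 4)) && echo "$i"
    done
done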
Now, this isn't the efficient way to do it. That would probably be
best done in perl.
Dan Mercer
| |
| William Park 2005-10-28, 4:53 pm |
| Dan Mercer <dmercer@mn.rr.com> wrote:
> "William Park" <opengeometry@yahoo.ca> wrote in message news:f2abd$43603ea4$d8fe9d17$6594@PRIMUS.CA...
> : Giacomo <a@b.cde> wrote:
> : > asdasd 123 asd 191991 1234
> : > lijoioi 4567 asdi 67567 iojoii
> : >
> : > For n=4 the results for these two lines must be 1234 and 4567.
> :
> : a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
> : RE='\<[0-9]{4}\>'
> : echo "${a|+$RE}"
>
> This can be done in the shell. Assuming lines containing only lower case letters and
> numbers:
>
> ifs=$IFS
> while read line
> do
>     # split on whitespace and letters, leaving the digit runs (plus empty fields)
>     IFS="${ifs}abcdefghijklmnopqrstuvwxyz"
>     set -- $line
>     # re-split with the default IFS to drop the empty fields
>     IFS=$ifs
>     set -- $*
>     for i
>     do
>         ((${#i} == 4)) && echo "$i"
>     done
> done
>
> Now, this isn't the efficient way to do it. That would probably be
> best done in perl.
Interesting approach. I would probably do it as
for i in `tr -c '0-9' ' ' < file`; do
[ ${#i} -eq 4 ] && echo $i
done
or if there is lots of repetition,
func()
{
[ ${#1} -eq 4 ]
}
set -- `tr -c '0-9' ' ' < file`
echo ${@|?func}
--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
| |
| Dan Mercer 2005-10-28, 8:48 pm |
|
"William Park" <opengeometry@yahoo.ca> wrote in message news:5202b$43617e07$d8fea14c$3893@PRIMUS.CA...
: Dan Mercer <dmercer@mn.rr.com> wrote:
: > "William Park" <opengeometry@yahoo.ca> wrote in message news:f2abd$43603ea4$d8fe9d17$6594@PRIMUS.CA...
: > : Giacomo <a@b.cde> wrote:
: > : > asdasd 123 asd 191991 1234
: > : > lijoioi 4567 asdi 67567 iojoii
: > : >
: > : > For n=4 the results for these two lines must be 1234 and 4567.
: > :
: > : a='asdasd 123 asd 191991 1234 lijoioi 4567 asdi 67567 iojoii'
: > : RE='\<[0-9]{4}\>'
: > : echo "${a|+$RE}"
: >
: > This can be done in the shell. Assuming lines containing only lower case letters and
: > numbers:
: >
: > ifs=$IFS
: > while read line
: > do
: >     # split on whitespace and letters, leaving the digit runs (plus empty fields)
: >     IFS="${ifs}abcdefghijklmnopqrstuvwxyz"
: >     set -- $line
: >     # re-split with the default IFS to drop the empty fields
: >     IFS=$ifs
: >     set -- $*
: >     for i
: >     do
: >         ((${#i} == 4)) && echo "$i"
: >     done
: > done
: >
: > Now, this isn't the efficient way to do it. That would probably be
: > best done in perl.
:
: Interesting approach. I would probably do it as
: for i in `tr -c '0-9' ' ' < file`; do
: [ ${#i} -eq 4 ] && echo $i
: done
: or if there is lots of repetition,
: func()
: {
: [ ${#1} -eq 4 ]
: }
: set -- `tr -c '0-9' ' ' < file`
: echo ${@|?func}
But the challenge is NOT to invoke external programs (;-)
Dan Mercer
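For what it is worth, the tr step can also be replaced by a shell builtin. The following is only a sketch, assuming bash (for the `${var//pattern/string}` expansion) and using an invented function name; it turns every non-digit into a space and lets ordinary word splitting isolate the digit runs, so no external program is invoked.

```shell
#!/usr/bin/env bash
# digit_runs N: print every run of exactly N digits found on stdin,
# using only shell builtins (hypothetical helper name).
digit_runs() {
    local n=$1 line word
    while IFS= read -r line; do
        # replace each non-digit with a space; the result contains only
        # digits and spaces, so leaving it unquoted is safe here
        for word in ${line//[^0-9]/ }; do
            [ "${#word}" -eq "$n" ] && printf '%s\n' "$word"
        done
    done
    return 0
}
```

Usage would be `digit_runs 4 < file`. Unlike the IFS approach, this does not assume the non-digit characters are lower case letters.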