Help extracting strings via awk.
| oleg.rakhmanchik@gmail.com 2007-12-06, 7:25 pm |
| Hi,
I need help extracting urls from a large text file. I don't have
control over the format of the file so it is always different, but the
urls are always in <url>..</url> tags. The text is always on the same
line without line breaks.
asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd
adfsdf sd sdgdfg<url>...
The surrounding text is always different and I need the quickest and
most efficient way to extract just the text between the 2 tags and
output it somewhere. Right now I do this with several commands and it
takes a while for a large file, but I know there is probably a quicker
and better way to do this. Please help.
| |
| Ed Morton 2007-12-07, 1:37 am |
|
On 12/6/2007 7:20 PM, oleg.rakhmanchik@gmail.com wrote:
> Hi,
>
> I need help extracting urls from a large text file. I don't have
> control over the format of the file so it is always different, but the
> urls are always in <url>..</url> tags. The text is always on the same
> line without line breaks.
>
> asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd
> adfsdf sd sdgdfg<url>...
>
> The surrounding text is always different and I need the quickest and
> most efficient way to extract just the text between the 2 tags and
> output it somewhere. Right now I do this with several commands and it
> takes a while for a large file, but I know there is probably a quicker
> and better way to do this. Please help.
With GNU awk:
gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
Ed.
| |
| Steffen Schuler 2007-12-07, 1:37 am |
| On Thu, 06 Dec 2007 20:59:01 -0600, Ed Morton wrote:
> On 12/6/2007 7:20 PM, oleg.rakhmanchik@gmail.com wrote:
>
> With GNU awk:
>
> gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> Ed.
With Perl:
perl -ne 'for (/<url>(.*?)<\/url>/g) {print "$_\n"}' file
Regards,
Steffen "goedel" Schuler
| |
| oleg.rakhmanchik@gmail.com 2007-12-07, 1:29 pm |
| On Dec 6, 9:59 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> On 12/6/2007 7:20 PM, oleg.rakhmanc...@gmail.com wrote:
>
> With GNU awk:
>
> gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> Ed.
These work perfectly, thank you.
| |
| John W. Krahn 2007-12-07, 7:23 pm |
| Steffen Schuler wrote:
>
> On Thu, 06 Dec 2007 20:59:01 -0600, Ed Morton wrote:
>
>
> With Perl:
>
> perl -ne 'for (/<url>(.*?)<\/url>/g) {print "$_\n"}' file
The Perl version of that gawk program would be:
perl -F'<url>' -lane'BEGIN{$/="</url>"} print $F[-1]' file
John
--
use Perl;
program
fulfillment
| |
| William James 2007-12-08, 1:45 am |
|
oleg.rakhmanc...@gmail.com wrote:
> Hi,
>
> I need help extracting urls from a large text file. I don't have
> control over the format of the file so it is always different, but the
> urls are always in <url>..</url> tags. The text is always on the same
> line without line breaks.
>
> asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd
> adfsdf sd sdgdfg<url>...
>
> The surrounding text is always different and I need the quickest and
> most efficient way to extract just the text between the 2 tags and
> output it somewhere. Right now I do this with several commands and it
> takes a while for a large file, but I know there is probably a quicker
> and better way to do this. Please help.
ruby -ne 'x=nil; puts split(/<\/?url>/).reject{x=!x}' myfile
| |
| William James 2007-12-08, 1:45 am |
| On Dec 7, 8:14 pm, William James <w_a_x_...@yahoo.com> wrote:
> oleg.rakhmanc...@gmail.com wrote:
>
>
> ruby -ne 'x=nil; puts split(/<\/?url>/).reject{x=!x}' myfile
After splitting, print fields 1, 3, etc. (0 is the first
field).
ruby -ne 'puts split(/<\/?url>/).select{$_=!$_}' myfile
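The "keep fields 1, 3, etc." idea can also be spelled out without the toggling trick. This is only a minimal sketch, assuming every <url> tag is closed on the same line; the helper name extract_urls is hypothetical and does not come from the thread.

```ruby
# Hypothetical helper (not from the thread): return the strings found
# between <url> and </url> tags in one line of input.
def extract_urls(line)
  # Splitting on either tag alternates surrounding text and URLs:
  # index 0 is text, index 1 is a URL, index 2 is text, and so on.
  fields = line.chomp.split(%r{</?url>})
  # Keep only the odd-numbered fields, which hold the URLs.
  fields.each_with_index.select { |_, i| i.odd? }.map(&:first)
end

# Sample line taken from the original question.
line = "asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd"
puts extract_urls(line)
```

With the sample line this prints www.google.com and www.yahoo.com, one per line. Like the one-liners above, it would also report the text after a trailing <url> that is never closed, so truncated input may need an extra check.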