Unix administration - split by regex?


robert

2004-10-15, 9:29 pm

Hello all.

Does there exist a utility similar to split which will split by regex?

I'm trying to massage some data gathered from varied sources in preparation
for inputting it into a database. Currently the records are in a single
file.

Picture a large flat-file database that has records divided by a certain
known token pattern. 99 percent of these records are, say, five lines
long, with the rest of varying length because of some accident in creation
of the file. I don't know which ones are wrong, so I would like to split
along the separator tokens and then wc -l on the split output files so I can
readily see which records are broken.

Thanks.
-Robert

Stephane CHAZELAS

2004-10-15, 9:29 pm

2004-10-13, 03:05(-05), robert:
> Hello all.
>
> Does there exist a utility similar to split which will split by regex?
>
> I'm trying to massage some data gathered from varied sources in preparation
> for inputting it into a database. Currently the records are in a single
> file.

[...]

With nawk you can use a regexp for FS.

With GNU awk, I think, you can also use a regexp for RS.

Or you can use perl.
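
For anyone reading later, here is a rough sketch of the regexp-RS idea. It assumes an awk that accepts a regexp for RS (GNU awk and mawk do; POSIX leaves a multi-character RS unspecified). The sample data, the dashed separator lines, and the rec.N output names are all made up for illustration.

```shell
# Hypothetical sample: records separated by a line of three or more dashes.
work=$(mktemp -d)
printf '%s\n' \
  'id: 1' 'name: alpha' 'city: Oslo' \
  '---' \
  'id: 2' 'name: beta' \
  '-----' \
  'id: 3' 'name: gamma' 'city: Lima' \
  '---' > "$work/records.txt"

# RS is a regexp matching a whole separator line (plus the newline before it),
# so each awk record is one database record. Record N goes to $work/rec.N.
awk -v 'RS=\n---+\n' -v dir="$work" '{
    out = dir "/rec." NR
    print > out      # print appends a newline after the record text
    close(out)       # avoid running out of file descriptors on big inputs
}' "$work/records.txt"

# Line count of each piece, in glob order, joined on one line.
counts=$(for f in "$work"/rec.*; do wc -l < "$f"; done | xargs)
echo "$counts"
rm -rf "$work"
```

With this sample the second piece comes out one line shorter than the other two, which is the kind of outlier the original question wants to spot.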

--
Stephane

Liam Cunningham

2004-10-15, 9:29 pm

On Wed, 13 Oct 2004 03:05:02 -0500, robert wrote:

> Hello all.
>
> Does there exist a utility similar to split which will split by regex?
>

[snip]
>
> Thanks.
> -Robert

Try csplit (ie Context Split). The man pages should help with usage.
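
A minimal sketch of what that might look like, assuming GNU csplit (the '{*}' repeat count and -z are GNU features) and an invented file where every record starts with a line of dashes. The file names and the '^---' pattern are placeholders, not anything from the original data.

```shell
# Hypothetical sample: each record begins with a separator line of dashes.
work=$(mktemp -d)
printf '%s\n' \
  '---' 'id: 1' 'name: alpha' 'city: Oslo' \
  '-----' 'id: 2' 'name: beta' \
  '---' 'id: 3' 'name: gamma' 'city: Lima' > "$work/records.txt"

# Start a new piece before every line matching the regexp.
#   '{*}'  repeat the pattern until the input runs out
#   -z     drop the empty piece in front of the first separator
#   -s     do not print the byte count of each piece
#   -f     prefix for the output file names
csplit -s -z -f "$work/rec." "$work/records.txt" '/^---/' '{*}'

# Each piece keeps its separator line, so a healthy record here is 4 lines.
wc -l "$work"/rec.*
counts=$(for f in "$work"/rec.*; do wc -l < "$f"; done | xargs)
rm -rf "$work"
```

Because csplit splits before the matching line, the separator stays at the top of each piece; the odd record still stands out in the wc -l listing as the one with a different count.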


--

If at first you don't succeed,
read the manual......

William Park

2004-10-15, 9:29 pm

robert <root@wheel.invalid> wrote:
> Hello all.
>
> Does there exist a utility similar to split which will split by regex?
>
> I'm trying to massage some data gathered from varied sources in
> preparation for inputting it into a database. Currently the records
> are in a single file.
>
> Picture a large flat-file database that has records divided by a
> certain known token pattern. 99 percent of these records are, say,
> five lines long, with the rest of varying length because of some
> accident in creation of the file. I don't know which ones are wrong,
> so I would like to split along the separator tokens and then wc -l on
> the split output files so I can readily see which records are broken.


1. awk -v FS=... -v RS=...

2. csplit

3. Read the entire file into a string, and cut/slice based on regex.
You can use Python, Perl, or patched Bash shell
(freshmeat.net/projects/bashdiff).
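
Option 1 could be sketched roughly as follows. This assumes an awk that takes a regexp for RS (gawk or mawk); the sample records, the dashed separator, and the expected length of 3 lines are stand-ins chosen to keep the example short.

```shell
# Hypothetical sample: records separated by a line of three or more dashes.
sample=$(mktemp)
printf '%s\n' \
  'id: 1' 'name: alpha' 'city: Oslo' \
  '---' \
  'id: 2' 'name: beta' \
  '-----' \
  'id: 3' 'name: gamma' 'city: Lima' \
  '---' > "$sample"

# RS matches a whole separator line; FS is a newline, so NF is the number
# of lines in the record. Report only records with an unexpected length.
broken=$(awk -v 'RS=\n---+\n' -v 'FS=\n' -v want=3 \
  'NF != want { print "record " NR ": " NF " lines" }' "$sample")
echo "$broken"
rm -f "$sample"
```

This reports the broken records directly, without creating one file per record. For the real data, want would be 5 and RS would be whatever regexp matches the actual separator token.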

--
William Park <opengeometry@yahoo.ca>
Open Geometry Consulting, Toronto, Canada

robert

2004-10-15, 9:29 pm

begin Liam Cunningham <liam@consumercontact.com> wrote:
> On Wed, 13 Oct 2004 03:05:02 -0500, robert wrote:
>
> [snip]
> Try csplit (ie Context Split). The man pages should help with usage.
>
>



THANK YOU!!!
csplit works perfectly for my needs.

robert

2004-10-15, 9:29 pm

begin William Park <opengeometry@yahoo.ca> wrote:
> robert <root@wheel.invalid> wrote:
> [...]
>
> 1. awk -v FS=... -v RS=...
>
> 2. csplit
>
> 3. Read the entire file into a string, and cut/slice based on regex.
> You can use Python, Perl, or patched Bash shell
> (freshmeat.net/projects/bashdiff).
>



Thanks, William.
My first thought was reading into a C string and chopping it up, but I didn't
want to reinvent the wheel. csplit is exactly what I was looking for.
I may need more than that in the future though, so thanks for the awk FS/RS
pointer.

BTW, bashdiff sounds awesome. I'm going to try it out next week when I get
some free time. I'm currently using a lot of bash scripts that call psql,
so the bashdiff builtin PostgreSQL operations sound particularly exciting
to me.

robert

2004-10-15, 9:29 pm

begin Stephane CHAZELAS <this.address@is.invalid> wrote:
> 2004-10-13, 03:05(-05), robert:
> [...]
>
> With nawk you can use a regexp for FS.
>
> With GNU awk, I think, you can also use a regexp for RS.
>
> Or you can use perl.
>



Thanks, Stephane.
csplit appears to meet my needs for now.

Coming from a mostly C background, I've never really been able to wrap
my brain around Perl. That said, ironically, I do most of my trivial
regex work with simple Perl scripts. I've never used awk, but the
FS/RS stuff looks promising.


Copyright 2003 - 2008 webservertalk.com