split by regex?
|
| robert 2004-10-15, 9:29 pm |
| Hello all.
Does there exist a utility similar to split which will split by regex?
I'm trying to massage some data gathered from varied sources in preparation
for inputting it into a database. Currently the records are in a single
file.
Picture a large flat-file database that has records divided by a certain
known token pattern. 99 percent of these records are, say, five lines
long, with the rest of varying length because of some accident in creation
of the file. I don't know which ones are wrong, so I would like to split
along the separator tokens and then wc -l on the split output files so I can
readily see which records are broken.
Thanks.
-Robert
| |
| Stephane CHAZELAS 2004-10-15, 9:29 pm |
| 2004-10-13, 03:05(-05), robert:
> Hello all.
>
> Does there exist a utility similar to split which will split by regex?
>
> I'm trying to massage some data gathered from varied sources in preparation
> for inputting it into a database. Currently the records are in a single
> file.
[...]
With nawk you can use a regexp for FS.
With GNU awk, I think, you can also use a regexp for RS.
Or you can use perl.
--
Stephane
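
For anyone following up on the RS suggestion, here is a minimal sketch. Everything concrete in it is made up for illustration: the file name records.txt, the separator line "@@", and the assumption that a good record is 2 lines long. It also assumes an awk that treats a multi-character RS as a regexp (gawk and mawk do; some older awks only look at the first character).

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: records separated by a line containing "@@".
# The second record is one line too long.
printf '%s\n' a1 a2 '@@' b1 b2 b3 '@@' c1 c2 > records.txt

# RS is a regexp matching the separator line together with the newlines
# around it; FS is a newline, so NF is the number of lines in the record.
out=$(awk -v RS='\n@@\n' -v FS='\n' '
    { sub(/\n$/, "") }    # the last record keeps the final newline; drop it
    NF != 2 { print "record " NR ": " NF " lines" }
' records.txt)

echo "$out"
```

On the sample data this should flag only record 2, which has 3 lines. Nothing is written to disk per record, which may be handy when the file is large and you only want the list of suspects.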
| |
| Liam Cunningham 2004-10-15, 9:29 pm |
| On Wed, 13 Oct 2004 03:05:02 -0500, robert wrote:
> Hello all.
>
> Does there exist a utility similar to split which will split by regex?
>
[snip]
>
> Thanks.
> -Robert
Try csplit (i.e., Context Split). The man pages should help with usage.
--
If at first you don't succeed,
read the manual......
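
A rough sketch of how the csplit approach might look for the problem as described. The file name records.txt, the "@@" separator line, the rec. prefix and the "2 data lines per good record" rule are all invented for the example, and the -z option and the {*} repeat count are GNU csplit features, so this assumes GNU coreutils.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: every record starts with an "@@" line.
# A good record has 2 data lines; the second record has 3.
printf '%s\n' '@@' a1 a2 '@@' b1 b2 b3 '@@' c1 c2 > records.txt

# Split before each line matching ^@@$, repeating as often as possible ({*}).
# -s: do not print piece sizes
# -z: drop the empty piece in front of the first separator
# -f rec.: name the pieces rec.00, rec.01, ...
csplit -s -z -f rec. records.txt '/^@@$/' '{*}'

# A good piece is 3 lines long (separator + 2 data lines); report the rest.
broken=""
for f in rec.*; do
    n=$(wc -l < "$f")
    if [ "$n" -ne 3 ]; then
        echo "$f: $n lines"
        broken="$broken $f"
    fi
done
```

With the sample data this should report rec.01 as the odd one out, with 4 lines. Since each piece ends up in its own file, a plain `wc -l rec.*` also gives the overview the original post asked for.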
| |
| William Park 2004-10-15, 9:29 pm |
| robert <root@wheel.invalid> wrote:
> Hello all.
>
> Does there exist a utility similar to split which will split by regex?
>
> I'm trying to massage some data gathered from varied sources in
> preparation for inputting it into a database. Currently the records
> are in a single file.
>
> Picture a large flat-file database that has records divided by a
> certain known token pattern. 99 percent of these records are, say,
> five lines long, with the rest of varying length because of some
> accident in creation of the file. I don't know which ones are wrong,
> so I would like to split along the separator tokens and then wc -l on
> the split output files so I can readily see which records are broken.
1. awk -v FS=... -v RS=...
2. csplit
3. Read the entire file into a string, and cut/slice based on regex.
You can use Python, Perl, or patched Bash shell
(freshmeat.net/projects/bashdiff).
--
William Park <opengeometry@yahoo.ca>
Open Geometry Consulting, Toronto, Canada
| |
| robert 2004-10-15, 9:29 pm |
| begin Liam Cunningham <liam@consumercontact.com> wrote:
> On Wed, 13 Oct 2004 03:05:02 -0500, robert wrote:
>
> [snip]
> Try csplit (i.e., Context Split). The man pages should help with usage.
>
>
THANK YOU!!!
csplit works perfectly for my needs.
| |
| robert 2004-10-15, 9:29 pm |
| begin William Park <opengeometry@yahoo.ca> wrote:
> robert <root@wheel.invalid> wrote:
> [...]
>
> 1. awk -v FS=... -v RS=...
>
> 2. csplit
>
> 3. Read the entire file into a string, and cut/slice based on regex.
> You can use Python, Perl, or patched Bash shell
> (freshmeat.net/projects/bashdiff).
>
Thanks, William.
My first thought was reading into a C string and chopping it up, but I didn't
want to reinvent the wheel. csplit is exactly what I was looking for.
I may need more than that in the future though, so thanks for the awk FS/RS
pointer.
BTW, bashdiff sounds awesome. I'm going to try it out next week when I get
some free time. I'm currently using a lot of bash scripts that call psql,
so the bashdiff builtin PostgreSQL operations sound particularly exciting
to me.
| |
| robert 2004-10-15, 9:29 pm |
| begin Stephane CHAZELAS <this.address@is.invalid> wrote:
> 2004-10-13, 03:05(-05), robert:
> [...]
>
> With nawk you can use a regexp for FS.
>
> With GNU awk, I think, you can also use a regexp for RS.
>
> Or you can use perl.
>
Thanks, Stephane.
csplit appears to meet my needs for now.
Coming from a mostly C background, I've never really been able to wrap
my brain around Perl. Ironically, though, I do most of my trivial regex
work with simple Perl scripts. I've never used awk, but the FS/RS stuff
looks promising.