Unix Shell - Parse irregular data, dump into delimited text file

This is Interesting: Free IT Magazines  
Home > Archive > Unix Shell > November 2005 > Parse irregular data, dump into delimited text file





You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

Author Parse irregular data, dump into delimited text file
d@rren.cymraeg.org

2005-11-27, 5:53 pm

I've been given an MS Word document containing information to input into
a database. I've knocked it into shape using various unix tools eg.
sed, cut etc. so now I have data in a plain text file like this :

name|address|postcode|telephone

The address field contains data in an irregular form, eg.
12, the high street, town, place, biggerplace
The vicarage, town place

I need to be able to format address field above ready for importing
into another database. In this new database, I have 3 fields for
address (address1, address2, address3).

So my problem is how to cut this address data and then put it back in a
text file with delimiter. address3 should contain only one word,
however, address1 and address2 may contain more than one word. When
filling in the fields, data should be added from left to right, or in
the order address1 then address2 then address3. If there is an address
field left with a blank, that is not a problem as it will be handled by
the mailmerge software.

Help on this much appreciated.

Darren

--
Aberdare Blog http://blog.aberdare.org/ Only Blog In The Valley
Ed Morton

2005-11-27, 5:53 pm

d@rren.cymraeg.org wrote:
> I've been given an MS Word document containing information to input into
> a database. I've knocked it into shape using various unix tools eg.
> sed, cut etc. so now I have data in a plain text file like this :
>
> name|address|postcode|telephone
>
> The address field contains data in an irregular form, eg.
> 12, the high street, town, place, biggerplace
> The vicarage, town place
>
> I need to be able to format address field above ready for importing
> into another database. In this new database, I have 3 fields for
> address (address1, address2, address3).
>
> So my problem is how to cut this address data and then put it back in a
> text file with delimiter. address3 should contain only one word,
> however, address1 and address2 may contain more than one word. When
> filling in the fields, data should be added from left to right, or in
> the order address1 then address2 then address3. If there is an address
> field left with a blank, that is not a problem as it will be handled by
> the mailmerge software.
>
> Help on this much appreciated.
>
> Darren
>


Let's start with this:

awk 'BEGIN{FS=OFS="|"}{
c = split($2,addr,",")
addr3 = addr[c--]
addr2 = addr[c--]
addr1 = ""
for (i=1;i<=c;i++) {
addr1 = addr1 addr[i]
}
$2 = addr1 "," addr2 "," addr3
}1' file

Follow up with some sample input, expected output and rationale for any
problems with the above.

Ed.
Bill Marcum

2005-11-27, 5:53 pm

On 27 Nov 2005 18:10:59 GMT, d@rren.cymraeg.org
<d@rren.cymraeg.org> wrote:
> I've been given an MS Word document containing information to input into
> a database. I've knocked it into shape using various unix tools eg.
> sed, cut etc. so now I have data in a plain text file like this :
>
> name|address|postcode|telephone
>
> The address field contains data in an irregular form, eg.
> 12, the high street, town, place, biggerplace
> The vicarage, town place
>
> I need to be able to format address field above ready for importing
> into another database. In this new database, I have 3 fields for
> address (address1, address2, address3).
>
> So my problem is how to cut this address data and then put it back in a
> text file with delimiter. address3 should contain only one word,
> however, address1 and address2 may contain more than one word. When
> filling in the fields, data should be added from left to right, or in
> the order address1 then address2 then address3. If there is an address
> field left with a blank, that is not a problem as it will be handled by
> the mailmerge software.
>
> Help on this much appreciated.
>

How do you define address1 and address2?

--
Higher education helps your earning capacity. Ask any college professor.
d@rren.cymraeg.org

2005-11-27, 5:53 pm

Bill Marcum <bmarcum@iglou.com> wrote:

> How do you define address1 and address2?


Well, address1, address2, and address3 may contain anything one would
normally associate with an address.

--
Aberdare Blog http://blog.aberdare.org/ Only Blog In The Valley
d@rren.cymraeg.org

2005-11-27, 5:53 pm

Ed Morton <morton@lsupcaemnt.com> wrote:

> Let's start with this:
>
> awk 'BEGIN{FS=OFS="|"}{
> c = split($2,addr,",")
> addr3 = addr[c--]
> addr2 = addr[c--]
> addr1 = ""
> for (i=1;i<=c;i++) {
> addr1 = addr1 addr[i]
> }
> $2 = addr1 "," addr2 "," addr3
> }1' file
>
> Follow up with some sample input, expected output and rationale for any
> problems with the above.


Sample input as per my original post.

The output I get using the above is :

1, A Streetname, Blahblah, Foofoo|,,

Don't assume I know anything awk.

Darren

--
Aberdare Blog http://blog.aberdare.org/ Only Blog In The Valley
Ed Morton

2005-11-27, 5:53 pm

d@rren.cymraeg.org wrote:
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>
>
> Sample input as per my original post.


You didn't have any sample input in your original post. All you had was
an example of the address field:

12, the high street, town, place, biggerplace
The vicarage, town place

> The output I get using the above is :
>
> 1, A Streetname, Blahblah, Foofoo|,,


That seems unlikely at best if the above really were the address fields
you used in your sample input. Assuming you actually had some other
input that produced that output - is the above output what you wanted or
not?

> Don't assume I know anything awk.


awk is simple. If you have any experience with an ALGOL-based language
(e.g. Pascal, Ada, C), you'll be able to understand and modify any awk
solution posted. Don't assume we know anything about your actual input
or expected output. So far you haven't posted any complete sample input
or output records, described the output field separators (input
separators for your new database), or provided any other details.

Ed.

d@rren.cymraeg.org

2005-11-27, 5:53 pm

Ed Morton <morton@lsupcaemnt.com> wrote:

>
> You didn't have any sample input in your original post. All you had was
> an example of the address field:


Okay...

> 12, the high street, town, place, biggerplace
> The vicarage, town place


I have records like the following ...

name1|12, the high street, town, place, biggerplace|postcode1|telephone1
name2|The vicarage, town, place|postcode2|telephone2

My problem is breaking up field two above into useful chunks and
slotting them into three other fields.

>
> That seems unlikely


Well, unlikely or not, that is the output. If you want the exact data I
have, don't bother replying as I can't print that.

Darren

--
Aberdare Blog http://blog.aberdare.org/ Only Blog In The Valley
Janis Papanagnou

2005-11-27, 5:53 pm

d@rren.cymraeg.org wrote:
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
> Okay...
>
>
> I have records like the following ...
>
> name1|12, the high street, town, place, biggerplace|postcode1|telephone1
> name2|The vicarage, town, place|postcode2|telephone2
>
> My problem is breaking up field two above into useful chunks and
> slotting them into three other fields.
>

Hmm, that should not be the case. Are you executing the awk program from
within a Unix shell?

Ed's program produces with your data...

name1|12 the high street town, place, biggerplace|postcode1|telephone1
name2|The vicarage, town, place|postcode2|telephone2

If you want '|' instead of the ',' change the respective program line to

$2 = addr1 "|" addr2 "|" addr3
[vbcol=seagreen]
>
> Well, unlikely or not, that is the output. If you want the exact data I
> have, don't bother replying as I can't print that.


If you get the wrong output, try putting the awk program

BEGIN{FS=OFS="|"}{
c = split($2,addr,",")
addr3 = addr[c--]
addr2 = addr[c--]
addr1 = ""
for (i=1;i<=c;i++) {
addr1 = addr1 addr[i]
}
$2 = addr1 "|" addr2 "|" addr3
}1

in an own file "yourfile" and call it from the shell (works also on WinDOS
shells:

awk -f yourfile yourdata


Janis
William James

2005-11-27, 5:53 pm

d@rren.cymraeg.org wrote:
> I've been given an MS Word document containing information to input into
> a database. I've knocked it into shape using various unix tools eg.
> sed, cut etc. so now I have data in a plain text file like this :
>
> name|address|postcode|telephone
>
> The address field contains data in an irregular form, eg.
> 12, the high street, town, place, biggerplace
> The vicarage, town place
>
> I need to be able to format address field above ready for importing
> into another database. In this new database, I have 3 fields for
> address (address1, address2, address3).
>
> So my problem is how to cut this address data and then put it back in a
> text file with delimiter. address3 should contain only one word,
> however, address1 and address2 may contain more than one word. When
> filling in the fields, data should be added from left to right, or in
> the order address1 then address2 then address3. If there is an address
> field left with a blank, that is not a problem as it will be handled by
> the mailmerge software.


This input

name1|12, high street, town, place, biggerplace|postcode1|telephone1
name2|The vicarage, town, place|postcode2|telephone2
name3|The vicarage, place|postcode2|telephone2

produces this output

name1|12 high street|town, place|biggerplace|postcode1|telephone1
name2|The vicarage|town|place|postcode2|telephone2

name3|The vicarage|place||postcode2|telephone2

The language is Ruby:

# Read each line of the file given on the command line.
ARGF.each { |line|
# Remove the newline at the end of the string.
line.chomp!

# Split the string into an array.
array = line.split( "|" )

# Split the address field on commas, removing surrounding
# whitespace.
address = array[1].split( /\s*,\s*/ )

# I'm assuming that if the first part of the address is
# a number, it should be combined with the next part.
if address[0] =~ /^\d+$/
address[0..1] = address[0..1].join( " " )
end

if address.size > 3
# Combine all but the first and last part into one part.
address[1..-2] = address[1..-2].join( ", " )
else
# If we have fewer than 3 parts, tack on an empty string.
address.push( "" ) while address.size < 3
end

array[1] = address

# Assuming that the ouput field-separator is "|".
puts array.flatten.join( "|" )
}

Ed Morton

2005-11-27, 5:53 pm

d@rren.cymraeg.org wrote:
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>
>
> Okay...
>
>
>
>
> I have records like the following ...
>
> name1|12, the high street, town, place, biggerplace|postcode1|telephone1
> name2|The vicarage, town, place|postcode2|telephone2
>
> My problem is breaking up field two above into useful chunks and
> slotting them into three other fields.


I gave you a script to do that.

>
>
>
> Well, unlikely or not, that is the output.


That is not the output from the input you posted.

If you want the exact data I
> have, don't bother replying as I can't print that.


What I was looking for was a sample input and expected output FOR THAT
INPUT so I could help you. You've now finally provided some sample input
but you still haven't provided the expected output or even told us if
the script I posted earlier produces the output you wanted or, if not,
what it is you'd like to see different.

You seem to be developing an attitude so I appologise for bothering you
and I'll now take your advice and stop replying. Good luck...

Ed.
Mark Hobley

2005-11-27, 5:53 pm

d@rren.cymraeg.org wrote:

> Well, unlikely or not, that is the output. If you want the exact data I
> have, don't bother replying as I can't print that.


You are telling us that you have abstract input and giving us examples of
abstract output.

If we can't see the input or output, it makes it very difficult to help you.

You could make up a couple of false non-existing test addresses that do not
parse properly, then post the input and the resulting output, and the output
that you want to get.

That way it would be clear to us all, and we can advise you on the best
solution.

Regards,

Mark.

--
Mark Hobley
393 Quinton Road West
QUINTON
Birmingham
B32 1QE

Telephone: (0121) 247 1596
International: 0044 121 247 1596

Email: markhobley at hotpop dot donottypethisbit com

http://markhobley.yi.org/

d@rren.cymraeg.org

2005-11-28, 7:50 am

Many thanks to William James for the ruby code which does the job.

Darren

--
Aberdare Blog http://blog.aberdare.org/ Only Blog In The Valley
Sponsored Links






Free braindumps | Software forum | Database administration forum

Copyright 2003 - 2008 webservertalk.com