8

I'm trying to parse a test file. The file has a username, address, and phone number in the following format:

Name: John Doe1
address : somewhere
phone: 123-123-1234

Name: John Doe2
address : somewhere
phone: 123-123-1233

Name: John Doe3
address : somewhere
phone: 123-123-1232

The file goes on like this for almost 10k users :) What I would like to do is convert those rows to columns, for example:

Name: John Doe1                address : somewhere          phone: 123-123-1234
Name: John Doe2                address : somewhere          phone: 123-123-1233
Name: John Doe3                address : somewhere          phone: 123-123-1232

I would prefer to do it in bash, but if you know how to do it in Python that would be great too. The file that has this information is /root/docs/information. Any tips or help would be much appreciated.

2
  • Good initial question, @tafiela. But don't forget to mention in future questions what you have tried. Commented Oct 11, 2012 at 3:25
  • Is the address really only one line after the colon? Commented Oct 11, 2012 at 3:46

11 Answers

5

One way with GNU awk:

awk 'BEGIN { FS="\n"; RS=""; OFS="\t\t" } { print $1, $2, $3 }' file.txt

Results:

Name: John Doe1     address : somewhere     phone: 123-123-1234
Name: John Doe2     address : somewhere     phone: 123-123-1233
Name: John Doe3     address : somewhere     phone: 123-123-1232

Note that I've set the output field separator (OFS) to two tab characters (\t\t). You can change this to whatever character or set of characters you please. HTH.


1 Comment

@VictorHugo: RS is short for record separator. By default RS is set to \n (newline), which makes awk process the file line by line. When we set it to the empty string (""), we change awk's definition of a record: records are then separated by one or more blank lines. Since each record in the file is separated by an empty line, setting RS="" makes for an easy solution. HTH.
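
For comparison, the blank-line record splitting described in this comment (RS="" with FS="\n") can be sketched in Python. This is only a rough equivalent, not a reimplementation of awk; the function name `records_to_rows` and the sample text are made up for illustration.

```python
import re

def records_to_rows(text, sep="\t\t"):
    """Join the lines of each blank-line-separated record with `sep`."""
    rows = []
    # One or more blank lines end a record, much like awk's RS="".
    for record in re.split(r"\n\s*\n", text.strip()):
        # Each line of the record is one field, like FS="\n".
        fields = [line.strip() for line in record.splitlines()]
        rows.append(sep.join(fields))
    return rows

# Hypothetical sample in the question's format.
sample = """Name: John Doe1
address : somewhere
phone: 123-123-1234

Name: John Doe2
address : somewhere
phone: 123-123-1233
"""

for row in records_to_rows(sample):
    print(row)
```

The `sep` argument plays the role of OFS, so passing a different string changes the column separator.
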
3

With a short Perl one-liner:

$ perl -ne 'END{print "\n"}chomp; /^$/ ? print "\n" : print "$_\t\t"' file.txt

OUTPUT

Name: John Doe1         address : somewhere             phone: 123-123-1234
Name: John Doe2         address : somewhere             phone: 123-123-1233
Name: John Doe3         address : somewhere             phone: 123-123-1232


2

Using paste, we can join the lines in the file:

$ paste -s -d"\t\t\t\n" file
Name: John Doe1 address : somewhere     phone: 123-123-1234
Name: John Doe2 address : somewhere     phone: 123-123-1233
Name: John Doe3 address : somewhere     phone: 123-123-1232

2 Comments

@sputnick True, but this does the hard part. There are myriad utilities to expand tabs.
Yes, but in this case, you need 2 pipes ;)
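
On the tab-expansion point raised in these comments: if the rows end up in Python anyway, the built-in `str.expandtabs` pads each tab out to the next tab stop. A minimal sketch, assuming tab-separated rows like the `paste` output; the tab size of 32 is an arbitrary choice and the sample rows are typed in by hand.

```python
# Hand-typed rows standing in for the tab-separated paste output.
rows = [
    "Name: John Doe1\taddress : somewhere\tphone: 123-123-1234",
    "Name: John Doe2\taddress : somewhere\tphone: 123-123-1233",
]

# expandtabs(32) replaces each tab with spaces up to the next multiple of 32,
# so every field starts at a fixed column.
aligned = [row.expandtabs(32) for row in rows]

for row in aligned:
    print(row)
```

This only lines up if no field is longer than the tab size, so the value needs to be at least as wide as the longest field plus one.
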
1

This seems to do basically what you want:

information = 'information'  # file path

with open(information, 'rt') as infile:
    data = infile.read()

# records are separated by blank lines
data = data.split('\n\n')

for group in data:
    # join the lines of each record with spaces
    print(group.replace('\n', '     '))

Output:

Name: John Doe1     address : somewhere     phone: 123-123-1234
Name: John Doe2     address : somewhere     phone: 123-123-1233
Name: John Doe3     address : somewhere     phone: 123-123-1232     


1

I know you did not mention awk, but it solves your problem nicely:

awk 'BEGIN {RS="";FS="\n"} {print $1,$2,$3}' data.txt


1

Most of the solutions here are just reformatting the data in the file that you are reading. Maybe that is all that you want.

If you actually want to parse the data, put it in a data structure.

Here is an example in Python:

data="""\
Name: John Doe2
address : 123 Main St, Los Angeles, CA 95002
phone: 213-123-1234

Name: John Doe1
address : 145 Pearl St, La Jolla, CA 92013
phone: 858-123-1233

Name: Billy Bob Doe3
address : 454 Heartland St, Mobile, AL 00103
phone: 205-123-1232""".split('\n\n')      # just a fill-in for your file
                                          # you would use `with open(file) as data:`

addr={}
w0,w1,w2=0,0,0             # these keep track of the max width of the field 
for line in data:
    # take the text after the first colon of each line in the record
    fields=[e.split(':',1)[1].strip() for e in line.split('\n')]
    nam=fields[0].split()
    name=nam[-1]+', '+' '.join(nam[0:-1])
    addr[(name,fields[2])]=fields
    w0,w1,w2=[max(t) for t in zip(map(len,fields),(w0,w1,w2))]

Now you have the freedom to sort, change the format, put it in a database, etc.

This prints your format with that data, sorted:

for add in sorted(addr.keys()):
    print('Name: {0:{w0}} Address: {1:{w1}} phone: {2:{w2}}'.format(*addr[add],w0=w0,w1=w1,w2=w2))

Prints:

Name: John Doe1      Address: 145 Pearl St, La Jolla, CA 92013   phone: 858-123-1233
Name: John Doe2      Address: 123 Main St, Los Angeles, CA 95002 phone: 213-123-1234
Name: Billy Bob Doe3 Address: 454 Heartland St, Mobile, AL 00103 phone: 205-123-1232

That is sorted by the last name, first name used in the dict key.

Now print it sorted by area code:

for add in sorted(addr.keys(),key=lambda x: addr[x][2] ):
    print('Name: {0:{w0}} Address: {1:{w1}} phone: {2:{w2}}'.format(*addr[add],w0=w0,w1=w1,w2=w2))

Prints:

Name: Billy Bob Doe3 Address: 454 Heartland St, Mobile, AL 00103 phone: 205-123-1232
Name: John Doe2      Address: 123 Main St, Los Angeles, CA 95002 phone: 213-123-1234
Name: John Doe1      Address: 145 Pearl St, La Jolla, CA 92013   phone: 858-123-1233

But since you have the data in an indexed dictionary, you can print it as a table instead, sorted by zip code:

# print table header
print('|{0:^{w0}}|{1:^{w1}}|{2:^{w2}}|'.format('Name','Address','Phone',w0=w0+2,w1=w1+2,w2=w2+2))
print('|{0:^{w0}}|{1:^{w1}}|{2:^{w2}}|'.format('----','-------','-----',w0=w0+2,w1=w1+2,w2=w2+2))
# print data sorted by last field of the address - probably a zip code
for add in sorted(addr.keys(),key=lambda x: addr[x][1].split()[-1]):
    print('|{0:>{w0}}|{1:>{w1}}|{2:>{w2}}|'.format(*addr[add],w0=w0+2,w1=w1+2,w2=w2+2))

Prints:

|      Name      |              Address               |    Phone     |
|      ----      |              -------               |    -----     |
|  Billy Bob Doe3|  454 Heartland St, Mobile, AL 00103|  205-123-1232|
|       John Doe1|    145 Pearl St, La Jolla, CA 92013|  858-123-1233|
|       John Doe2|  123 Main St, Los Angeles, CA 95002|  213-123-1234|
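
Once the records are in a dictionary, changing the output format is a small step. As a sketch of that idea, the standard `csv` module can write the same fields as comma-separated rows. The `addr` dictionary below is a hand-built stand-in with the same shape as the one in this answer, and `to_csv` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the addr dict: (sort name, phone) -> [name, address, phone]
addr = {
    ("Doe2, John", "213-123-1234"):
        ["John Doe2", "123 Main St, Los Angeles, CA 95002", "213-123-1234"],
    ("Doe1, John", "858-123-1233"):
        ["John Doe1", "145 Pearl St, La Jolla, CA 92013", "858-123-1233"],
}

def to_csv(records, out):
    """Write a header plus one CSV row per record, sorted by dictionary key."""
    writer = csv.writer(out)
    writer.writerow(["Name", "Address", "Phone"])
    for key in sorted(records):
        writer.writerow(records[key])

# An in-memory buffer keeps the sketch self-contained; for a real file,
# open it with newline='' as the csv documentation recommends.
buffer = io.StringIO()
to_csv(addr, buffer)
print(buffer.getvalue())
```

The writer quotes the address field automatically because it contains commas, which is the main reason to prefer `csv` over joining strings by hand here.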


0

You should be able to parse this using the split() method on a string:

line = "Name: John Doe1"
key, value = line.split(":", 1)  # split on the first colon only
print(key.strip())    # Name
print(value.strip())  # John Doe1


0

You can iterate over the lines and print them in columns like this:

with open("/path/to/data") as f:
    for line in f:
        if line.strip():
            # drop the trailing \n and suppress print's own newline
            # so the next field continues on the same row
            print("%s\t\t" % line.strip(), end="")
        else:
            # re-use the blank lines to end the current row
            print()


0

#!/usr/bin/env python

def parse(inputfile, outputfile):
    dictInfo = {'Name':None, 'address':None, 'phone':None}
    for line in inputfile:
        if line.startswith('Name'):
            dictInfo['Name'] = line.split(':')[1].strip()
        elif line.startswith('address'):
            dictInfo['address'] = line.split(':')[1].strip()
        elif line.startswith('phone'):
            dictInfo['phone'] = line.split(':')[1].strip()
            # phone is the last field of a record, so write the row now
            s = 'Name: '+dictInfo['Name']+'\t'+'address: '+dictInfo['address'] \
                +'\t'+'phone: '+dictInfo['phone']+'\n'
            outputfile.write(s)

if __name__ == '__main__':
    with open('output.txt', 'w') as outputfile:
        with open('infomation.txt') as inputfile:
            parse(inputfile, outputfile)


0

A solution using sed.

cat input.txt | sed '/^$/d' | sed 'N; s:\n:\t\t:; N; s:\n:\t\t:'

  1. First pipe, sed '/^$/d', removes the blank lines.
  2. Second pipe, sed 'N; s:\n:\t\t:; N; s:\n:\t\t:', combines the lines.

Output:

Name: John Doe1     address : somewhere     phone: 123-123-1234
Name: John Doe2     address : somewhere     phone: 123-123-1233
Name: John Doe3     address : somewhere     phone: 123-123-1232
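
For comparison, the same two steps (drop the blank lines, then join every three lines) can be sketched in Python. Like the sed command, this assumes every record has exactly three lines; `join_in_threes` and the sample list are illustrative names and data, not part of the original answer.

```python
def join_in_threes(lines, sep="\t\t"):
    """Drop blank lines, then join each run of three lines with `sep`."""
    # Step 1: remove blank lines (the first sed in the pipeline).
    kept = [line.strip() for line in lines if line.strip()]
    # Step 2: walk the remaining lines three at a time and join them.
    return [sep.join(kept[i:i + 3]) for i in range(0, len(kept), 3)]

# Hypothetical sample, shaped like lines read from the input file.
sample = [
    "Name: John Doe1\n", "address : somewhere\n", "phone: 123-123-1234\n",
    "\n",
    "Name: John Doe2\n", "address : somewhere\n", "phone: 123-123-1233\n",
]

for row in join_in_threes(sample):
    print(row)
```

If a record is missing a line, every row after it shifts, which is the same weakness the fixed N; N pattern has.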


0

In Python:

results = []
cur_item = None

with open('/root/docs/information') as f:
    for line in f:
        if not line.strip():
            continue  # skip the blank lines between records
        key, value = line.split(':', 1)
        key = key.strip()
        value = value.strip()

        if key == "Name":
            # a "Name" line starts a new record
            cur_item = {}
            results.append(cur_item)
        cur_item[key] = value

for item in results:
    print(item)

3 Comments

You should specify the language ;)
@sputnick I'm not quite sure I understand what you mean
Just say the language: It's Python.
