Parsing messed up text table in python

Question

So I have a text table which looks like the following:

BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 9 10 11 12
1123 (0.392)    |0.351  0.037|
2341 (0.324)    |0.277  0.043|
2121 (0.176)    |0.016  0.164|
1121 (0.108)    |0.073  0.036|
Multiallelic Dprime: 0.591
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)

For each block, I only need the following information:

BLOCK1:
42 0.500
21 0.351
22 0.149

I don't have any problem parsing individual lines and extracting what I need. A list of lists is probably what I should aim for. My problem is that I cannot read the exact number of lines for each block without getting an error at the end.

So I wrote this ugly code:

file = open('haplotypes_hetero.txt')

to_parse = []

for line in file:
        to_parse.append(line.strip())

to_parse_2=[]

for line in to_parse:
        line = line.split()
        to_parse_2.append(line)

for i in range(len(to_parse_2)):
        if to_parse_2[i][0]=='BLOCK':
                z=i
                if z < len(to_parse_2):
                        z+=1
                while to_parse_2[z][0] != 'BLOCK':
                        print to_parse_2[z][0]
                        z+=1
                        if z>len(to_parse_2):
                                z=0


file.close()

It kinda works and prints what it's supposed to. However, I am getting an error at the end:

42
21
22
Multiallelic
1123
2341
2121
1121
Multiallelic
13
34
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)

How do I get rid of the index error?

Chiyaan Suraj · Accepted Answer · 2015-04-10 16:25:16Z

3

I think the problem is with this

if z>len(to_parse_2):
      z=0

because your program only resets z once it is greater than the length of the list. When z becomes equal to the length of the list, the check does not fire, and the next to_parse_2[z] lookup is already out of range, which raises the IndexError. So change those lines to

if z >= len(to_parse_2) : #changed '>' to >=
      z=0
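If you would rather avoid the index bookkeeping altogether, here is a minimal sketch (written for Python 3, and assuming the file looks exactly like the sample in the question) that walks the lines once and remembers which block it is in. The name `group_by_block` is made up for illustration.

```python
def group_by_block(lines):
    """Collect (haplotype, frequency) pairs, one inner list per BLOCK header."""
    blocks = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue                      # ignore blank lines
        if tokens[0] == 'BLOCK':
            blocks.append([])             # a header starts a new block
        elif blocks and tokens[0] != 'Multiallelic':
            # tokens[1] looks like "(0.500)"; drop the parentheses
            blocks[-1].append((tokens[0], tokens[1].strip('()')))
    return blocks


sample = """BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
Multiallelic Dprime: 0.295
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)
"""

blocks = group_by_block(sample.splitlines())
for number, rows in enumerate(blocks, start=1):
    print(number, rows)
```

Because there is no second index that runs ahead of the loop, there is nothing that can run past the end of the list.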


Kashyap · Accepted Answer · 2015-04-10 16:33:52Z

Sorry, couldn't wait any longer..

>>> import re
>>> s='''BLOCK 1.  MARKERS: 1 2
... 42 (0.500)  |0.269  0.166   0.041   0.024|
... 21 (0.351)  |0.069  0.119   0.079   0.084|
... 22 (0.149)  |0.054  0.040   0.055   0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2.  MARKERS: 9 10 11 12
... 1123 (0.392)    |0.351  0.037|
... 2341 (0.324)    |0.277  0.043|
... 2121 (0.176)    |0.016  0.164|
... 1121 (0.108)    |0.073  0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3.  MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)'''
>>> re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))',s)
[('', '', 'BLOCK 1'), ('42', '0.500', ''), ('21', '0.351', ''), ('22', '0.149', ''), ('', '', 'BLOCK 2'), ('1123', '0.392', ''), ('2341', '0.324', ''), ('2121', '0.176', ''), ('1121', '0.108', ''), ('', '', 'BLOCK 3'), ('13', '0.716', ''), ('34', '0.284', '')]
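To assign each line to its block, the flat list of tuples can be folded into a dictionary. This is only a sketch for Python 3, reusing the same pattern as above; `fold_matches` and the shortened `text` sample are names I made up for the example.

```python
import re

# Same pattern as above: either "number (frequency)" or a "BLOCK n" header.
PATTERN = re.compile(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))')


def fold_matches(text):
    """Map each BLOCK header to the (haplotype, frequency) pairs that follow it."""
    blocks = {}
    current = None
    for haplotype, frequency, header in PATTERN.findall(text):
        if header:
            current = header              # a header tuple looks like ('', '', 'BLOCK 1')
            blocks[current] = []
        elif current is not None:
            blocks[current].append((haplotype, frequency))
    return blocks


text = """BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 9 10
13 (0.716)
"""

folded = fold_matches(text)
print(folded)
```

The third tuple element is non-empty only for header matches, which is what makes the fold possible.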

This:

file = open('haplotypes_hetero.txt')

to_parse = []

for line in file:
        to_parse.append(line.strip())

to_parse_2=[]

for line in to_parse:
        line = line.split()
        to_parse_2.append(line)

can be replaced with:

to_parse_2 = [ l.split() for l in open('haplotypes_hetero.txt').readlines() ]

I highly recommend learning Python's list comprehensions.
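A slightly more defensive variant, sketched for Python 3: a `with` block closes the file even if something fails, and blank lines are skipped so that indexing `[0]` on a row cannot hit an empty list. `read_tokens` is a hypothetical helper name, and the temporary file only stands in for haplotypes_hetero.txt.

```python
import os
import tempfile


def read_tokens(path):
    """Return one list of whitespace-separated tokens per non-blank line."""
    with open(path) as handle:
        return [line.split() for line in handle if line.strip()]


# Demo on a throwaway file so the snippet runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("BLOCK 3.  MARKERS: 13 14\n13 (0.716)\n\n34 (0.284)\n")

rows = read_tokens(tmp.name)
os.remove(tmp.name)
print(rows)
```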

dawg · Accepted Answer · 2015-04-12 18:48:33Z

2

You can try something like this:

table='''\
BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 9 10 11 12
1123 (0.392)    |0.351  0.037|
2341 (0.324)    |0.277  0.043|
2121 (0.176)    |0.016  0.164|
1121 (0.108)    |0.073  0.036|
Multiallelic Dprime: 0.591
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)'''

import re

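d={}
# Each match is (block title, block body); the lookahead ends the body
# at the next BLOCK header or at the end of the text.
for title, block in re.findall(r'^(BLOCK \d+)\..*?\n(.*?)(?=^BLOCK|\Z)', table, flags=re.M | re.S):
    d[title]=[]
    for line in block.splitlines():
        # Keep the text before '(' and the text between the parentheses.
        t=line.partition(')')[0].partition('(')
        try:
            d[title].append(map(float, [t[0], t[2]]))
        except ValueError:
            # Non-numeric lines such as "Multiallelic Dprime: ..." are skipped.
            pass

# Sort so the blocks print in order; a plain dict does not guarantee it here.
for k, v in sorted(d.items()):
    print k,':',v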

Prints:

BLOCK 1 : [[42.0, 0.5], [21.0, 0.351], [22.0, 0.149]]
BLOCK 2 : [[1123.0, 0.392], [2341.0, 0.324], [2121.0, 0.176], [1121.0, 0.108]]
BLOCK 3 : [[13.0, 0.716], [34.0, 0.284]]
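One caveat if you run this on Python 3: there `map` is lazy, so the `ValueError` would not be raised inside the `try` and the non-numeric lines would not be filtered out. A rough Python 3 sketch of the same idea, with the conversion done eagerly, could look like this (`parse_blocks` and the shortened `sample` are illustrative names, not part of the original answer):

```python
import re

BLOCK_RE = re.compile(r'^(BLOCK \d+)\..*?\n(.*?)(?=^BLOCK|\Z)', flags=re.M | re.S)


def parse_blocks(table):
    """Return {block title: [[haplotype, frequency], ...]} with float values."""
    result = {}
    for title, body in BLOCK_RE.findall(table):
        rows = []
        for line in body.splitlines():
            name, _, rest = line.partition('(')
            frequency = rest.partition(')')[0]
            try:
                # float() is called right here, so bad lines fail inside the try
                rows.append([float(name), float(frequency)])
            except ValueError:
                continue                  # e.g. "Multiallelic Dprime: 0.295"
        result[title] = rows
    return result


sample = """BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
Multiallelic Dprime: 0.295
BLOCK 3.  MARKERS: 13 14
13 (0.716)
"""

parsed = parse_blocks(sample)
for title, rows in parsed.items():
    print(title, ':', rows)
```

Converting the haplotype label to a float mirrors the output above; keeping it as a string may suit the data better.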

edited Apr 12, 2015 at 18:48

answered Apr 10, 2015 at 16:40

2 Comments

YKY Over a year ago

seems like two last values are repeated.

YKY Over a year ago

I will definitely try that. The problem is that I am not familiar with regex. So I will use @Chiyaan suggestion to solve my problem. And will learn about regex later today. Thanks dude!

Kasravnd · Accepted Answer · 2015-04-10 16:56:34Z

You don't need anything complicated for a problem like this; you can use a regex:

>>> import re
>>> s="""BLOCK 1.  MARKERS: 1 2
... 42 (0.500)  |0.269  0.166   0.041   0.024|
... 21 (0.351)  |0.069  0.119   0.079   0.084|
... 22 (0.149)  |0.054  0.040   0.055   0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2.  MARKERS: 9 10 11 12
... 1123 (0.392)    |0.351  0.037|
... 2341 (0.324)    |0.277  0.043|
... 2121 (0.176)    |0.016  0.164|
... 1121 (0.108)    |0.073  0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3.  MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)"""
>>> 
>>> 
>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)
>>> [(i[-2],re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0])) for i in l]
[('BLOCK 1.', [('42', '0.500'), ('21', '0.351'), ('22', '0.149')]), ('BLOCK 2.', [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')]), ('BLOCK 3.', [('13', '0.716'), ('34', '0.284')])]

First you need to extract the blocks, which you can do with the following regex and re.findall:

>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)

Then you can use r'(\d+)\s+\(([\d.]+)\)' to match a number followed by one or more whitespace characters and then a combination of digits and dots inside parentheses.

As a side note, ((?!BLOCK).)* matches the longest run of characters that does not contain the word BLOCK. To read more about look-around in regular expressions, I suggest http://www.regular-expressions.info/lookaround.html.
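To see what the negative lookahead buys you, here is a small Python 3 comparison on a toy string (the `text` value is invented for the demo): a plain greedy `.*` swallows everything up to the end, while the lookahead version stops in front of each new BLOCK.

```python
import re

text = "BLOCK 1. aaa\nbbb\nBLOCK 2. ccc"

# Greedy dot with DOTALL: one match that runs to the end of the string.
greedy = re.findall(r'BLOCK \d+\..*', text, re.DOTALL)

# Each character is consumed only if "BLOCK" does not start at that position.
tempered = re.compile(r'BLOCK \d+\.((?!BLOCK).)*', re.DOTALL)
chunks = [match.group(0) for match in tempered.finditer(text)]

print(len(greedy), len(chunks))
print(chunks)
```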

Also, instead of a list comprehension you can use a dictionary comprehension:

>>> {i[-2]:re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0]) for i in l}

{'BLOCK 1.': [('42', '0.500'), ('21', '0.351'), ('22', '0.149')], 
 'BLOCK 2.': [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')], 
 'BLOCK 3.': [('13', '0.716'), ('34', '0.284')]}

Yeah, it is great. My problem is that I need to assign each line to its corresponding block. How do I do that?
