So I have a text table which looks like the following:

BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 9 10 11 12
1123 (0.392)    |0.351  0.037|
2341 (0.324)    |0.277  0.043|
2121 (0.176)    |0.016  0.164|
1121 (0.108)    |0.073  0.036|
Multiallelic Dprime: 0.591
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)

For each block, I only need the following information:

BLOCK1:
42 0.500
21 0.351
22 0.149

I don't have any problem parsing individual lines and extracting what I need. A list of lists is probably what I should aim for. My problem is that I cannot read the right number of lines for each block without getting an error at the end.

So I wrote this ugly code:

file = open('haplotypes_hetero.txt')

to_parse = []

for line in file:
        to_parse.append(line.strip())

to_parse_2=[]

for line in to_parse:
        line = line.split()
        to_parse_2.append(line)

for i in range(len(to_parse_2)):
        if to_parse_2[i][0]=='BLOCK':
                z=i
                if z < len(to_parse_2):
                        z+=1
                while to_parse_2[z][0] != 'BLOCK':
                        print to_parse_2[z][0]
                        z+=1
                        if z>len(to_parse_2):
                                z=0


file.close()

It kinda works and prints what it is supposed to. However, I am getting an error at the end.

42
21
22
Multiallelic
1123
2341
2121
1121
Multiallelic
13
34
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)

How do I get rid of the index error?

4 Answers

3

I think the problem is with this:

if z>len(to_parse_2):
      z=0

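because your program only checks whether z has become greater than the length of the list. When z is equal to the length of the list it is already one past the last valid index, so to_parse_2[z] raises the IndexError before z is ever reset. z should be reset as soon as it reaches the length of the list, so change those lines to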

if z >= len(to_parse_2):  # changed '>' to '>='
      z=0
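
If you would rather avoid the index bookkeeping altogether, here is a minimal sketch of another way to do it: walk the lines once and attach each haplotype line to the most recent BLOCK header. It is written for Python 3, and `parse_blocks` and `sample` are names I made up for illustration; they are not part of the question's code.

```python
def parse_blocks(lines):
    """Group 'haplotype (frequency)' lines under the preceding BLOCK header."""
    blocks = []  # each entry is [block_name, [[haplotype, frequency], ...]]
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines instead of indexing into an empty list
        if fields[0] == 'BLOCK':
            # 'BLOCK 1.' becomes 'BLOCK1'
            blocks.append(['BLOCK' + fields[1].rstrip('.'), []])
        elif blocks and len(fields) > 1 and fields[1].startswith('('):
            # '42 (0.500) |...' becomes ['42', '0.500']
            blocks[-1][1].append([fields[0], fields[1].strip('()')])
        # anything else (e.g. 'Multiallelic Dprime: ...') is ignored
    return blocks


# Two blocks copied from the table in the question
sample = """BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)"""

result = parse_blocks(sample.splitlines())
for name, rows in result:
    print(name, rows)
```

Because there is no inner while loop, nothing ever reads past the end of the list, so the IndexError cannot occur.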

2

Sorry, couldn't wait any longer..

>>> import re
>>> s='''BLOCK 1.  MARKERS: 1 2
... 42 (0.500)  |0.269  0.166   0.041   0.024|
... 21 (0.351)  |0.069  0.119   0.079   0.084|
... 22 (0.149)  |0.054  0.040   0.055   0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2.  MARKERS: 9 10 11 12
... 1123 (0.392)    |0.351  0.037|
... 2341 (0.324)    |0.277  0.043|
... 2121 (0.176)    |0.016  0.164|
... 1121 (0.108)    |0.073  0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3.  MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)'''
>>> re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))',s)
[('', '', 'BLOCK 1'), ('42', '0.500', ''), ('21', '0.351', ''), ('22', '0.149', ''), ('', '', 'BLOCK 2'), ('1123', '0.392', ''), ('2341', '0.324', ''), ('2121', '0.176', ''), ('1121', '0.108', ''), ('', '', 'BLOCK 3'), ('13', '0.716', ''), ('34', '0.284', '')]
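
The flat list of tuples still has to be assigned to blocks. A rough sketch of one way to do that, in Python 3: a tuple whose third element is non-empty starts a new block, and every other tuple belongs to the block seen most recently. The names `pattern`, `grouped` and `current` are mine, chosen for illustration.

```python
import re

# Two blocks copied from the table in the question
s = '''BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)'''

pattern = r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))'

grouped = {}    # block header -> list of (haplotype, frequency) tuples
current = None  # header of the block currently being filled
for haplotype, frequency, header in re.findall(pattern, s):
    if header:
        # a tuple like ('', '', 'BLOCK 1') opens a new block
        current = header
        grouped[current] = []
    elif current is not None:
        # a tuple like ('42', '0.500', '') belongs to the open block
        grouped[current].append((haplotype, frequency))

print(grouped)
```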

This:

file = open('haplotypes_hetero.txt')

to_parse = []

for line in file:
        to_parse.append(line.strip())

to_parse_2=[]

for line in to_parse:
        line = line.split()
        to_parse_2.append(line)

can be replaced with:

to_parse_2 = [ l.split() for l in open('haplotypes_hetero.txt').readlines() ]

I highly recommend learning Python's list comprehensions.
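
A slightly more defensive variant, as a sketch only: a `with` block closes the file automatically, and filtering out blank lines avoids empty lists, which would otherwise make `to_parse_2[i][0]` raise an IndexError of its own. `read_fields` is a helper name I made up; it is not from the question.

```python
def read_fields(path):
    """Return every non-blank line of the file as a list of whitespace-separated fields."""
    with open(path) as handle:
        # iterating over a file object yields one line at a time
        return [line.split() for line in handle if line.strip()]
```

It would be called as `to_parse_2 = read_fields('haplotypes_hetero.txt')`, assuming that file sits in the working directory.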

2

You can try something like this:

table='''\
BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 9 10 11 12
1123 (0.392)    |0.351  0.037|
2341 (0.324)    |0.277  0.043|
2121 (0.176)    |0.016  0.164|
1121 (0.108)    |0.073  0.036|
Multiallelic Dprime: 0.591
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)'''

import re

d={}
for title, block in re.findall(r'^(BLOCK \d+)\..*?\n(.*?)(?=^BLOCK|\Z)', table, flags=re.M | re.S):
    d[title]=[]
    for line in block.splitlines():
        # 'hap (freq)  |...' -> ('hap ', '(', 'freq')
        t=line.partition(')')[0].partition('(')
        try:
            d[title].append(map(float, [t[0], t[2]]))
        except ValueError:
            # lines such as 'Multiallelic Dprime: ...' are skipped
            pass

for k, v in d.items():
    print k,':',v

Prints:

BLOCK 1 : [[42.0, 0.5], [21.0, 0.351], [22.0, 0.149]]
BLOCK 2 : [[1123.0, 0.392], [2341.0, 0.324], [2121.0, 0.176], [1121.0, 0.108]]
BLOCK 3 : [[13.0, 0.716], [34.0, 0.284]]
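
The code above is Python 2: it uses print statements and relies on `map` returning a list. A sketch of the same approach for Python 3, where `map` returns a lazy iterator and has to be wrapped in `list()`, could look like this. The variable names `blocks`, `rows` and `body` are my own.

```python
import re

# Two blocks copied from the table in the question
table = '''\
BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)'''

blocks = {}
block_re = r'^(BLOCK \d+)\..*?\n(.*?)(?=^BLOCK|\Z)'
for title, body in re.findall(block_re, table, flags=re.M | re.S):
    rows = []
    for line in body.splitlines():
        # 'hap (freq)  |...' -> hap = 'hap ', freq = 'freq'
        hap, _, freq = line.partition(')')[0].partition('(')
        try:
            rows.append(list(map(float, [hap, freq])))
        except ValueError:
            pass  # not a 'haplotype (frequency)' line
    blocks[title] = rows

for title, rows in blocks.items():
    print(title, ':', rows)
```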

2 Comments

Seems like the last two values are repeated.
I will definitely try that. The problem is that I am not familiar with regex. So I will use @Chiyaan's suggestion to solve my problem, and will learn about regex later today. Thanks dude!
1

You don't need anything complex for this kind of problem; you can use a regex:

>>> s="""BLOCK 1.  MARKERS: 1 2
... 42 (0.500)  |0.269  0.166   0.041   0.024|
... 21 (0.351)  |0.069  0.119   0.079   0.084|
... 22 (0.149)  |0.054  0.040   0.055   0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2.  MARKERS: 9 10 11 12
... 1123 (0.392)    |0.351  0.037|
... 2341 (0.324)    |0.277  0.043|
... 2121 (0.176)    |0.016  0.164|
... 1121 (0.108)    |0.073  0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3.  MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)"""
>>> import re
>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)
>>> [(i[-2],re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0])) for i in l]
[('BLOCK 1.', [('42', '0.500'), ('21', '0.351'), ('22', '0.149')]), ('BLOCK 2.', [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')]), ('BLOCK 3.', [('13', '0.716'), ('34', '0.284')])]

First you need to extract the blocks, which you can do with the following regex and re.findall:

>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)

Then you can use r'(\d+)\s+\(([\d.]+)\)' to match a number followed by one or more whitespace characters and then a combination of digits and dots inside parentheses.

As a side note, ((?!BLOCK).)* matches any run of characters that doesn't contain the word BLOCK, because the negative lookahead (?!BLOCK) is checked before each character is consumed. To read more about look-arounds in regular expressions, I suggest checking http://www.regular-expressions.info/lookaround.html
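
To see what the lookahead buys you, here is a small Python 3 sketch comparing a plain greedy `.*` with the `(?!BLOCK).` construct. The `text` string is a shortened sample I made up from the table, and `greedy` and `tempered` are illustrative names.

```python
import re

# shortened sample: two block headers with one haplotype line each
text = "BLOCK 1.  MARKERS: 1 2\n42 (0.500)\nBLOCK 3.  MARKERS: 13 14\n13 (0.716)"

# With DOTALL, a plain '.*' runs to the end of the string and swallows the second block
greedy = re.findall(r'BLOCK \d+\..*', text, re.DOTALL)

# '(?:(?!BLOCK).)*' stops right before the next occurrence of 'BLOCK'
tempered = re.findall(r'BLOCK \d+\.(?:(?!BLOCK).)*', text, re.DOTALL)

print(len(greedy))    # 1
print(len(tempered))  # 2
```

I used a non-capturing group `(?:...)` here so that `re.findall` returns whole matches rather than tuples of groups.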

Also, instead of a list comprehension you can use a dictionary comprehension:

>>> {i[-2]:re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0]) for i in l}

{'BLOCK 1.': [('42', '0.500'), ('21', '0.351'), ('22', '0.149')], 
 'BLOCK 2.': [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')], 
 'BLOCK 3.': [('13', '0.716'), ('34', '0.284')]}

1 Comment

Yeah, it is great. My problem is that I need to assign each line to its corresponding block. How do I do that?
