How to parse data with binary elements into a list of lists in Python?

Question

Sample looks like this:

lst = ['ms 20 3 -s 10 \n', '17954 11302 58011\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0706 0.2241 0.2575 0.889 \n', '0001000010\n', '0101000010\n', '0101010010\n', '0001000010\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0038 0.1622 0.1972 \n', '0110000110\n', '1001001000\n', '0010000110\n', '$$\n', 'segsites: 10\n', 'positions: 0.0155 0.0779 0.2092 \n', '0000001011\n', '0000001011\n', '0000001011\n']

Each new set starts with $$. I need to parse the data into the following list of lists:

sample = [['0001000010', '0101000010', '0101010010', '0001000010'], ['0110000110', '1001001000', '0010000110'], ['0000001011', '0000001011', '0000001011']] # Required Output

Code attempted

sample =[[]]
sample1 = ""
seqlist = []

for line in lst: 
    if line.startswith("$$"):
        if line in '01': #Line contains only 0's or 1
          sample1.append(line) #Append each line that with 1 and 0's in a string one after another
    sample.append(sample1.strip()) #Do this or last line is lost
print sample

Output:[[], '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

I am new to parsing data and am trying to figure out how to get this right. Suggestions on how to modify the code, along with an explanation, are appreciated.

Darkstarone · Accepted Answer · 2016-11-07 05:34:30Z

1

I'd do it in the following way:

import re

lst = ['ms 20 3 -s 10 \n', '17954 11302 58011\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0706 0.2241 0.2575 0.889 \n', '0001000010\n', '0101000010\n', '0101010010\n', '0001000010\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0038 0.1622 0.1972 \n', '0110000110\n', '1001001000\n', '0010000110\n', '$$\n', 'segsites: 10\n', 'positions: 0.0155 0.0779 0.2092 \n', '0000001011\n', '0000001011\n', '0000001011\n']

result = []
curr_group = []
for item in lst:
    item = item.rstrip() # Remove \n
    if '$$' in item:
        if len(curr_group) > 0: # Check to see if binary numbers have been found.
            result.append(curr_group)
            curr_group = []
    elif re.match('[01]+$', item): # Checks to see if string is binary (0s or 1s).
        curr_group.append(item)

result.append(curr_group) # Appends final group due to lack of ending '$$'. 

print(result)

Basically, you want to iterate through the items until you find '$$', then add any binary strings you've collected so far to your final result as one group, and start a new group. Every binary string you find (using the regex) should be added to the current group.

Finally, you need to add the last group of binary strings after the loop, since there is no trailing '$$'.
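The same loop can be wrapped in a function so the separator becomes a parameter. This is only a sketch: the name group_binary_lines and its separator argument are made up for illustration, and it differs from the code above in three small ways that are assumptions on my part. It compares the stripped line to the separator for equality instead of using `in`, it uses re.fullmatch instead of a trailing `$`, and it skips the final group when it is empty.

```python
import re


def group_binary_lines(lines, separator="$$"):
    """Collect lines made only of 0s and 1s, starting a new group at each separator.

    `group_binary_lines` and `separator` are hypothetical names for this sketch.
    """
    groups = []
    current = []
    for line in lines:
        text = line.strip()  # drop the trailing newline and stray spaces
        if text == separator:
            if current:  # only keep groups that actually hold binary strings
                groups.append(current)
            current = []
        elif re.fullmatch(r"[01]+", text):  # the whole line must be 0s and 1s
            current.append(text)
    if current:  # the last group has no separator after it
        groups.append(current)
    return groups


lines = ['ms 20 3 -s 10 \n', '$$\n', 'segsites: 10\n', '0001\n', '0101\n',
         '\n', '$$\n', '0110\n']
print(group_binary_lines(lines))
```

With a different separator, only the argument changes, for example group_binary_lines(lines, separator='@@'); the regex stays the same.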

answered Nov 7, 2016 at 5:34

Darkstarone


6 Comments

biogeek Over a year ago

I seem to have trouble making it work for my original data set though. eval.in/673188

biogeek Over a year ago

In my original data set, the separators ($$) are different. Once I change the type of separators, the output collapses.

Darkstarone Over a year ago

Sorry was making dinner - shouldn't you just be able to change the line: if '$$' in item: so that it looks for a different type of separator, e.g. if your separator was '@@' it would look like: if '@@' in item:. Or is your problem more nuanced?

biogeek Over a year ago

That's alright. I did exactly that, but strangely it did not work: eval.in/673188

Darkstarone Over a year ago

Ah, just saw your link. Your issue is that you assumed the $ in the regex pattern was related to your separator. That isn't the case: eval.in/673215. In a regex pattern, $ anchors the match to the end of the string, so everything before it must run all the way to the end or the pattern won't match. This stops a match on strings like '01010NOTBINARY'.
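A quick way to see that the `$` in the pattern has nothing to do with the '$$' separator is to try the pattern on a few strings. This is a small illustrative check, not the code from the linked paste; the variable names are made up for the example.

```python
import re

pattern = '[01]+$'

# A string of only 0s and 1s matches.
binary_ok = bool(re.match(pattern, '0101000010'))

# The $ anchor rejects a string that starts with binary digits but continues.
mixed_with_anchor = bool(re.match(pattern, '01010NOTBINARY'))

# Without the anchor, the leading '01010' is enough for a match.
mixed_without_anchor = bool(re.match('[01]+', '01010NOTBINARY'))

# A separator line such as '@@' is simply not binary; the pattern is unaffected.
separator_line = bool(re.match(pattern, '@@'))

# Note: $ also matches just before a trailing newline, re.fullmatch does not.
newline_with_dollar = bool(re.match(pattern, '0101\n'))
newline_with_fullmatch = bool(re.fullmatch('[01]+', '0101\n'))

print(binary_ok, mixed_with_anchor, mixed_without_anchor)
print(separator_line, newline_with_dollar, newline_with_fullmatch)
```

So when the separator changes, only the separator test should change; the regex that detects binary strings should be left alone.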


Right leg · Accepted Answer · 2016-11-07 05:33:59Z

1

Your problem is (at least) here: if line in '01'.

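This condition tests whether line is a substring of '01', so it is true only when line is '', '0', '1' or '01'. That is absolutely not what you want, and since every item in your list ends with '\n', none of them passes the test.

A basic but working approach would be to test, for every string, whether it is composed only of 0 and 1: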

def is_binary(string):
    # Return False as soon as a character other than '0' or '1' is found.
    for c in string:
        if c not in '01':
            return False
    return True

This function returns True if string can be interpreted as a binary value, False if not.

Of course you have to manage that '\n' at the end, but you got the main idea ;)
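As a rough sketch of what "managing the '\n'" could look like: strip each line before testing it, and skip lines that are empty after stripping, because a character-by-character check accepts the empty string. The sample list and the name binary_lines are made up for this example, and is_binary is restated with all() so the snippet runs on its own.

```python
def is_binary(string):
    # Same check as above, written with all(): every character must be 0 or 1.
    # Note that this is True for the empty string, hence the guard below.
    return all(c in '01' for c in string)


lines = ['segsites: 10\n', '0001000010\n', '\n', '0101000010\n']

# `in` on two strings is a substring test, so a whole line is never "in" '01'.
substring_test = '0001000010\n' in '01'

# Strip the newline first, drop empty lines, then keep only the binary ones.
binary_lines = [s.strip() for s in lines if s.strip() and is_binary(s.strip())]

print(substring_test)
print(binary_lines)
```

This only filters the lines; splitting them into groups at each '$$' still has to be done separately.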

answered Nov 7, 2016 at 5:33

Right leg
