How to make a list of lists in Python when it has multiple separators?

Question

The sample file looks like this (all on one line, wrapped for legibility):

 ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
  '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
  '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
  '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
  '$$$\n', '\n',
  '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
  '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
  '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
  '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
  '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
  '>B42\n', 'TT-GTGGGTATC\n']

The $$$ separates the two sets. I need to use the .strip function to remove the \n, and I also need to remove all the "headers".

I need to make a list of lists (as below) and replace "-" with Z (again, all on one line; wrapped here for legibility):

  [['TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC',
    'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC'],
   ['ATCGGGGGTATT', 'TT-GTGGGAATC', 'TTCGTGGGAATC', 'TT-GTGGGTATC',
    'TTCGTGGGTATT', 'TTCGGGGGTATC', 'TT-GTGGGTATC', 'TTCGGGGGAATC',
    'TTCGGGGGTATC', 'TTCGGGGGTATC', 'TT-GTGGGTATC']]

If you are dealing with (a variant of) FASTA format, you'd simplify life for everyone if you mentioned this.
– tripleee, Commented Oct 10, 2016 at 18:51
Some quotes and the B in B3 seem to be missing; can you please proofread the examples?
– tripleee, Commented Oct 10, 2016 at 18:52
No, I put it that way because when we get research files, the headers can be pretty muddled up, so the code should not be header-specific.
– Rspacer, Commented Oct 10, 2016 at 18:55

Moses Koledoye · Accepted Answer · 2016-10-10 18:55:57Z

2

You can exploit the shorter length of the headers (and of the other unwanted items) as the criterion for filtering them out. Start with a list that contains one empty list, and append each item that passes the length test to that inner list.

When the separator '$$$' is reached, a new sublist is added to the result, and the same length test then adds the remaining items to this new sublist:

lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n', '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n', '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n', '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n', '$$$\n', '\n', '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n', '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n', '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n', '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n', '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n','>B42\n', 'TT-GTGGGTATC\n']

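result = [[]]  # start with one empty sublist
for x in lst:
    # Sequences are 13 characters long (including '\n'); headers,
    # blank lines and the separator are all shorter than 6.
    if len(x) > 6:
        result[-1].append(x.strip())
    # The separator starts a new sublist.
    if x.startswith('$$$'):
        result.append([])
print(result)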
# [['TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC'], ['ATCGGGGGTATT', 'TT-GTGGGAATC', 'TTCGTGGGAATC', 'TT-GTGGGTATC', 'TTCGTGGGTATT', 'TTCGGGGGTATC', 'TT-GTGGGTATC', 'TTCGGGGGAATC', 'TTCGGGGGTATC', 'TTCGGGGGTATC', 'TT-GTGGGTATC']]
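
The output above still contains "-", whereas the question also asks for "-" to be replaced with Z. A minimal sketch of one way to add that step afterwards, assuming `result` has the nested shape printed above. The shortened `result` and the name `replaced` are illustrative only, not part of the original answer:

```python
# Shortened stand-in for the nested result printed above (illustrative data only).
result = [['TCCGGGGGTATC', 'TCCGTGGGTATC'],
          ['ATCGGGGGTATT', 'TT-GTGGGAATC']]

# Rebuild each sublist, replacing every "-" with "Z".
replaced = [[seq.replace('-', 'Z') for seq in group] for group in result]
print(replaced)
# [['TCCGGGGGTATC', 'TCCGTGGGTATC'], ['ATCGGGGGTATT', 'TTZGTGGGAATC']]
```

The same `.replace('-', 'Z')` call could instead be applied inside the loop at the point where each item is appended.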

edited Oct 10, 2016 at 18:55

answered Oct 10, 2016 at 18:50

Moses Koledoye


3 Comments

Rspacer Over a year ago

Instead of 6, can the code just consider the length of the element? E.g. [TACG] would be 4.

tripleee Over a year ago

If all the headers start with > that will be a much better criterion, though you'll apparently also want to discard strings which are empty after stripping.

Moses Koledoye Over a year ago

5 is the maximum length of every other item, so a threshold of 6 guarantees that only your required strings are added.
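
If a fixed length threshold feels fragile for short sequences such as TACG, one alternative is to test which characters a line contains rather than how long it is. This is only a sketch and not something proposed in the thread: the alphabet `ACGT-` and the helper name `is_sequence` are assumptions, and real data may need more symbols.

```python
ALLOWED = set('ACGT-')  # assumed alphabet; extend it for your own data

def is_sequence(line):
    """Return True if the stripped line is non-empty and uses only ALLOWED characters."""
    s = line.strip()
    return bool(s) and set(s) <= ALLOWED

lst = ['>1\n', 'TACG\n', '\n', '$$$\n', '\n', '>B1\n', 'TT-G\n']

result = [[]]
for x in lst:
    if x.startswith('$$$'):
        # The separator starts a new sublist.
        result.append([])
    elif is_sequence(x):
        result[-1].append(x.strip())
print(result)
# [['TACG'], ['TT-G']]
```

Headers are rejected here because they contain characters such as `>` or digits; a header made up solely of allowed letters would slip through, so this check is no more header-proof than the length test.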

tripleee · Accepted Answer · 2016-10-10 19:02:46Z

1

Here is a variation of Moses Koledoye's answer which examines the first character for > and discards any matches as well as any empty elements. I also included replacing "-" with "Z".

lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
   '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
   '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
   '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
   '$$$\n', '\n',
   '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
   '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
   '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
   '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
   '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
   '>B42\n', 'TT-GTGGGTATC\n']

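result = [[]]  # start with one empty sublist
for x in lst:
    # Skip header lines.
    if x.startswith('>'):
        continue
    # The separator starts a new sublist.
    if x.startswith('$$$'):
        result.append([])
        continue
    x = x.strip()
    # Skip elements that are empty after stripping.
    if x:
        result[-1].append(x.replace("-", "Z"))
print(result)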

This avoids assigning any particular significance to the length of any element.
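
The same idea can also be written with `itertools.groupby`, which may be convenient when the separator occurs more than once. This is a sketch under the same assumptions as the loop above (headers start with `>`, the separator line starts with `$$$`); the function name `split_groups` and the shortened `lst` are hypothetical:

```python
from itertools import groupby

def split_groups(lines, separator='$$$'):
    """Split lines into groups at each separator line, dropping headers and blanks."""
    cleaned = (line.strip() for line in lines)
    # Drop empty lines and '>' headers up front.
    kept = (s for s in cleaned if s and not s.startswith('>'))
    # groupby yields runs of consecutive items that share the same key:
    # True for separator lines, False for sequence lines.
    return [
        [s.replace('-', 'Z') for s in run]
        for is_sep, run in groupby(kept, key=lambda s: s.startswith(separator))
        if not is_sep
    ]

lst = ['>1\n', 'TCCG\n', '>2\n', 'TCGG\n', '\n', '$$$\n', '\n', '>B1\n', 'TT-G\n']
print(split_groups(lst))
# [['TCCG', 'TCGG'], ['TTZG']]
```

One behavioral difference from the loop: consecutive separators, or a separator at the very start or end, do not produce empty sublists here.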

answered Oct 10, 2016 at 19:02

tripleee


2 Comments

Rspacer Over a year ago

Just an extension of this problem: if I need a single list where Set1 and Set2 are all in one list, ['TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', ...], is this how my code would look? result = [] for x in lst: if x.startswith('>'): result.append([]) continue x = x.strip() if x: result[-1].append(x.replace("-", "Z")) print(result)

tripleee Over a year ago

You can't really post Python code in comments. But this is obviously not hard. Post a new question if you can't figure it out (link back to this one for context).
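
For the flat-list variant asked about in the first comment, a minimal sketch could look like the following. It assumes, as in the answer, that headers start with `>` and the separator line starts with `$$$`; the name `flat` and the shortened `lst` are illustrative only:

```python
lst = ['>1\n', 'TCCG\n', '\n', '$$$\n', '\n', '>B1\n', 'TT-G\n']

flat = []
for x in lst:
    x = x.strip()
    # Skip blanks, '>' headers, and the '$$$' separator.
    if not x or x.startswith('>') or x.startswith('$$$'):
        continue
    flat.append(x.replace('-', 'Z'))
print(flat)
# ['TCCG', 'TTZG']
```

By contrast, appending a new empty list at every header, as in the snippet from the comment, would yield one sublist per header rather than a single flat list.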
