The sample file looks like this (all on one line, wrapped for legibility):

 ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
  '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
  '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
  '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
  '$$$\n', '\n',
  '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
  '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
  '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
  '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
  '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
  '>B42\n', 'TT-GTGGGTATC\n']

The $$$ line separates the two sets. I need to use the .strip() method to remove the \n from each item, and I need to drop all the "headers".

I need to make a list of lists (as below) and replace "-" with "Z" (again, all on one line; wrapped here for legibility):

  [['TCCGGGGGTATC','TCCGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC',
    'TCCGTGGGTATC',CGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC'],
   ['ATCGGGGGTATT', 'TT-GTGGGAATC','TTCGTGGGAATC', 'TT-GTGGGTATC',
    'TTCGTGGGTATT', 'TTCGGGGGTATC','TT-GTGGGTATC', 'TTCGGGGGAATC',
    'TTCGGGGGTATC', 'TTCGGGGGTATC','TT-GTGGGTATC]]
  • Where are you stuck? Commented Oct 10, 2016 at 18:47
  • If you are dealing with (a variant of) FASTA format, you'd simplify life for everyone if you mentioned this. Commented Oct 10, 2016 at 18:51
  • Some quotes and the B in B3 seems to be missing, can you please proofread the examples? Commented Oct 10, 2016 at 18:52
  • No, I put it that way because when we get research files, the headers can be pretty muddled up .. so the codes should not be header specific Commented Oct 10, 2016 at 18:55
  • You still lack quotes in the expected result. Commented Oct 10, 2016 at 19:01

2 Answers

You can exploit the smaller length of the headers (and other unwanted items) as the criterion to filter them out. Start with a list containing one empty list, then append each item that passes the length test to the last inner list.

A new sublist is added to the resulting list when the separator '$$$' is reached, and the length test is again used to add the remaining items to this new sublist:

lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n', '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n', '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n', '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n', '$$$\n', '\n', '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n', '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n', '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n', '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n', '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n','>B42\n', 'TT-GTGGGTATC\n']

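result = [[]]  # start with one empty sublist for the first set
for x in lst:
    # sequences are longer than any header, blank line, or the separator
    if len(x) > 6:
        result[-1].append(x.strip())
    # the '$$$' separator starts a new sublist for the next set
    if x.startswith('$$$'):
        result.append([])
print(result)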
# [['TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC'], ['ATCGGGGGTATT', 'TT-GTGGGAATC', 'TTCGTGGGAATC', 'TT-GTGGGTATC', 'TTCGTGGGTATT', 'TTCGGGGGTATC', 'TT-GTGGGTATC', 'TTCGGGGGAATC', 'TTCGGGGGTATC', 'TTCGGGGGTATC', 'TT-GTGGGTATC']]
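
The question also asks for "-" to be replaced with "Z", which the loop above leaves out. A minimal sketch of one way to add that step, using a shortened, made-up `sample` list (an assumption for illustration, not the real file) and keeping the same length test:

```python
# Shortened, hypothetical sample in the same shape as the question's data.
sample = ['>1\n', 'TCCGGGGGTATC\n', '\n', '$$$\n', '\n', '>B2\n', 'TT-GTGGGAATC\n']

result = [[]]
for x in sample:
    # Same length test as above; the threshold 6 only suits data whose
    # sequences are longer than every header, as in this sample.
    if len(x) > 6:
        result[-1].append(x.strip().replace('-', 'Z'))
    if x.startswith('$$$'):
        result.append([])
print(result)
# [['TCCGGGGGTATC'], ['TTZGTGGGAATC']]
```

Chaining `.replace()` after `.strip()` works because both return new strings; the order of the two calls does not matter here.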

3 Comments

Instead of 6, can the code just consider the length of the element? E.g. [TACG] would be 4.
If all the headers start with > that will be a much better criterion, though you'll apparently also want to discard strings which are empty after stripping.
5 is the maximum length of every other item (newline included), so requiring a length greater than 6 guarantees that only your required strings are added.

Here is a variation of Moses Koledoye's answer which examines the first character for > and discards any matches as well as any empty elements. I also included replacing "-" with "Z".

lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
   '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
   '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
   '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
   '$$$\n', '\n',
   '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
   '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
   '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
   '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
   '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
   '>B42\n', 'TT-GTGGGTATC\n']

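result = [[]]  # one sublist per set, starting with the first
for x in lst:
    if x.startswith('>'):
        continue  # skip header lines
    if x.startswith('$$$'):
        result.append([])  # the separator starts the next set
        continue
    x = x.strip()
    if x:  # ignore lines that are empty after stripping
        result[-1].append(x.replace("-", "Z"))
print(result)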

This avoids assigning any particular significance to the length of any element.
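
If this needs to be reused, the same loop could be wrapped in a function. The sketch below is only one possible shape: the name `split_sets` and its parameters are hypothetical, and the `sample` list is a shortened stand-in for the real data.

```python
def split_sets(lines, separator="$$$", gap="-", fill="Z"):
    """Group sequence lines into one list per separator-delimited set.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    result = [[]]
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and headers
        if line.startswith(separator):
            result.append([])  # the separator starts the next set
            continue
        result[-1].append(line.replace(gap, fill))
    return result


# Shortened, made-up sample in the same shape as the question's data.
sample = [">1\n", "TCCG\n", "\n", "$$$\n", "\n", ">B1\n", "TT-G\n", ">3\n", "TTCG\n"]
grouped = split_sets(sample)
print(grouped)
# [['TCCG'], ['TTZG', 'TTCG']]
```

Because the function accepts any iterable of strings, an open file object should work in place of the list, for example `with open(path) as fh: grouped = split_sets(fh)`, where `path` stands for wherever the file actually lives.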

2 Comments

Just an extension of this problem. If I need just a list where Set1 and Set2 is all in one list ['TCCGGGGGTATC','TCCGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC', 'TCCGTGGGTATC',CGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC'.. ], then is this how my code would look like? result = [] for x in lst: if x.startswith('>'): result.append([]) continue x = x.strip() if x: result[-1].append(x.replace("-", "Z")) print(result)
You can't really post Python code in comments. But this is obviously not hard. Post a new question if you can't figure it out (link back to this one for context).
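
For the follow-up about a single flat list: the loop in the comment above would create a new sublist at every header and would also keep the '$$$' line as an item. A minimal sketch of one alternative, using a shortened, made-up `sample` list (an assumption, not the real file), is to append to one plain list and skip the separator as well:

```python
# Shortened, hypothetical sample in the same shape as the question's data.
sample = ['>1\n', 'TCCG\n', '\n', '$$$\n', '\n', '>B1\n', 'TT-G\n', '>3\n', 'TTCG\n']

flat = []
for x in sample:
    x = x.strip()
    # Skip blank lines, headers, and the set separator.
    if not x or x.startswith('>') or x.startswith('$$$'):
        continue
    flat.append(x.replace('-', 'Z'))
print(flat)
# ['TCCG', 'TTZG', 'TTCG']
```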
