I'm new to Python, and I'm trying to filter a list of DNA sequences (strings) based on a condition.

Given my random list of DNA sequences, I want to keep only the sequences in which the number of A's equals the number of T's and the number of C's equals the number of G's.

My attempt returns the position of each matching string rather than the string itself. How can I get the strings in my output? Any advice would be great, thank you!

This is what I tried to do so far:

import numpy as np 
BASES = ('A','C','T','G')
P = (0.2, 0.3, 0.2, 0.3)

def random_dna_sequence(length): 
    return ''.join(np.random.choice(BASES, p=P) for _ in range(length)) 

dna = [random_dna_sequence(20) for _ in range(300)]  # dna = 300 sequences of 20 characters each
print(dna)

Then I tried to build the list of sequences that satisfy this condition:

# Build a list dna_2 containing only the DNA sequences that satisfy the condition
dna_2 = [i for i in range(len(dna)) if (dna[i].count('A'))== (dna[i].count('T')) and (dna[i].count('C'))== (dna[i].count('G'))]
print(dna_2)

My output contains the positions of the sequences that satisfy the condition, not the sequences themselves:

[29, 41, 66, 85, 88, 117, 142, 174, 201, 226, 231, 246, 250, 279, 294, 299, 306, 338, 370, 372, 381, 404, 420, 486, 519, 579]

My desired output would be:

['AACTGACTTG', ...]

Thank you all!

    In the second list comprehension you need dna[i] for i in ... Commented Mar 29, 2021 at 22:17

4 Answers

1
dna_2 = [i for i in range(len(dna)) if (dna[i].count('A'))== (dna[i].count('T')) and (dna[i].count('C'))== (dna[i].count('G'))]

i here refers to the index (which you probably know but mistyped since you are using dna[i] when calling .count).

You can change it to dna_2 = [dna[i] for i ...] or, better yet, iterate directly over the sequence strings instead of needlessly going through the indexes:

dna_2 = [sequence for sequence in dna if sequence.count('A') ... ]
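
To make the difference concrete, here is a minimal sketch on a small hand-written list (the sample data and the names by_index and by_value are illustrative assumptions, not taken from the question). Both versions should return the strings themselves:

```python
# Illustrative sample data (an assumption, chosen so the result is easy to verify by eye)
dna = ["AATT", "ACGT", "AAAA", "CCGG", "ACCT"]

# Index-based version: collect dna[i] (the string), not i (the position)
by_index = [dna[i] for i in range(len(dna))
            if dna[i].count('A') == dna[i].count('T')
            and dna[i].count('C') == dna[i].count('G')]

# Direct iteration: no indexes needed at all
by_value = [sequence for sequence in dna
            if sequence.count('A') == sequence.count('T')
            and sequence.count('C') == sequence.count('G')]

print(by_value)  # ['AATT', 'ACGT', 'CCGG']
```

The two lists are identical; the second form simply avoids the index bookkeeping.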

1 Comment

Thank you so much for your explanation! I didn't realise that you gave me the solution as well :D Thank you all!
1

That should be

dna_2 = [dna[i] for i in range(len(dna)) if (dna[i].count('A'))== (dna[i].count('T')) and (dna[i].count('C'))== (dna[i].count('G'))]

instead.

3 Comments

Are you asking or answering?
Ok, answering ;-)
Thank you so much for your help. This is exactly what I was looking for. I wasn't using the 'i' index correctly in the expression. Thanks!! :)
0

For a more elegant solution just iterate over the list rather than iterating over the index:

dna_2 = [this_dna for this_dna in dna if this_dna.count('A') == this_dna.count('T') and this_dna.count('C') == this_dna.count('G')]
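
If the one-liner gets hard to read, one option is to move the condition into a small helper. The sketch below assumes a hypothetical function name, is_balanced, and made-up sample data; neither comes from the original post:

```python
def is_balanced(sequence):
    """Return True when count(A) == count(T) and count(C) == count(G)."""
    return (sequence.count('A') == sequence.count('T')
            and sequence.count('C') == sequence.count('G'))

# Illustrative sample data (an assumption for this sketch)
sample = ["ATCG", "AAAT", "GGCC", ""]

# The comprehension now reads almost like plain English
balanced = [this_dna for this_dna in sample if is_balanced(this_dna)]
print(balanced)  # ['ATCG', 'GGCC', '']
```

Note that the empty string passes the test, because all four counts are zero.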

0

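filter() function

You can also use the built-in filter function with a lambda expression. It is implemented in C, but with a lambda it still calls a Python function for every element, so it is usually not faster than the equivalent list comprehension; measure before relying on a performance gain.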

It is worth learning the lambda syntax, though: a lambda is just an inline, anonymous function.

Here is the syntax for your problem:

random_dna_sequencies = [random_dna_sequence(20) for _ in range(300)]

filtered_dna_sequencies_iterator = filter(lambda dna_sequency:
                                          dna_sequency.count('A') == dna_sequency.count('T') and
                                          dna_sequency.count('C') == dna_sequency.count('G'),
                                          random_dna_sequencies)

filtered_dna_sequencies_list = list(filtered_dna_sequencies_iterator)
print(filtered_dna_sequencies_list)

Let me explain:

  1. the filter() function receives two parameters:
    1. The first parameter is a function, here a lambda expression. Each element of the list random_dna_sequencies (that is, each DNA sequence) is passed to the lambda as its argument, named dna_sequency. The argument is tested against the condition dna_sequency.count('A') == dna_sequency.count('T') and dna_sequency.count('C') == dna_sequency.count('G'). Only the elements that satisfy the condition are yielded by filtered_dna_sequencies_iterator.
    2. The second parameter is the iterable (for example, a list) you want to filter.
  2. the filter() function returns an iterator, not a list. If you only need to loop over the result once, use the iterator directly. If you want to store the results in a list, convert it with filtered_dna_sequencies_list = list(filtered_dna_sequencies_iterator).
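
One practical consequence of getting an iterator back is that it can only be consumed once. A minimal sketch, using made-up sample data and illustrative variable names (sample, first_pass, second_pass are assumptions, not from the answer above):

```python
# Illustrative sample data (an assumption for this sketch)
sample = ["AATT", "ACGA", "CGCG"]

# filter() is lazy: nothing is tested until the iterator is consumed
balanced_iterator = filter(
    lambda s: s.count('A') == s.count('T') and s.count('C') == s.count('G'),
    sample,
)

first_pass = list(balanced_iterator)   # consumes the iterator
second_pass = list(balanced_iterator)  # the iterator is now exhausted

print(first_pass)   # ['AATT', 'CGCG']
print(second_pass)  # []
```

So if the filtered sequences are needed more than once, it is safer to store them in a list right away.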
