Beginner in Python - Filter list of strings based on condition

Question

I'm new to Python, and I'm trying to filter a list of DNA sequences (strings) based on a condition.

Given my random list of DNA sequences, I want to keep only the sequences in which the number of A's equals the number of T's and the number of C's equals the number of G's.

With my current attempt I get the position of each matching string rather than the string itself. How can I get the strings in my output? Any advice would be great, thank you!

This is what I tried to do so far:

import numpy as np

BASES = ('A', 'C', 'T', 'G')
P = (0.2, 0.3, 0.2, 0.3)

def random_dna_sequence(length):
    return ''.join(np.random.choice(BASES, p=P) for _ in range(length))

dna = [random_dna_sequence(20) for _ in range(300)]  # 300 sequences of 20 characters each
print(dna)

Then I tried to build the list of sequences that satisfy this condition:

# Build a list dna_2 containing only the DNA sequences that satisfy the condition
dna_2 = [i for i in range(len(dna)) if (dna[i].count('A'))== (dna[i].count('T')) and (dna[i].count('C'))== (dna[i].count('G'))]
print(dna_2)

My output contains the positions of the sequences that satisfy the condition, not the sequences themselves:

[29, 41, 66, 85, 88, 117, 142, 174, 201, 226, 231, 246, 250, 279, 294, 299, 306, 338, 370, 372, 381, 404, 420, 486, 519, 579]

And my desired output would be:

['AACTGACTTG', ...]

Thank you all!

In the second list comprehension you need dna[i] for i in ...
– norie, Commented Mar 29, 2021 at 22:17

DeepSpace · Accepted Answer · 2021-03-29 22:35:16Z

1

dna_2 = [i for i in range(len(dna)) if (dna[i].count('A'))== (dna[i].count('T')) and (dna[i].count('C'))== (dna[i].count('G'))]

i here refers to the index (which you probably know but mistyped since you are using dna[i] when calling .count).

You can change it to dna_2 = [dna[i] for i ...] or, better yet, iterate directly over the sequence strings instead of needlessly going through the indexes:

dna_2 = [sequence for sequence in dna if sequence.count('A') ... ]
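
A minimal sketch of the direct-iteration version with the condition written out in full. The short sample list is made up for illustration; in the question, dna would be the list of 300 random sequences.

```python
# Hypothetical sample data standing in for the question's random list.
dna = ["AATT", "ACGT", "AAAC", "CCGG", "ATCGG"]

# Iterate over the strings themselves; no index is needed.
dna_2 = [
    sequence
    for sequence in dna
    if sequence.count("A") == sequence.count("T")
    and sequence.count("C") == sequence.count("G")
]

print(dna_2)  # ['AATT', 'ACGT', 'CCGG']
```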

edited Mar 29, 2021 at 22:35

answered Mar 29, 2021 at 22:18

DeepSpace


1 Comment

Marta Fredes Over a year ago

Thank you so much for your explanation! I didn't realise that you gave me the solution as well :D Thank you all!

Dave Atkinson · Accepted Answer · 2021-03-29 22:24:44Z

1

That should be

dna_2 = [dna[i] for i in range(len(dna)) if (dna[i].count('A'))== (dna[i].count('T')) and (dna[i].count('C'))== (dna[i].count('G'))]

instead.
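
If the positions are still useful alongside the sequences, one option (an addition for illustration, not something the answer proposes) is enumerate(), which yields each index together with its string. The sample list below is hypothetical.

```python
# Hypothetical sample data; the question's list would hold 300 random sequences.
dna = ["AATT", "AAAC", "CCGG"]

# enumerate() yields (index, sequence) pairs, so both can be kept.
matches = [
    (i, seq)
    for i, seq in enumerate(dna)
    if seq.count("A") == seq.count("T") and seq.count("C") == seq.count("G")
]

print(matches)  # [(0, 'AATT'), (2, 'CCGG')]
```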

edited Mar 29, 2021 at 22:24

answered Mar 29, 2021 at 22:21

Dave Atkinson


3 Comments

DeepSpace Over a year ago

Are you asking or answering?

Dave Atkinson Over a year ago

Ok, answering ;-)

Marta Fredes Over a year ago

Thank you so much for your help. This is exactly what I was looking for. I wasn't using the 'i' index correctly in the expression. Thanks!! :)

ModellingGeek · Accepted Answer · 2021-03-29 22:49:44Z

0

For a more elegant solution just iterate over the list rather than iterating over the index:

dna_2 = [this_dna for this_dna in dna if this_dna.count('A') == this_dna.count('T') and this_dna.count('C') == this_dna.count('G')]
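
As a variation on the same idea, the condition can be moved into a small named function. The sketch below uses collections.Counter to tally the bases once per string; the helper name is_balanced and the sample list are assumptions made for illustration, and this is a readability choice rather than a claim about speed.

```python
from collections import Counter

def is_balanced(sequence):
    """Return True when the A and T counts match and the C and G counts match."""
    counts = Counter(sequence)  # a missing base counts as 0
    return counts["A"] == counts["T"] and counts["C"] == counts["G"]

# Hypothetical sample data.
dna = ["AATT", "ACGT", "AAAC", "GGCC"]

dna_2 = [this_dna for this_dna in dna if is_balanced(this_dna)]
print(dna_2)  # ['AATT', 'ACGT', 'GGCC']
```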

answered Mar 29, 2021 at 22:49

ModellingGeek


Vinícius Vargas · Accepted Answer · 2021-03-29 23:02:47Z

filter() function -> performance gain

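You can also use the built-in filter() function with a lambda expression. filter() is implemented in C and returns a lazy iterator, but whether it is actually faster than a list comprehension depends on the predicate, so measure before counting on a performance gain.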

You will need to know the lambda syntax, though: a lambda is just an anonymous, single-expression inline function.
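
To make the "inline function" point concrete, here is a small sketch showing a lambda next to the equivalent def. The names balanced_def and balanced_lambda are made up for this illustration.

```python
# A regular named function...
def balanced_def(seq):
    return seq.count("A") == seq.count("T") and seq.count("C") == seq.count("G")

# ...and the same logic as a lambda: a single expression whose value is returned.
balanced_lambda = lambda seq: (
    seq.count("A") == seq.count("T") and seq.count("C") == seq.count("G")
)

print(balanced_def("AATT"), balanced_lambda("AATT"))  # True True
print(balanced_def("AAAC"), balanced_lambda("AAAC"))  # False False
```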

Here is the syntax for your problem:

random_dna_sequencies = [random_dna_sequence(20) for _ in range(300)]

filtered_dna_sequencies_iterator = filter(lambda dna_sequency:
                                          dna_sequency.count('A') == dna_sequency.count('T') and
                                          dna_sequency.count('C') == dna_sequency.count('G'),
                                          random_dna_sequencies)

filtered_dna_sequencies_list = list(filtered_dna_sequencies_iterator)
print(filtered_dna_sequencies_list)

Let me explain:

The filter() function takes two arguments:
1. The first argument is the predicate, here a lambda expression. Each element of the list random_dna_sequencies (that is, each DNA sequence) is passed to the lambda as the argument named dna_sequency, and the lambda tests it against the condition dna_sequency.count('A') == dna_sequency.count('T') and dna_sequency.count('C') == dna_sequency.count('G'). Only the elements that satisfy the condition are yielded by filtered_dna_sequencies_iterator.
2. The second argument is the iterable (for example, a list) you want to filter.

The filter() function returns an iterator, not a list. If you only need to loop over the result once, use the iterator directly. If you want to store the results in a list, convert it with filtered_dna_sequencies_list = list(filtered_dna_sequencies_iterator).
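
The iterator behaviour described above can be sketched as follows. The sample list and the helper name is_balanced are assumptions for illustration. Note that a filter object can only be consumed once, and that it selects the same elements as the equivalent list comprehension.

```python
# Hypothetical sample data.
dna = ["AATT", "ACGT", "AAAC", "CCGG"]

def is_balanced(seq):
    return seq.count("A") == seq.count("T") and seq.count("C") == seq.count("G")

balanced_iter = filter(is_balanced, dna)  # lazy: nothing is evaluated yet

first_pass = list(balanced_iter)   # consumes the iterator
second_pass = list(balanced_iter)  # the iterator is now exhausted

# The list comprehension picks exactly the same elements.
same_as_comprehension = [seq for seq in dna if is_balanced(seq)]

print(first_pass)   # ['AATT', 'ACGT', 'CCGG']
print(second_pass)  # []
```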
