I'm new to Python, and I'm trying to filter a list of DNA sequences (strings) based on a condition.
Given my random list of DNA sequences, I want to keep only the sequences in which the number of A's equals the number of T's and the number of C's equals the number of G's.
My attempt so far gives me the position (index) of each matching string rather than the string itself. How can I get the strings in my output? Any advice would be great, thank you!
This is what I have tried so far:
import numpy as np
BASES = ('A','C','T','G')
P = (0.2, 0.3, 0.2, 0.3)
def random_dna_sequence(length):
    # Draw each base independently with the probabilities in P
    return ''.join(np.random.choice(BASES, p=P) for _ in range(length))
dna = [random_dna_sequence(20) for _ in range(300)]  # 300 sequences of 20 characters each
print(dna)
Then I tried to build a list containing only the sequences that satisfy this condition:
# Build a list dna_2 of the DNA sequences that satisfy the condition
dna_2 = [i for i in range(len(dna)) if dna[i].count('A') == dna[i].count('T') and dna[i].count('C') == dna[i].count('G')]
print(dna_2)
The output contains the positions (indices) of the sequences that satisfy the condition, not the sequences themselves:
[29, 41, 66, 85, 88, 117, 142, 174, 201, 226, 231, 246, 250, 279, 294, 299, 306, 338, 370, 372, 381, 404, 420, 486, 519, 579]
My desired output would look like this:
['AACTGACTTG', ...]
Thank you all!