I have a fasta file as follows:
>scaf1
AAAAAATGTGTGTGTGTGTGYAA
AAAAACACGTGTGTGTG
>scaf2
ACGTGTGTGTGATGTGGY
AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
>scaf3
AAAGTGTGTTGTGAAACACACYAAW
I want to read it into a dictionary in such a way that multiple lines belonging to one sequence go to one key, so the output would be:
{'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}
The script I have written is:
import sys
from collections import defaultdict

fastaseq = open(sys.argv[1], "r")

def readfasta(fastaseq):
    fasta_dict = {}
    for line in fastaseq:
        if line.startswith('>'):
            header = line.strip('\n')[1:]
            sequence = ''
        else:
            sequence = sequence + line.strip('\n')
            fasta_dict[header] = sequence
    return fasta_dict

fastadict = readfasta(fastaseq)
print(fastadict)
It works correctly and quickly for a file like this one, but when the file size grows (to about 1.5 GB) it becomes too slow. The step that takes the time is the concatenation of the sequence lines. Is there a faster way of concatenating the lines into a single string?
sequence += line.strip(...) will be faster, because here you're not extracting the value of sequence, adding data to it, and then assigning it back to sequence again. Just += and that's all.
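Beyond `+=`, a common way to avoid the cost of repeatedly growing a string is to collect each record's lines in a list and call `''.join()` once per record: list appends are amortized O(1), while repeated string concatenation can re-copy the growing sequence on every line. A minimal sketch of that approach (the `io.StringIO` input is just a stand-in for your open file handle):

```python
import io
from collections import defaultdict

def readfasta(handle):
    # Accumulate each record's lines in a list keyed by header,
    # then join each list into one string at the end.
    chunks = defaultdict(list)
    header = None
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            header = line[1:]
        elif header is not None:
            chunks[header].append(line)
    return {name: ''.join(parts) for name, parts in chunks.items()}

example = io.StringIO(">scaf1\nAAAA\nCCCC\n>scaf2\nGGTT\n")
print(readfasta(example))  # {'scaf1': 'AAAACCCC', 'scaf2': 'GGTT'}
```

This also moves the dictionary assignment out of the per-line loop, so each record is written once instead of once per sequence line.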