I have a fasta file as follows:

>scaf1
AAAAAATGTGTGTGTGTGTGYAA
AAAAACACGTGTGTGTG
>scaf2
ACGTGTGTGTGATGTGGY
AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
>scaf3
AAAGTGTGTTGTGAAACACACYAAW

I want to read it into a dictionary in such a way that multiple lines belonging to one sequence are stored under a single key. The output would be:

{'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}

The script I have written is:

import sys
from collections import defaultdict

fastaseq = open(sys.argv[1], "r")

def readfasta(fastaseq):
    fasta_dict = {}
    for line in fastaseq:
        if line.startswith('>'):
            header = line.strip('\n')[1:]
            sequence = ''
        else:
            sequence = sequence + line.strip('\n')
        fasta_dict[header] = sequence 
    return fasta_dict

fastadict = readfasta(fastaseq)
print fastadict

It works correctly and quickly for a small file like this, but when the file size grows (to about 1.5 GB), it becomes too slow. The step that takes the time is the concatenation of the sequence lines. Is there a faster way to concatenate the lines into a single string?

2
  • Maybe sequence += line.strip(...) will be faster, because here you're not extracting the value of sequence, adding data to it and then assigning to sequence again. Just += and that's all. Commented Jun 1, 2016 at 13:35
  • Just changed it in my script, it indeed increased the speed, thanks! Commented Jun 1, 2016 at 13:39

1 Answer

Concatenating strings with + has to create a new string every time, since Python strings are immutable, and that is time-consuming.

Instead, collect the lines in a list and use str.join to concatenate them once all the pieces of a record are ready:

import sys

def read_fasta(filename):
    fasta_dict = {}
    chunks = []      # sequence lines of the current record
    header = None
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('>'): # a new record
                # save the previous record to the dict
                if header is not None:
                    fasta_dict[header] = ''.join(chunks)
                    del chunks[:]    # empty the list

                # drop the leading '>' and the trailing newline
                header = line[1:].strip()
            else:
                chunks.append(line.strip())

        # save the last record (skipped if the file contained no header)
        if header is not None:
            fasta_dict[header] = ''.join(chunks)

    return fasta_dict

fastadict = read_fasta(sys.argv[1])
print(fastadict)
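
As a quick sanity check, here is a minimal sketch of the same list-and-join idea, written to accept any iterable of lines (an open file or a plain list) so it can be tried on the sample from the question without a file on disk. The names parse_fasta_lines and sample_lines are illustrative assumptions, not part of the code above.

```python
def parse_fasta_lines(lines):
    """Build {header: sequence} from an iterable of FASTA lines (hypothetical helper)."""
    fasta_dict = {}
    header = None
    chunks = []  # sequence lines of the record currently being read
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A new record starts: store the previous one, if any.
            if header is not None:
                fasta_dict[header] = ''.join(chunks)
            header = line[1:]
            chunks = []
        elif line:
            # Skip blank lines, collect everything else.
            chunks.append(line)
    # Store the final record.
    if header is not None:
        fasta_dict[header] = ''.join(chunks)
    return fasta_dict


# Sample records copied from the question.
sample_lines = [
    ">scaf1\n",
    "AAAAAATGTGTGTGTGTGTGYAA\n",
    "AAAAACACGTGTGTGTG\n",
    ">scaf2\n",
    "ACGTGTGTGTGATGTGGY\n",
    "AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK\n",
    ">scaf3\n",
    "AAAGTGTGTTGTGAAACACACYAAW\n",
]

result = parse_fasta_lines(sample_lines)
print(result)
```

Because the function only iterates over its argument, passing an open file object works the same way as passing the list.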

1 Comment

Yep, using a list and joining at the end seems like a good idea here. I expect that appending items to a list is typically faster than concatenating them to a string (something like O(1) vs O(N) per line). For a header with a substantial number of following lines (say, 100), I'd expect a measurable performance boost.
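
One way to check that expectation on your own machine is a small timing sketch like the one below. The workload (20,000 lines of 60 bases) and the function names concat_plus and concat_join are made up for illustration. The interpreter can sometimes optimize repeated concatenation, so the gap you measure may be larger or smaller than expected.

```python
import timeit


def concat_plus(parts):
    """Build the string by repeated concatenation."""
    sequence = ''
    for part in parts:
        sequence = sequence + part
    return sequence


def concat_join(parts):
    """Collect the pieces in a list and join once at the end."""
    chunks = []
    for part in parts:
        chunks.append(part)
    return ''.join(chunks)


# Hypothetical workload: 20,000 lines of 60 bases each.
parts = ['ACGT' * 15] * 20000

# Both approaches must produce the same string.
assert concat_plus(parts) == concat_join(parts)

# Time each approach; absolute numbers depend on machine and Python version.
for func in (concat_plus, concat_join):
    seconds = timeit.timeit(lambda: func(parts), number=5)
    print('%s: %.4f s' % (func.__name__, seconds))
```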
