I have a fasta file as follows:
>scaf1
AAAAAATGTGTGTGTGTGTGYAA
AAAAACACGTGTGTGTG
>scaf2
ACGTGTGTGTGATGTGGY
AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
>scaf3
AAAGTGTGTTGTGAAACACACYAAW
I want to read it into a dictionary in such a way that multiple lines belonging to one sequence go to one key, so the output would be:
{'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}
The script I have written is:
import sys
from collections import defaultdict

fastaseq = open(sys.argv[1], "r")

def readfasta(fastaseq):
    fasta_dict = {}
    for line in fastaseq:
        if line.startswith('>'):
            header = line.strip('\n')[1:]
            sequence = ''
        else:
            sequence = sequence + line.strip('\n')
            fasta_dict[header] = sequence
    return fasta_dict

fastadict = readfasta(fastaseq)
print(fastadict)
It works correctly and quickly for a file like this one, but when the file size grows (to about 1.5 GB) it becomes too slow. The step that takes the time is the concatenation of the sequence lines. Is there a faster way of concatenating the lines into a single string?
sequence += line.strip(...) will be faster, because here you're not extracting the value of sequence, adding data to it, and then assigning it back to sequence again. Just += and that's all.
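Beyond `+=`, a common way to avoid the cost of repeatedly growing a string is to collect each record's lines in a list and call `''.join()` once per record: list appends are amortized O(1), while repeated string concatenation can re-copy the growing sequence on every line. A minimal sketch of that approach (the `io.StringIO` input is just a stand-in for your open file handle):

```python
import io
from collections import defaultdict

def readfasta(handle):
    # Accumulate each record's lines in a list keyed by header,
    # then join each list into one string at the end.
    chunks = defaultdict(list)
    header = None
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            header = line[1:]
        elif header is not None:
            chunks[header].append(line)
    return {name: ''.join(parts) for name, parts in chunks.items()}

example = io.StringIO(">scaf1\nAAAA\nCCCC\n>scaf2\nGGTT\n")
print(readfasta(example))  # {'scaf1': 'AAAACCCC', 'scaf2': 'GGTT'}
```

This also moves the dictionary assignment out of the per-line loop, so each record is written once instead of once per sequence line.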