[Python]Creating a for loop, wanting to make a dictionary

Question

I'm trying to build a dictionary in a for loop, where each entry holds the description of a bacterium and the key is its DNA sequence. The problem is that my variables cannot hold more than one value: each pass through the loop overwrites the previous one, so the dictionary ends up with only one entry.

#reads a fasta file and separates the description and dna sequences
for line in resistance_read:
    if line.startswith(">"):
        description = line
    else: 
        sequence = line

#trying to get the output from the for loop and into the dictionary
bacteria_dict = {description:sequence}

Output:

line3description
dna3sequence

However, with the following code below, I am able to get all the outputs

for line in resistance_read:
    if line.startswith(">"):
       print line
    else: 
       print line

Output:

line1description
line2description
line3description
dna1sequence
dna2sequence
dna3sequence

That's not how variables work in Python (and indeed in most languages). See en.wikibooks.org/wiki/Python_Programming/Variables_and_Strings
– Cuadue, Commented Mar 3, 2015 at 22:13
How does the incoming file look? How do you know which description lines up with which sequence?
– MasterOdin, Commented Mar 3, 2015 at 22:17
Well, my goal is for the for loop to generate multiple outputs. However, I don't know how to capture all of them: if I assign the output to a variable, it gets overwritten every time the loop runs. In Python, I believe variables can be reassigned, and they just end up holding the latest value.
– David, Commented Mar 3, 2015 at 22:29

Community · Accepted Answer · 2017-05-23 10:09:43Z

2

You're constantly overwriting the values of variables in your iterations. sequence and description only hold the last values when the iteration completes.

Instead, create the dictionary before the loop and add an entry to it on each iteration. Unlike a plain variable, a dictionary can hold many values at once.
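A minimal sketch of that idea follows. It assumes each description line is immediately followed by its one-line sequence, which the question does not confirm, and it uses a hand-written list as a stand-in for resistance_read. Following the question, the sequence is the key and the description is the value.

```python
# Hypothetical stand-in for the open file; each description is assumed
# to be directly followed by its sequence.
resistance_read = [
    ">line1description\n",
    "dna1sequence\n",
    ">line2description\n",
    "dna2sequence\n",
]

bacteria_dict = {}    # create the dictionary before the loop
description = None
for line in resistance_read:
    line = line.strip()
    if line.startswith(">"):
        description = line                 # remember the current description
    elif description is not None:
        bacteria_dict[line] = description  # key: sequence, value: description

print(bacteria_dict)
```

Because the assignment happens inside the loop, every pair is stored before description is overwritten by the next one.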

However, there is an easier way...

First you need to open the file and read the lines. To do that you can use the with context manager:

with open('file_path', 'r') as f:
    # used strip() to remove '\n'
    lines = [line.strip() for line in f]

Now that all the lines are in a list called lines, you want to create a mapping between descriptions and sequences.

If each description line sits directly above its sequence line, use this slicing:

# take every other line (step of 2): descriptions start at index 0,
# sequences start at index 1
descriptions = lines[0::2]
sequences = lines[1::2]

Now use zip to zip them together and create a mapping from each pair:

result = dict(zip(descriptions, sequences))
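As a quick check of what the two slices hold, here is a small run on a hand-written list. The sample values are made up for illustration and are not taken from the question's file.

```python
# Made-up sample: description lines at even indexes, sequences at odd ones.
lines = [">desc1", "AAA", ">desc2", "CCC"]

descriptions = lines[0::2]   # indexes 0, 2, ... -> the description lines
sequences = lines[1::2]      # indexes 1, 3, ... -> the sequence lines

# zip pairs the first description with the first sequence, and so on.
result = dict(zip(descriptions, sequences))
print(result)
```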

If it's the other way around you can use this which is the exact opposite:

result = dict(zip(lines[1::2], lines[0::2]))

Edit:

Following your update, it seems the way to do it, assuming there is exactly one description for each sequence, is to split the list of lines in half and then zip the two halves:

# integer division, so that mid can be used as a slice index
mid = len(lines) // 2
result = dict(zip(lines[:mid], lines[mid:]))
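Applied to the six sample lines shown in the question, the split works out as below. The lines list is typed in by hand for illustration, and // is used so that the midpoint is an integer on Python 3 as well as Python 2.

```python
# Sample lines mirroring the question's output: all descriptions first,
# then all sequences, in matching order.
lines = [
    "line1description",
    "line2description",
    "line3description",
    "dna1sequence",
    "dna2sequence",
    "dna3sequence",
]

mid = len(lines) // 2   # integer division keeps mid usable as an index
result = dict(zip(lines[:mid], lines[mid:]))

print(result["line2description"])
```

This only pairs the right lines if the two halves are the same length and in the same order.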

edited May 23, 2017 at 10:09

answered Mar 3, 2015 at 22:27

Reut Sharabani

7 Comments

David Over a year ago

I'm confused as to why we need to remove '\n'. Wouldn't that erase the lines?

Reut Sharabani Over a year ago

\n is used to mark a new (visual) line in a string. Once you have the line as a member in a list - you don't need it anymore.

David Over a year ago

I feel like I'm so close to grasping this, but I'm still confused as to what 'lines' does.

Reut Sharabani Over a year ago

how about printing it? use a small file for testing first.

David Over a year ago

Ohh okay, I see that you are creating a list but why does stripping \n allow me to store each string in? I thought with \n, each entire element is a string and we can store it that way. Without \n, wouldn't it just be one big line? I ran it and I see that removing \n does let me create a list but I still don't get it.

Community · Accepted Answer · 2017-05-23 12:26:29Z

Based on the examples you're showing us, it looks like your file format is N lines of description followed by N lines of DNA sequence. This answer assumes that each description or DNA sequence is one line, and that there are as many sequences as there are descriptions.

If you can comfortably fit everything in memory, then the easiest way I can think of is to start as Reut Sharabani suggests above:

with open('file_path', 'r') as f:
    # used strip() to remove '\n'
    lines = [line.strip() for line in f]

Once you have lines, it's easy to create two lists, zip them up, and create a dict:

descriptions = [line for line in lines if line.startswith('>')]
sequences = [line for line in lines if not line.startswith('>')]
result = dict(zip(sequences, descriptions))

However, if the file is very large, and you don't want to do the equivalent of reading its entire length four times, you could process it only once by storing the descriptions, and matching them up with the sequences as the sequences appear.

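result = {}
descriptions = []
with open('file_path', 'r') as f:

    # collect the leading block of description lines
    line = f.readline().strip()
    while line.startswith('>'):
        descriptions.append(line)
        line = f.readline().strip()

    # 'line' now holds the first sequence; pair it, then the rest, in order
    result[line] = descriptions.pop(0)
    for line in f:
        # strip here too, so every key is stored without its trailing '\n'
        result[line.strip()] = descriptions.pop(0)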

Of course this runs into trouble if:

there are not exactly the same number of sequences as descriptions
the sequences are in a different order than the descriptions
the sequences and descriptions are NOT in monolithic blocks.
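
The first of these conditions can at least be detected before the dictionary is built. Here is a rough sketch, using a hypothetical helper named pair_up that is not part of either answer. The other two conditions cannot be detected from the counts alone.

```python
def pair_up(lines):
    """Map each sequence to its description.

    Assumes the descriptions appear in the same order as the sequences
    (hypothetical helper, for illustration only).
    """
    descriptions = [line for line in lines if line.startswith('>')]
    # 'line and ...' skips empty strings left behind by blank lines
    sequences = [line for line in lines if line and not line.startswith('>')]
    if len(descriptions) != len(sequences):
        raise ValueError(
            "%d descriptions but %d sequences"
            % (len(descriptions), len(sequences))
        )
    return dict(zip(sequences, descriptions))


lines = [">d1", ">d2", "AAA", "CCC"]
print(pair_up(lines))
```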
