
I'm trying to create a dictionary by running through a for loop, where each entry would have a description of a bacterium, with the key being its DNA sequence. The only problem is that my variables cannot store multiple values: each iteration overwrites the previous one, so I get only one entry in my dictionary.

# reads a FASTA file and separates the descriptions and DNA sequences
for line in resistance_read:
    if line.startswith(">"):
        description = line
    else: 
        sequence = line

#trying to get the output from the for loop and into the dictionary
bacteria_dict = {description:sequence}

Output:

line3description
dna3sequence

However, with the code below, I am able to get all the outputs:

for line in resistance_read:
    if line.startswith(">"):
       print line
    else: 
       print line

Output:

line1description
line2description
line3description
dna1sequence
dna2sequence
dna3sequence
  • That's not how variables work in Python (and indeed in most languages). See en.wikibooks.org/wiki/Python_Programming/Variables_and_Strings Commented Mar 3, 2015 at 22:13
  • How's the incoming file look? How do you know which description lines up with which sequence? Commented Mar 3, 2015 at 22:17
  • Well, my goal is for the for loop to generate multiple outputs; however, I don't know how to capture all of them, and if I assign the outputs to a variable, it is overwritten every time the loop runs. In Python, I believe variables can be reassigned, and they just hold the latest value. Commented Mar 3, 2015 at 22:29

2 Answers

You're constantly overwriting the values of variables in your iterations. sequence and description only hold the last values when the iteration completes.

Instead, create the dictionary first and add to it inside the loop; as a more complex data structure, it can hold more data.
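
One way this could look, as a minimal sketch: it assumes each description line (starting with ">") is immediately followed by a single sequence line. The helper name build_dict and the sample lines are made up for illustration.

```python
def build_dict(lines):
    """Pair each '>' description line with the sequence line that follows it."""
    bacteria_dict = {}  # created once, before the loop
    description = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            description = line  # remember the most recent description
        elif description is not None:
            # add an entry instead of overwriting a single variable
            bacteria_dict[description] = line
    return bacteria_dict


# Hypothetical sample data standing in for the lines of the real file
sample = [">desc1\n", "ACGT\n", ">desc2\n", "TTGA\n"]
bacteria_dict = build_dict(sample)
```

Because the dictionary is filled inside the loop, every description/sequence pair is kept rather than only the last one.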


However, there is an easier way...

First you need to open the file and read the lines. To do that you can use the with context manager:

with open('file_path', 'r') as f:
    # used strip() to remove '\n'
    lines = [line.strip() for line in f]

Now that all the lines are in a list called lines, you want to create a mapping between descriptions and sequences.

If each description line sits directly above its sequence line, use this slicing:

# descriptions: every other line (intervals of 2) starting from index 0
descriptions = lines[0::2]
# sequences: every other line (intervals of 2) starting from index 1
sequences = lines[1::2]

Now use zip to zip them together and create a mapping from each pair:

result = dict(zip(descriptions, sequences))

If it's the other way around, with each sequence line above its description, swap the slices:

result = dict(zip(lines[1::2], lines[0::2]))

Edit:

Following your update, it seems like the way to do it, assuming there is exactly one description for each sequence, is to split the list of lines in half and then zip the halves:

# integer division, so mid can be used as a slice index
mid = len(lines) // 2
result = dict(zip(lines[:mid], lines[mid:]))
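
As a quick check of the halving approach, here is a small sketch using placeholder lines in the same order as the output shown in the question (all descriptions first, then all sequences). The sample values are invented, not real data.

```python
# Hypothetical sample mirroring the question's layout: N descriptions, then N sequences
lines = [
    ">line1description", ">line2description", ">line3description",
    "dna1sequence", "dna2sequence", "dna3sequence",
]

mid = len(lines) // 2  # 3 for this sample
result = dict(zip(lines[:mid], lines[mid:]))

# Show each description next to the sequence it was paired with
for description in sorted(result):
    print("%s -> %s" % (description, result[description]))
```

If the file had an odd number of lines, zip would silently drop the unmatched one, so it may be worth checking len(lines) first.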

Comments

I'm confused as to why we need to remove '\n'. Wouldn't that erase the lines?
\n is used to mark a new (visual) line in a string. Once you have the line as a member of a list, you don't need it anymore.
I feel like I'm so close to grasping this, but I'm still confused as to what 'lines' does.
How about printing it? Use a small file for testing first.
Ohh okay, I see that you are creating a list but why does stripping \n allow me to store each string in? I thought with \n, each entire element is a string and we can store it that way. Without \n, wouldn't it just be one big line? I ran it and I see that removing \n does let me create a list but I still don't get it.

Based on the examples you're showing us, it looks like your file format is N lines of description followed by N lines of DNA sequence. This answer assumes that each description or DNA sequence is one line, and that there are as many sequences as there are descriptions.

If you can comfortably fit everything in memory, then the easiest way I can think of is to start as Reut Sharabani suggests above:

with open('file_path', 'r') as f:
    # used strip() to remove '\n'
    lines = [line.strip() for line in f]

Once you have lines, it's easy to create two lists, zip them up, and create a dict:

descriptions = [line for line in lines if line.startswith('>')]
sequences = [line for line in lines if not line.startswith('>')]
result = dict(zip(sequences, descriptions))

However, if the file is very large, and you don't want to do the equivalent of reading its entire length four times, you could process it only once by storing the descriptions, and matching them up with the sequences as the sequences appear.

result = {}
descriptions = []
with open('file_path', 'r') as f:

    line = f.readline().strip()

    while line.startswith('>'):
        descriptions.append(line)
        line = f.readline().strip()

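    # the while loop stopped at the first sequence line, so pair it up here
    result[line] = descriptions.pop(0)
    for line in f:
        # strip the trailing '\n' so every key is stored the same way
        result[line.strip()] = descriptions.pop(0)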

Of course this runs into trouble if:

  • there are not exactly the same number of sequences as descriptions
  • the sequences are in a different order than the descriptions
  • the sequences and descriptions are NOT in monolithic blocks.
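
The first of these problems can at least be detected before building the dictionary. The following is only a sketch: the helper name pair_sequences is made up, and it assumes the lines have already been stripped. It cannot tell whether the order of the two blocks actually matches.

```python
def pair_sequences(lines):
    """Map each sequence to its description, checking that the counts agree."""
    descriptions = [line for line in lines if line.startswith(">")]
    sequences = [line for line in lines if line and not line.startswith(">")]
    if len(descriptions) != len(sequences):
        # fail loudly instead of letting zip() drop the unmatched lines
        raise ValueError(
            "found %d descriptions but %d sequences"
            % (len(descriptions), len(sequences))
        )
    return dict(zip(sequences, descriptions))


# Hypothetical sample: two descriptions followed by two sequences
paired = pair_sequences([">a", ">b", "AC", "GT"])
```

Raising an error here is a design choice; returning the partial mapping would hide the mismatch.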
