0

I am reading a FASTA file that has a format like this:

>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF 

I have to read the file and then calculate the JC distance (For a pair of sequences, the JC distance is -3/4 * ln(1 - 4/3 * p), where p is the proportion of sites that differ between the pair)

I have set up the skeleton of it but am unsure how to do the rest. AFter reading and calculating the JukesCantor distance I have to write it to a new output file and it should be in a table any help i can get is much appreciated! thanks, new to python AND fasta files

def readData():
    filename = input("Enter the name of the FASTA file: ")
    infile = open(filename, "r")

def CalculateJC(x,y):
    if x == y:
        return 0
    else:
        return 1 # temporary*

def calcDists(seqs):
    output = []
    for seq1 in seqs:
        newrow = []
        for seq2 in seqs:
            dist = calculateJS(seq1,seq2)
            newrow.append(dist)
        output.append(newrow)
        list(enumerate(seasons))
    return output


def outputDists(distMat):
    pass

def main():
    seqs = readData()
    distMat = calcDists(seqs)
    outputDists(distMat)



if__name__ == "__main__":
    main()
1
  • 5
    Take a look at biopython. It supports reading and writing of fasta files. Commented Nov 12, 2014 at 4:45

1 Answer 1

1

You are asking too many questions at a time! Focus on one.

Reading and writing FASTA files is covered in BioPython (as suggested in comments).

I noticed that you aren't calculating your JC distance yet, so perhaps this is where you need help. Here is what I came up with:

import itertools, math

def computeJC(seq1, seq2):
    equal = 0
    for base1, base2 in itertools.izip(seq1, seq2):
        equal += (base1 == base2)
    p = equal / float(len(seq1))     
    return -3/4 * math.log(1 - 4/3 * p)  

The itertools.izip trick is explained here: How can I iterate through two lists in parallel This piece of code will work with any kind of string, and the look will stop when either seq1 or seq2 reaches the end.

Someone else may come up with a "Pythonic one-liner", but try to understand my approach first. It avoids the pitfalls that your code felt into: nested loops, unnecessary branching, runtime list growing, spaghetti code to name a few. Enjoy!

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.