I have a list of lists:

sample = [['TTTT', 'CCCZ'], ['ATTA', 'CZZC']]
count = [[4,3],[4,2]]
correctionfactor  = [[1.33, 1.5],[1.33,2]]

For each string, I calculate the frequency of each character (pi), square it, and sum the squares; then I calculate het = 1 - sum.

The desired output has the form [[1,2],[1,2]]. NOTE: These are NOT the real expected values. I just need the real values to be in this format.

The problem: I do not know how to pass the lists of lists (sample, count) into this loop to extract the values needed. Previously I passed only a flat list (e.g. ['TACT','TTTT'..]) to this code.

  • I suspect that I need to add an outer for loop that iterates over each element in sample (i.e. over sample[0] = ['TTTT', 'CCCZ'] and sample[1] = ['ATTA', 'CZZC']), but I am not sure how to incorporate that into the code.

Code:

list_of_hets = []
for idx, element in enumerate(sample):
    count_dict = {}
    square_dict = {}
    for base in list(element):
        if base in count_dict:
            count_dict[base] += 1
        else:
            count_dict[base] = 1
    for allele in count_dict: #Calculate frequency of every character
        square_freq = (count_dict[allele] / count[idx])**2 #Square the frequencies
        square_dict[allele] = square_freq        
    pf = 0.0
    for i in square_dict:
        pf += square_dict[i]   # pf --> pi^2 + pj^2...pn^2 #Sum the frequencies
    het = 1-pf                    
    list_of_hets.append(het)
print(list_of_hets)

"Failed" OUTPUT:
line 70, in <module>
square_freq = (count_dict[allele] / count[idx])**2
TypeError: unsupported operand type(s) for /: 'int' and 'list'
  • The error message tells you exactly what's wrong: square_freq = (count_dict[allele] / counts[idx])**2 is raising TypeError: unsupported operand type(s) for /: 'int' and 'list'. You can't divide an int by a list. By the way, this doesn't match the code you wrote, which would probably raise another TypeError when you try to pass counts[idx] to float. Commented Oct 11, 2016 at 8:43
  • I am trying to use a zip command like square_freq = [[n/d for n, d in zip(subq, subr)] for subq, subr in zip(count_dict[allele], counts)]. But I'm still having errors. Any other suggestions? Commented Oct 11, 2016 at 8:47
  • @PM2Ring I have corrected it. Thanks for pointing it out Commented Oct 11, 2016 at 8:47
  • What are subq, subr??? Commented Oct 11, 2016 at 9:31
  • Also, I have edited the question to highlight the real problem (which I realised as I was troubleshooting). Commented Oct 11, 2016 at 9:34

1 Answer

I'm not completely clear on how you want to handle the 'Z' items in your data, but this code replicates the output for the sample data in https://eval.in/658468

from __future__ import division

bases = set('ACGT')
#sample = [['TTTT', 'CCCZ'], ['ATTA', 'CZZC']]
sample = [['ATTA', 'TTGA'], ['TTCA', 'TTTA']]

list_of_hets = []
for element in sample:
    hets = []
    for seq in element:
        count_dict = {}
        for base in seq:
            if base in count_dict:
                count_dict[base] += 1
            else:
                count_dict[base] = 1
        print(count_dict)

        # Number of valid bases (symbols in 'ACGT') in this sequence
        count = sum(1 for u in seq if u in bases)
        # Sum of squared frequencies; n is the count of each symbol
        pf = sum((n / count) ** 2 for n in count_dict.values())
        hets.append(1 - pf)
    list_of_hets.append(hets)

print(list_of_hets)

output

{'A': 2, 'T': 2}
{'A': 1, 'T': 2, 'G': 1}
{'A': 1, 'C': 1, 'T': 2}
{'A': 1, 'T': 3}
[[0.5, 0.625], [0.625, 0.375]]

This code could be simplified further by using a collections.Counter instead of the count_dict.
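
As a rough sketch of that Counter idea: the helper name heterozygosity below is made up for illustration, and it assumes every symbol in the string counts towards the total (so it does no special handling of 'Z'). For the sample data above, which contains only A, C, G and T, it should give the same nested list.

```python
from collections import Counter

def heterozygosity(seq):
    """Return 1 - sum(p_i ** 2) over the symbols in seq (hypothetical helper)."""
    counts = Counter(seq)                 # symbol -> number of occurrences
    total = float(sum(counts.values()))   # float() keeps true division on Python 2
    return 1 - sum((n / total) ** 2 for n in counts.values())

sample = [['ATTA', 'TTGA'], ['TTCA', 'TTTA']]

# The nested comprehension keeps the list-of-lists shape of sample.
list_of_hets = [[heterozygosity(seq) for seq in element] for element in sample]
print(list_of_hets)  # [[0.5, 0.625], [0.625, 0.375]]
```

Because the comprehension mirrors the nesting of sample, the output keeps one inner list per input group, whatever the group sizes are.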

BTW, if the symbol that's not in 'ACGT' is always 'Z' then we can speed up the count calculation. Get rid of bases = set('ACGT') and change

count = sum(1 for u in seq if u in bases)

to

count = sum(1 for u in seq if u != 'Z')
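
A quick way to convince yourself the two expressions are interchangeable, assuming 'Z' really is the only symbol outside 'ACGT' (the function names here are just for illustration), is to run both over the strings from the question:

```python
bases = set('ACGT')

def count_in_bases(seq):
    # Original form: count symbols that are valid bases.
    return sum(1 for u in seq if u in bases)

def count_not_z(seq):
    # Alternative form: count everything except 'Z'.
    return sum(1 for u in seq if u != 'Z')

seqs = ['TTTT', 'CCCZ', 'ATTA', 'CZZC']
counts_a = [count_in_bases(s) for s in seqs]
counts_b = [count_not_z(s) for s in seqs]
print(counts_a)  # [4, 3, 4, 2]
print(counts_b)  # [4, 3, 4, 2]
```

These match the count values given in the question. If any other non-ACGT symbol can appear, the two forms will disagree and the set-based version is the safer one.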

6 Comments

My final output has to be in the form [[0.5, 0.625],[0.625, 0.375]], because I need to be able to distinguish between the first element in set1 (['ATTA', 'TTGA']) and set2 (['TTCA', 'TTTA']).
Also, don't worry about the "Zs" I have figured out a way of handling it :)
@biogeek: That's easy enough to do. See the new version of my answer.
Also, I don't want to use a function that converts a list into a nested list "externally" (like this stackoverflow.com/a/6614975/6824986). This is just sample data, and I need to be able to make this "split" (list of lists) according to user-specified input (e.g. if it is [['AA','TT','GG'],['GG','CC','TC'],['AA','TT','GG']], the final output should have the form [[1,2,3],[1,2,3],[1,2,3]]).
Thanks a LOT! I have been trying to figure this out forever now.