I have a very large text file that looks like this small example:
>g1
GAATTCCTTGAGGCCTAAATGCATCGGGGTGCTCTGGTTTTGTTGTTGTTATTTCTGAATGACATTTACTTTGGTGCTCTTTATTTTGCGTATTTAAAAC
>g2
TAAGTCCCTAAGCATATATATAATCATGAGTAGTTGTGGGGAAAATAACACCATTAAATGTACCAAAACAAAAGACCGATCACAAACACTGCCGATGTTTCTCTGGCTTAAATTAAATGTATATACAACTTATATGATAAAATACTGGGC
The file consists of many records, and every record has 2 lines. The 1st line starts with > and is the ID; the 2nd line is a sequence of characters. I want to build a dictionary from this file in Python. Each key is the 1st line of a record without the >, and each value is a list of tuples. The numbers in the tuples are defined as follows.
To build the tuples, I split the length of each sequence (the 2nd line of each record) into windows of a fixed size, 10 in this example. In the expected output, each key is an ID and each tuple in its list holds 2 numbers. The 1st tuple starts at 1 and ends at 10, the 2nd starts at 10 and ends at 20, and so on until the end of the sequence, so every tuple after the 1st spans exactly 10 and the number of tuples depends on the length of the sequence.
Here is the expected output:
{ g1: [(1, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)], g2: [(1, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100), (100, 110), (110, 120), (120, 130), (130, 140), (140, 150)]}
I tried the following Python code, but it does not produce the output I expect. How can I fix it?
    from itertools import groupby
    with open('infile.txt') as f:
        groups = groupby(f, key=lambda x: not x.startswith(">"))
        d = {}
        for k,v in groups:
            if not k:
                key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
                d[key] = val
    k = d.keys()
    v = d.values()
    val = [tuple(len(v)/10)]
Or should the value be built with something like { '..' : [ (a,a+9) for a in range(1,len(g2),10) ] }?
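One possible way to get the expected output, as a minimal sketch rather than a definitive fix. It assumes every sequence sits on a single line directly after its > line (as described above), and that a trailing stretch shorter than the window size is simply dropped, which the question does not specify. The names windows and read_windows are made up for this illustration.

```python
def windows(length, step=10):
    """Return [(1, step), (step, 2*step), ...] for a sequence of `length` chars.

    Assumption: a trailing remainder shorter than `step` is ignored.
    """
    # Integer division gives the number of full windows; the first window
    # starts at 1 instead of 0 to match the expected output.
    return [(max(i * step, 1), (i + 1) * step) for i in range(length // step)]


def read_windows(lines, step=10):
    """Map each ID (header line without '>') to the windows of its sequence."""
    result = {}
    key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            key = line[1:]  # drop the leading '>'
        elif key is not None:
            result[key] = windows(len(line), step)
    return result


if __name__ == "__main__":
    # Small in-memory stand-in for the real file.
    sample = [">g1\n", "A" * 30 + "\n", ">g2\n", "C" * 20 + "\n"]
    print(read_windows(sample))
```

For the real file, the same function can be fed the open file object, for example with open('infile.txt') as f: d = read_windows(f). Two details from the attempt above are worth noting: len(v)/10 is a float in Python 3, so integer division (//) is needed to count windows, and (a, a+9) over range(1, n, 10) yields (1, 10), (11, 20), ..., which differs from the expected output where each tuple starts at the previous tuple's end.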