I have a pandas DataFrame that I would like to pull information from to build a nested dictionary for downstream use. However, I'm not very good at working with pandas yet and could use some help!
My dataframe looks something like this:
     Sequence  A_start  A_stop  B_start  B_stop
0  sequence_1        1      25       26     100
1  sequence_2        1      31       32     201
2  sequence_3        1      27       28     231
3  sequence_4        1      39       40     191
I want to write this to a dictionary so that it has this form:
d = {'Sequence': {('A_start', 'A_stop'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('B_start', 'B_stop'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]}}
and looks like this after it has been generated:
{'sequence_1': {('1', '25'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('26', '100'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]},
 'sequence_2': {('1', '31'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('32', '201'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]}, ...}
I thought a dict comprehension might be an easy way to deal with this, but it might end up looking overly complicated. This is what I have so far; it clearly doesn't work yet. I'm not sure whether I can use iteritems() or something other than groupby() to unpack each row into the entries of the dict. Any help would be appreciated!
LTR_sub_features = [{'repeat_region':{'rpt_type':'long_terminal_repeat', 'note':"5'LTR"}}]
gag_sub_features = [{'misc_feature':{'gene': 'Gag', 'note': 'deletion of start codon'}}]
ltr_gag_dict = {
    Sequence: {(A_start, A_end): LTR_sub_features, (B_start, B_end): gag_sub_features}
    for Sequence, A_start, A_end, B_start, B_end in ltr_gag_df.groupby('Sequence')}
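For reference, here is a minimal sketch of one way the target structure could be built. It is not necessarily the best approach. The attempt above fails because iterating over groupby('Sequence') yields (key, sub-frame) pairs, so unpacking into five names raises an error; iterating over rows with itertuples() avoids that. The sketch assumes the columns are named exactly as in the table (A_stop and B_stop, not A_end and B_end), that each Sequence value appears only once, and that the tuple keys should be strings as in the desired output. The example frame is rebuilt by hand here so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the example frame from the question (values copied from the table above).
ltr_gag_df = pd.DataFrame({
    "Sequence": ["sequence_1", "sequence_2", "sequence_3", "sequence_4"],
    "A_start": [1, 1, 1, 1],
    "A_stop": [25, 31, 27, 39],
    "B_start": [26, 32, 28, 40],
    "B_stop": [100, 201, 231, 191],
})

LTR_sub_features = [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}]
gag_sub_features = [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]

# itertuples(index=False) yields one namedtuple per row, with fields named
# after the columns, so each row can be turned into one outer dict entry.
ltr_gag_dict = {
    row.Sequence: {
        # str() is an assumption: it matches the string keys in the desired
        # output; drop it to keep the coordinates as integers.
        (str(row.A_start), str(row.A_stop)): LTR_sub_features,
        (str(row.B_start), str(row.B_stop)): gag_sub_features,
    }
    for row in ltr_gag_df.itertuples(index=False)
}

print(ltr_gag_dict["sequence_1"])
```

Note that every entry refers to the same two feature lists, so mutating one in place would affect all sequences. If independent copies are needed, copy.deepcopy() on each value would be one option.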