I have a pandas DataFrame that I would like to pull information from to create a nested dictionary for downstream use. However, I'm not very good at working with pandas yet and could use some help!

My dataframe looks something like this:

     Sequence  A_start  A_stop  B_start  B_stop
0  sequence_1        1      25       26     100
1  sequence_2        1      31       32     201
2  sequence_3        1      27       28     231
3  sequence_4        1      39       40     191

I want to write this to a dictionary so that it has this form:

d = {'Sequence': {('A_start', 'A_stop'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('B_start', 'B_stop'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]}}

and looks like this after it has been generated:

{'sequence_1': {('1', '25'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('26', '100'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]},
'sequence_2': {('1', '31'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('32', '201'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]}, ...}

I thought a comprehension might be an easy way to deal with this, but it might end up looking overly complicated. Below is what I have so far, which clearly doesn't work yet. I'm not sure whether I can use iteritems() or something other than groupby() to get at the values for each entry in the dict. Any help would be appreciated!

LTR_sub_features = [{'repeat_region':{'rpt_type':'long_terminal_repeat', 'note':"5'LTR"}}]
gag_sub_features = [{'misc_feature':{'gene': 'Gag', 'note': 'deletion of start codon'}}]

ltr_gag_dict = {
Sequence: {(A_start,A_end): LTR_sub_features, (B_start,B_end):gag_sub_features} 
for Sequence, A_start, A_end, B_start, B_end in ltr_gag_df.groupby('Sequence')}
  • Try pandas.DataFrame.to_dict Commented Aug 4, 2018 at 5:06

1 Answer

You can use iterrows() to build up a dictionary as you go.
iterrows() yields a tuple for each row, where the first element (i.e., row[0]) is the row's index and the second element (row[1]) is a pd.Series object holding all the values in the row.
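
To make that tuple structure concrete, here is a minimal sketch that inspects what iterrows() yields. The two-row frame is made up for illustration; only the column names and values are borrowed from the question.

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only
df = pd.DataFrame(
    {"A_start": [1, 1], "A_stop": [25, 31]},
    index=["sequence_1", "sequence_2"],
)

# Materialise the iterator so the pairs can be inspected
pairs = list(df.iterrows())
first_index, first_row = pairs[0]

print(type(pairs[0]).__name__)   # tuple
print(first_index)               # sequence_1
print(type(first_row).__name__)  # Series
print(first_row["A_stop"])       # 25
```

Because each item is an (index, Series) pair, it can be unpacked directly in the loop header, which avoids indexing with row[0] and row[1].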

<input>
            A_start A_end   B_start     B_end
sequence_1  0.1     0.025   0.030303    0.001
sequence_2  0.2     0.050   0.060606    0.002
sequence_3  0.3     0.075   0.090909    0.003
sequence_4  0.4     0.100   0.121212    0.004

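A_value = 'some value'
B_value = 'other value'
d = dict()

# iterrows() yields (index, row) pairs, so unpack them in the loop header
for index, row in df.iterrows():
    # each inner key is a (start, end) tuple taken from the current row
    d[index] = {(row['A_start'], row['A_end']): A_value,
                (row['B_start'], row['B_end']): B_value}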

<output>
{'sequence_1': {(0.10000000000000001, 0.025000000000000001): 'some value', (0.030303030303030304, 0.001): 'other value'}, ...}
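
For the data in the question, the same idea can also be written as a dict comprehension, which is closer to what the question was attempting. This is only a sketch under a few assumptions: the frame is rebuilt by hand from the table in the question, itertuples() is swapped in for iterrows() so columns can be read as attributes, and the start/stop values are converted with str() because the desired output shows them quoted. Drop the str() calls if integer keys are preferred.

```python
import pandas as pd

# Data copied by hand from the table in the question
ltr_gag_df = pd.DataFrame({
    "Sequence": ["sequence_1", "sequence_2", "sequence_3", "sequence_4"],
    "A_start": [1, 1, 1, 1],
    "A_stop": [25, 31, 27, 39],
    "B_start": [26, 32, 28, 40],
    "B_stop": [100, 201, 231, 191],
})

LTR_sub_features = [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}]
gag_sub_features = [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]

# itertuples(index=False) yields one namedtuple per row,
# with one field per column (the row index is left out)
ltr_gag_dict = {
    row.Sequence: {
        # str() is an assumption: the desired output shows quoted numbers
        (str(row.A_start), str(row.A_stop)): LTR_sub_features,
        (str(row.B_start), str(row.B_stop)): gag_sub_features,
    }
    for row in ltr_gag_df.itertuples(index=False)
}

print(ltr_gag_dict["sequence_1"])
```

Note that every sequence ends up pointing at the same two list objects. If the feature lists will be modified per sequence later on, copy them (for example with copy.deepcopy) when building each entry.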