I have a list of dictionaries with this structure.

    {
        'data' : [[year1, value1], [year2, value2], ... m entries],
        'description' : string,
        'end' : string,
        'f' : string,
        'lastHistoricalperiod' : string, 
        'name' : string,
        'series_id' : string,
        'start' : int,
        'units' : string,
        'unitsshort' : string,
        'updated' : string
    }

I want to put this in a pandas DataFrame that looks like

   year       value  updated                   (other dict keys ... )
0  2040  120.592468  2014-05-23T12:06:16-0400  other key-values
1  2039  120.189987  2014-05-23T12:06:16-0400  ...
2  other year-value pairs ...
...
n

where n = m * len(list of dictionaries), and m is the length of each list in 'data'.

That is, each [year, value] pair in 'data' should have its own row. What I've done thus far is this:

    import pandas as pd

    x = [...]  # list of dictionaries as described above
    # Create an empty DataFrame
    output = pd.DataFrame()

    # Loop through each dictionary in the list
    for dictionary in x:
        # Create a new DataFrame from the 2-D list alone.
        data = dictionary['data']
        y = pd.DataFrame(data, columns=['year', 'value'])
        # Loop through all the other dictionary key-value pairs and fill in values
        for key in dictionary:
            if key != 'data':
                y[key] = dictionary[key]
        # Concatenate the accumulated output with the frame from this dictionary.
        output = pd.concat([output, y], ignore_index=True)

This seems very hacky, and I was wondering if there's a more 'pythonic' way to do this, or at least if there are any obvious speedups here.

2 Answers

If your data is in the form [{}, {}, ...], you can do the following.

The difficulty is the 'data' key of your dictionaries, which holds a nested list rather than a single value.

import pandas as pd

# data is your list of dictionaries, [{...}, {...}, ...]
df = pd.DataFrame(data)
fix = df.groupby(level=0)['data'].apply(lambda x: pd.DataFrame(x.iloc[0], columns=['Year', 'Value']))
fix = fix.reset_index(level=1, drop=True)
df = pd.merge(fix, df.drop(['data'], axis=1), how='inner', left_index=True, right_index=True)

The code does the following:

  1. Creates a DataFrame from your list of dictionaries.
  2. Creates a new DataFrame by stretching each 'data' entry out into one row per [year, value] pair.
  3. Drops the irrelevant inner index level of the MultiIndex that the stretching step produced.
  4. Merges the result with the original frame on the original index to get the desired DataFrame.
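
As a self-contained illustration of the same stretch-and-rejoin idea, here is a minimal sketch that swaps the groupby/merge steps for `DataFrame.explode`, assuming a pandas version that provides it (0.25 or later). It is not the code from this answer: the sample records and the names `records`, `pairs`, and `result` are made up for the example.

```python
import pandas as pd

# Hypothetical sample records in the same shape as the question's dictionaries.
records = [
    {'data': [[2040, 120.5], [2039, 120.1]], 'name': 'a', 'units': 'x'},
    {'data': [[2040, 3.0]], 'name': 'b', 'units': 'y'},
]

df = pd.DataFrame(records)

# One row per [year, value] pair; the other columns are repeated alongside.
exploded = df.explode('data').reset_index(drop=True)

# Split each [year, value] pair into two proper columns.
pairs = pd.DataFrame(exploded['data'].tolist(),
                     columns=['year', 'value'],
                     index=exploded.index)

# Put the new columns next to the remaining metadata columns.
result = pd.concat([pairs, exploded.drop(columns='data')], axis=1)
print(result)
```

Because everything still starts from a single DataFrame, each pair stays attached to the metadata of the dictionary it came from.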

1 Comment

I like this solution a lot. Since everything begins in a dataframe, there's very little opportunity for modifications to the code to decouple each 'data' list from the other heading information.

Some data would have been helpful when answering this question. However, from your data structure some example data might look like this:

dict_list = [{'data'            : [['1999', 1], ['2000', 2], ['2001', 3]],
              'description'     : 'foo_dictionary',
              'end'             : 'foo1',
              'f'               : 'foo2',},
             {'data'            : [['2002', 4], ['2003', 5]],
              'description'     : 'bar_dictionary',
              'end'             : 'bar1',
              'f'               : 'bar2',}
             ]

My suggestion would be to reshape this data into a new dictionary of equal-length lists, one list per column, and then simply pass that dictionary to the pd.DataFrame constructor. The reshaping could look as follows:

data_dict = {'years'        : [],
             'value'        : [],
             'description'  : [],
             'end'          : [],
             'f'            : [],}

for dictionary in dict_list:
    n_rows = len(dictionary['data'])
    data_dict['years'].extend(elem[0] for elem in dictionary['data'])
    data_dict['value'].extend(elem[1] for elem in dictionary['data'])
    # Repeat each scalar field once per [year, value] pair.
    data_dict['description'].extend([dictionary['description']] * n_rows)
    data_dict['end'].extend([dictionary['end']] * n_rows)
    data_dict['f'].extend([dictionary['f']] * n_rows)

and then just pass this to pandas

import pandas as pd
pd.DataFrame(data_dict)

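which gives me the following output (older pandas versions sort the columns alphabetically; newer versions keep the dictionary's insertion order, so the columns may appear in a different order):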

      description   end     f  value years
0  foo_dictionary  foo1  foo2      1  1999
1  foo_dictionary  foo1  foo2      2  2000
2  foo_dictionary  foo1  foo2      3  2001
3  bar_dictionary  bar1  bar2      4  2002
4  bar_dictionary  bar1  bar2      5  2003

I would say that if this is the type of output you want, then this system would be a decent simplification.

In fact you could simplify it even further by creating one dictionary for the year and value columns and another for the remaining values. Then you would not have to type out every key of the new dictionary, and you could fill it with a nested for loop. This could look as follows:

year_val_dict = {'years'        : [],
                 'value'        : []}
other_val_dict = {_key : [] for _key in dict_list[0] if _key!='data'}

for dictionary in dict_list:
    n_rows = len(dictionary['data'])
    year_val_dict['years'].extend(elem[0] for elem in dictionary['data'])
    year_val_dict['value'].extend(elem[1] for elem in dictionary['data'])
    for _key in other_val_dict:
        # Repeat each scalar field once per [year, value] pair.
        other_val_dict[_key].extend([dictionary[_key]] * n_rows)

year_val_dict.update(other_val_dict)
pd.DataFrame(year_val_dict)

NB: this of course assumes that all the dicts in dict_list have the same structure.
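
If the dictionaries might not all share the same keys, one way to relax that assumption is to build one flat dict per row and let pandas align the keys: a list of dicts passed to pd.DataFrame gets NaN wherever a key is missing. A minimal sketch, with made-up sample data and a hypothetical helper name `flatten`:

```python
import pandas as pd

def flatten(dict_list):
    """Yield one flat row dict per [year, value] pair (hypothetical helper)."""
    for dictionary in dict_list:
        # Everything except the nested 'data' list is repeated on each row.
        meta = {k: v for k, v in dictionary.items() if k != 'data'}
        for year, value in dictionary['data']:
            yield {'years': year, 'value': value, **meta}

# Made-up sample data; the second dict deliberately lacks the 'f' key.
sample = [
    {'data': [['1999', 1], ['2000', 2]], 'description': 'foo', 'f': 'foo2'},
    {'data': [['2002', 4]], 'description': 'bar'},
]

frame = pd.DataFrame(list(flatten(sample)))
print(frame)
```

This trades the column-wise lists for row-wise dicts, which is usually easier to read but may be slower on very large inputs.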
