Pandas: How do I split multiple lists in columns into multiple rows?

Question

I have a pandas DataFrame that looks like the following:

     bus_uid   bus_type    type                      obj_uid  \
0     biomass: DEB31    biomass  output       Simple_139804698384200   
0     biomass: DEB31    biomass   other                        duals   
0     biomass: DEB31    biomass   other                       excess   

                                         datetime  \
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   

                                           values  
0   [1.0, 2.0, 3.0, ...  
0   [4.0, 5.0, 6.0, ...  
0   [7.0, 8.0, 9.0, ...

And want to convert it into the following format:

     bus_uid   bus_type    type                          obj_uid  datetime             values
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 00:00:00  1.0
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 01:00:00  2.0
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 02:00:00  3.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 00:00:00  4.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 01:00:00  5.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 02:00:00  6.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 00:00:00  7.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 01:00:00  8.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 02:00:00  9.0

The columns datetime and values have the same dimension.

I have already asked a similar question here but couldn't manage to apply the solution for my problem with two columns.

What's the best way to convert the DataFrame into the required format?

Stefan · Accepted Answer · 2016-01-02 18:57:17Z

You could iterate through the rows to extract the Index and Series info from the cells. I don't think that reshaping methods work well when you need to extract info at the same time:

Sample data:

rows = 3
df = pd.DataFrame(data={'bus_uid': list(repeat('biomass: DEB31', rows)), 'type': list(repeat('biomass', 3)), 'id': ['id1', 'id2', 'id3'], 'datetime': list(repeat(pd.DatetimeIndex(start=datetime(2016,1,1), periods=3, freq='D'), rows)), 'values': list(repeat([1,2,3], rows))})

          bus_uid                                           datetime   id  \
0  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id1   
1  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id2   
2  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id3   

      type     values  
0  biomass  [1, 2, 3]  
1  biomass  [1, 2, 3]  
2  biomass  [1, 2, 3]

Build new DataFrame as you iterate through the DataFrame rows:

new_df = pd.DataFrame()
for index, cols in df.iterrows():
    extract_df = pd.DataFrame.from_dict({'datetime': cols.ix['datetime'], 'values': cols.ix['values']})
    extract_df = pd.concat([extract_df, cols.drop(['datetime', 'values']).to_frame().T], axis=1).fillna(method='ffill').fillna(method='bfill')
    new_df = pd.concat([new_df, extract_df], ignore_index=True)

to get:

    datetime  values         bus_uid   id     type
0 2016-01-01       1  biomass: DEB31  id1  biomass
1 2016-01-02       2  biomass: DEB31  id1  biomass
2 2016-01-03       3  biomass: DEB31  id1  biomass
3 2016-01-01       1  biomass: DEB31  id2  biomass
4 2016-01-02       2  biomass: DEB31  id2  biomass
5 2016-01-03       3  biomass: DEB31  id2  biomass
6 2016-01-01       1  biomass: DEB31  id3  biomass
7 2016-01-02       2  biomass: DEB31  id3  biomass
8 2016-01-03       3  biomass: DEB31  id3  biomass

jezrael · Accepted Answer · 2016-01-02 18:50:08Z

2

You can extract from columns values and datetime new Series and then merge them with original dataframe df by concat:

s1 = df['values'].apply(pd.Series, 1).stack()
s1.index = s1.index.droplevel(-1) # to line up with df's index
s1.name = 'values' # needs a name to join

s2 = df['datetime'].apply(pd.Series, 1).stack()
s2.index = s2.index.droplevel(-1) # to line up with df's index
s2.name = 'datetime' # needs a name to join

#remove duplicity columns
df = df.drop( ['values', 'datetime'], axis=1)

#concat all together
df= pd.concat([df,s1,s2], axis=1).reset_index(drop=True)

print df

   bus_uid        bus_type            type                 obj_uid values  \
0        0  biomass: DEB31  biomass output  Simple_139804698384200    1.0   
1        0  biomass: DEB31  biomass output  Simple_139804698384200    2.0   
2        0  biomass: DEB31  biomass output  Simple_139804698384200    3.0   
3        0  biomass: DEB31   biomass other                   duals    4.0   
4        0  biomass: DEB31   biomass other                   duals    5.0   
5        0  biomass: DEB31   biomass other                   duals    6.0   
6        0  biomass: DEB31   biomass other                  excess    7.0   
7        0  biomass: DEB31   biomass other                  excess    8.0   
8        0  biomass: DEB31   biomass other                  excess    9.0   

             datetime  
0 2015-01-01 00:00:00  
1 2015-01-01 01:00:00  
2 2015-01-01 02:00:00  
3 2015-01-01 00:00:00  
4 2015-01-01 01:00:00  
5 2015-01-01 02:00:00  
6 2015-01-01 00:00:00  
7 2015-01-01 01:00:00  
8 2015-01-01 02:00:00

edited Jan 2, 2016 at 18:50

answered Jan 2, 2016 at 18:30

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

5 Comments

Cord Kaldemeyer Over a year ago

Thanks for all your answers! I have chosen jezraels answer since it is the most intuitive option in my opinion.

Cord Kaldemeyer Over a year ago

But I haven't thought about performance issues. With "datetime" and "values" having a dimension of 192 entries each, the reshaping already takes more than 8 hours. With my current dataframe and a max. dimension of 8760 entries for "datetime" and "values", reshaping into long format would lead to a dataframe with more than 70.000.000 entries. Maybe it is then smarter to get the required subsets first and then to create the long format for statistics and plotting. Do you have any recommendations on this?

jezrael Over a year ago

Hmmm, if DataFrame is huge it is complicated. You can try change s1 = df['values'].apply(pd.Series, 1).stack() to s1 = pd.DataFrame([x for x in df['values']]).stack() and with datetime similar.

Cord Kaldemeyer Over a year ago

Stefans answer took only 240 seconds for the creation, so I changed my accepted answer for now. I'll do some performance testing later including your suggestion and then see which one suits best for me. Thanks!

Cord Kaldemeyer Over a year ago

It took more than 5 hours with your suggestion so the other option seems to be much faster.

Collectives™ on Stack Overflow

Pandas: How do I split multiple lists in columns into multiple rows?

2 Answers 2

Comments

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related