4

I have a pandas DataFrame that looks like the following:

     bus_uid   bus_type    type                      obj_uid  \
0     biomass: DEB31    biomass  output       Simple_139804698384200   
0     biomass: DEB31    biomass   other                        duals   
0     biomass: DEB31    biomass   other                       excess   

                                         datetime  \
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   

                                           values  
0   [1.0, 2.0, 3.0, ...  
0   [4.0, 5.0, 6.0, ...  
0   [7.0, 8.0, 9.0, ...

And want to convert it into the following format:

     bus_uid   bus_type    type                          obj_uid  datetime             values
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 00:00:00  1.0
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 01:00:00  2.0
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 02:00:00  3.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 00:00:00  4.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 01:00:00  5.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 02:00:00  6.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 00:00:00  7.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 01:00:00  8.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 02:00:00  9.0

The columns datetime and values have the same dimension.

I have already asked a similar question here but couldn't manage to apply the solution for my problem with two columns.

What's the best way to convert the DataFrame into the required format?

2 Answers 2

2

You could iterate through the rows to extract the Index and Series info from the cells. I don't think that reshaping methods work well when you need to extract info at the same time:

Sample data:

rows = 3
df = pd.DataFrame(data={'bus_uid': list(repeat('biomass: DEB31', rows)), 'type': list(repeat('biomass', 3)), 'id': ['id1', 'id2', 'id3'], 'datetime': list(repeat(pd.DatetimeIndex(start=datetime(2016,1,1), periods=3, freq='D'), rows)), 'values': list(repeat([1,2,3], rows))})

          bus_uid                                           datetime   id  \
0  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id1   
1  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id2   
2  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id3   

      type     values  
0  biomass  [1, 2, 3]  
1  biomass  [1, 2, 3]  
2  biomass  [1, 2, 3]  

Build new DataFrame as you iterate through the DataFrame rows:

new_df = pd.DataFrame()
for index, cols in df.iterrows():
    extract_df = pd.DataFrame.from_dict({'datetime': cols.ix['datetime'], 'values': cols.ix['values']})
    extract_df = pd.concat([extract_df, cols.drop(['datetime', 'values']).to_frame().T], axis=1).fillna(method='ffill').fillna(method='bfill')
    new_df = pd.concat([new_df, extract_df], ignore_index=True)

to get:

    datetime  values         bus_uid   id     type
0 2016-01-01       1  biomass: DEB31  id1  biomass
1 2016-01-02       2  biomass: DEB31  id1  biomass
2 2016-01-03       3  biomass: DEB31  id1  biomass
3 2016-01-01       1  biomass: DEB31  id2  biomass
4 2016-01-02       2  biomass: DEB31  id2  biomass
5 2016-01-03       3  biomass: DEB31  id2  biomass
6 2016-01-01       1  biomass: DEB31  id3  biomass
7 2016-01-02       2  biomass: DEB31  id3  biomass
8 2016-01-03       3  biomass: DEB31  id3  biomass
Sign up to request clarification or add additional context in comments.

Comments

2

You can extract from columns values and datetime new Series and then merge them with original dataframe df by concat:

s1 = df['values'].apply(pd.Series, 1).stack()
s1.index = s1.index.droplevel(-1) # to line up with df's index
s1.name = 'values' # needs a name to join

s2 = df['datetime'].apply(pd.Series, 1).stack()
s2.index = s2.index.droplevel(-1) # to line up with df's index
s2.name = 'datetime' # needs a name to join

#remove duplicity columns
df = df.drop( ['values', 'datetime'], axis=1)

#concat all together
df= pd.concat([df,s1,s2], axis=1).reset_index(drop=True)
print df

   bus_uid        bus_type            type                 obj_uid values  \
0        0  biomass: DEB31  biomass output  Simple_139804698384200    1.0   
1        0  biomass: DEB31  biomass output  Simple_139804698384200    2.0   
2        0  biomass: DEB31  biomass output  Simple_139804698384200    3.0   
3        0  biomass: DEB31   biomass other                   duals    4.0   
4        0  biomass: DEB31   biomass other                   duals    5.0   
5        0  biomass: DEB31   biomass other                   duals    6.0   
6        0  biomass: DEB31   biomass other                  excess    7.0   
7        0  biomass: DEB31   biomass other                  excess    8.0   
8        0  biomass: DEB31   biomass other                  excess    9.0   

             datetime  
0 2015-01-01 00:00:00  
1 2015-01-01 01:00:00  
2 2015-01-01 02:00:00  
3 2015-01-01 00:00:00  
4 2015-01-01 01:00:00  
5 2015-01-01 02:00:00  
6 2015-01-01 00:00:00  
7 2015-01-01 01:00:00  
8 2015-01-01 02:00:00  

5 Comments

Thanks for all your answers! I have chosen jezraels answer since it is the most intuitive option in my opinion.
But I haven't thought about performance issues. With "datetime" and "values" having a dimension of 192 entries each, the reshaping already takes more than 8 hours. With my current dataframe and a max. dimension of 8760 entries for "datetime" and "values", reshaping into long format would lead to a dataframe with more than 70.000.000 entries. Maybe it is then smarter to get the required subsets first and then to create the long format for statistics and plotting. Do you have any recommendations on this?
Hmmm, if DataFrame is huge it is complicated. You can try change s1 = df['values'].apply(pd.Series, 1).stack() to s1 = pd.DataFrame([x for x in df['values']]).stack() and with datetime similar.
Stefans answer took only 240 seconds for the creation, so I changed my accepted answer for now. I'll do some performance testing later including your suggestion and then see which one suits best for me. Thanks!
It took more than 5 hours with your suggestion so the other option seems to be much faster.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.