3

I have a question regarding duplicating rows in a pandas dataframe. I have allocated relevant dates to each observation in the column "relevant shocks" in lists. Observation 22 has an empty list, 23 a list with one date, 24 a list with two dates and 25 a list with three dates (as seen in column "listlength").

My aim is to expand the dataframe in the way that observations with empty lists remain in the dataset with one row while rows with x observations get duplicated x times - as such, rows 22 and 23 should stay in the dataframe once (22 despite the empty list and 23 because it has one relevant date), row 24 should get duplicated once and thus be in the dataframe twice and observation 25 should be duplicated twice and thus be in the dataframe thrice. As such, each row should be in the dataframe as many times as it has relevant shocks (as measured by listlength). Except for the ones with list length 0, they should still remain in the dataframe.

Further, I want to create a new column "relevant shock" which is filled by each of the relevant shocks once and separately.

This is the current dataframe:

    quarter year    pddate      relevant shocks                                                 listlength
22  1       2012    2012-02-15  []                                                              0.0
23  4       2011    2011-11-15  [2011-08-18 00:00:00]                                           1.0
24  3       2011    2011-08-15  [2011-08-18 00:00:00, 2011-09-22 00:00:00]                      2.0
25  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0

The new dataframe should have 7 rows and look as follows:

    quarter year    pddate      relevant shocks                                                 listlength    relevant shock
22  1       2012    2012-02-15  []                                                              0.0
23  4       2011    2011-11-15  [2011-08-18 00:00:00]                                           1.0           2011-08-18 00:00:00
24  3       2011    2011-08-15  [2011-08-18 00:00:00, 2011-09-22 00:00:00]                      2.0           2011-08-18 00:00:00
25  3       2011    2011-08-15  [2011-08-18 00:00:00, 2011-09-22 00:00:00]                      2.0           2011-09-22 00:00:00
26  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0           2011-08-04 00:00:00
27  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0           2011-08-08 00:00:00
28  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0           2011-08-10 00:00:00

So the basic idea would be to add the new column "relevant shock", go through each row, keep it unchanged if it has an empty list in "relevant shocks", also keep it unchanged if it has one date in "relevant shocks", but fill the new column "relevant shock" with that one list entry, duplicate it if it has two list entries in "relevant shocks" and fill the column "relevant shock" in each row with one of the two list entries, respectively, and so on.

Is this possible with Python?

2 Answers 2

8

EDIT for pandas version >= 0.25, a new method explode would do the job really easily:

#first create a copy of the column
df['relevant shock'] = df['relevant shocks']
#explode the new column
df = df.explode('relevant shock').fillna('')
print (df)
#same result than the one below

Old answer

From the column 'relevant shocks' you can use apply, pd.Series and stack to create a row for each date, such as:

df['relevant shocks'].apply(pd.Series).stack()
Out[448]: 
23  0    2011-08-18 00:00:00
24  0    2011-08-18 00:00:00
    1    2011-09-22 00:00:00
25  0    2011-08-04 00:00:00
    1    2011-08-08 00:00:00
    2    2011-08-10 00:00:00
dtype: object

I know the one empty is missing but after you join the result to your df with a reset_index, fillna and drop the extra column. With a df like this:

df = pd.DataFrame({'quarter':[1,2,3,4],
                   'relevant shocks':[[],['2011-08-18 00:00:00'],
                      ['2011-08-18 00:00:00', '2011-09-22 00:00:00'],
                      ['2011-08-04 00:00:00', '2011-08-08 00:00:00', '2011-08-10 00:00:00']]},
                   index=[22,23,24,25])

then you do:

df = (df.join(df['relevant shocks'].apply(pd.Series).stack()
                                   .reset_index(1,name='relevant shock'))
         .fillna('').drop('level_1',1))

and you get:

    quarter                                    relevant shocks  \
22        1                                                 []   
23        2                              [2011-08-18 00:00:00]   
24        3         [2011-08-18 00:00:00, 2011-09-22 00:00:00]   
24        3         [2011-08-18 00:00:00, 2011-09-22 00:00:00]   
25        4  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...   
25        4  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...   
25        4  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...   

         relevant shock  
22                       
23  2011-08-18 00:00:00  
24  2011-08-18 00:00:00  
24  2011-09-22 00:00:00  
25  2011-08-04 00:00:00  
25  2011-08-08 00:00:00  
25  2011-08-10 00:00:00  

EDIT: it seems that for the real data, an error occured with empty list, so to solve it and reset_index at the end:

df = (df.join(df.loc[df['relevant shocks'].str.len() > 0, 'relevant shocks']
                .apply(pd.Series).stack().reset_index(1,name='relevant shock'))
        .fillna('').drop('level_1',1).reset_index(drop=True))
Sign up to request clarification or add additional context in comments.

10 Comments

Thanks for the answer! How can I do that with a dataset of around 14,000 observations where it's not possible to e.g., create the dataframe by hand? Also, I would like to end up with the full dataframe with all columns as written in my initial question. How could I implement these steps for a large amount of observations?
@Julia not sure I understand well your question, but let's say your dataframe is called df, then the only thing you have to do really is the code starting with df = (df.join(df['relevant shocks'].apply(..., you should keep all your column and if you have 3 or 14000 observations, it should work the same way.
@Julia what I wrote before was to explain the manipulation over the column containing the lists and then I create the dataframe by hand because I did not have it :) but it's not necessary steps if you have your data in a dataframe already
Ok thanks, I understand! But somehow, I get the error "invalid type promotion"... I also get it when just trying out your above first line of code.
@Julia I think it's because you don't only have lists in this column but not sure
|
0

Now can use pandas.DataFrame.explode

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.