MultiIndex DataFrame: How to create a new column based on values in other column?

Question

I have an unbalanced Pandas MultiIndex DataFrame where each row stores a firm-year observation. The sample period (variable year) ranges from 2013 to 2017. The dataset includes a variable event, which is set to 1 if an event happens in a given year.

Sample dataset:

#Create dataset
import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})

df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)

I would like to create a new column status based on the existing column event as follows: for each id, status should be 0 until the event happens for the first time, and 1 from that year onward (including the year the event happens).

DataFrame with expected variable status:

            event   status 
id   year
1    2013     1       1
     2014     0       1
     2015     0       1
     2016     0       1
     2017     0       1

2    2014     0       0
     2015     0       0
     2016     1       1
     2017     0       1

3    2016     1       1
     2017     0       1

4    2013     0       0
     2014     1       1
     2015     0       1

5    2014     0       0
     2015     0       0
     2016     0       0
     2017     1       1

I haven't found any useful solutions so far, so any advice would be much appreciated. Thanks!

Erfan · Accepted Answer · 2019-08-08 15:12:38Z

3

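We can group by the first level of your index (id) and mark the rows where event equals 1 with eq(1). Taking the cumsum of that boolean mask treats True as 1 and False as 0, so within each id the running total switches to 1 in the year of the event and stays there: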

df['status'] = df.groupby(level=0)['event'].transform(lambda x: x.eq(1).cumsum())

Output

         event  status
id year               
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
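One caveat with cumsum is that it counts events, so an id with more than one event would end up with a status above 1. The sample data has at most one event per id, so this does not show up there. A minimal sketch of a variant using cummax, which keeps a running maximum instead of a running total, is given below; the small DataFrame with two events for id 1 is hypothetical and only serves to illustrate the difference:

```python
import pandas as pd

# Hypothetical data (not from the question): id 1 has two events
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'year': [2013, 2014, 2015, 2013, 2014, 2015],
                   'event': [1, 0, 1, 0, 1, 0]})
df = df.set_index(['id', 'year']).sort_index()

# cumsum counts events per id, so it reaches 2 for id 1 in 2015
counts = df.groupby(level='id')['event'].cumsum()

# cummax keeps a running maximum per id, so status stays at 1
# after the first event no matter how many events follow
df['status'] = df['event'].eq(1).astype(int).groupby(level='id').cummax()
```

If every id is guaranteed to have at most one event, cumsum and cummax give the same result.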


Mark Wang · Accepted Answer · 2019-08-08 15:09:50Z

0

The key is to use cumsum within a groupby:

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})


(df.assign(status = lambda x: x.event.eq(1).mul(1).groupby(x['id']).cumsum())
   .set_index(['id','year']))

Output

        event   status
id  year        
1   2013    1   1
    2014    0   1
    2015    0   1
    2016    0   1
    2017    0   1
2   2014    0   0
    2015    0   0
    2016    1   1
    2017    0   1
3   2016    1   1
    2017    0   1
4   2013    0   0
    2014    1   1
    2015    0   1
5   2014    0   0
    2015    0   0
    2016    0   0
    2017    1   1
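The same cumsum idea can also be applied to the MultiIndex DataFrame from the question, where id is an index level rather than a column. The sketch below groups on the index level by name and casts the boolean mask to integers explicitly with astype(int), so the result does not depend on how cumsum handles booleans; it assumes the index levels are named id and year as in the question:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5],
                   'year': [2013, 2014, 2015, 2016, 2017, 2014, 2015, 2016, 2017,
                            2016, 2017, 2013, 2014, 2015, 2014, 2015, 2016, 2017],
                   'event': [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]})
df = df.set_index(['id', 'year']).sort_index()

# Build an integer 0/1 mask, then take the running total within each id,
# grouping on the 'id' level of the MultiIndex instead of a column
df['status'] = df['event'].eq(1).astype(int).groupby(level='id').cumsum()
```

Because the grouped Series keeps the original (id, year) index, the result aligns with df without resetting or re-setting the index.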


4 Comments

Erfan Over a year ago

The idea is correct, but notice that OP does not have the column id since it is set as index.

Mark Wang Over a year ago

@Erfan I don't see any reason for setting those as the index in the first place.

Erfan Over a year ago

That's the dataframe that OP provides. I'm just trying to help you get a correct answer, since I wanted to upvote it. Right now it's wrong.

Mark Wang Over a year ago

What bothers me is that without .mul(1), cumsum does not cast the booleans to integers. Very weird.

cccnrc · Accepted Answer · 2019-08-08 15:13:11Z

A basic answer with each step explained:

import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})


# extract unique IDs in order of first appearance
# (a set would not guarantee the same order as the rows)
ids = df["id"].unique()

# initialize a list to keep the results
list_event_years = []
# loop over the IDs
for firm_id in ids:
    # reset the flag to 0 for each ID
    event_happened = 0
    # loop over the rows of the DF belonging to the current ID
    for index, row in df[df["id"] == firm_id].iterrows():
        # if the event happened, set the flag to 1
        if row["event"] == 1:
            event_happened = 1
        # add the flag to the list of results
        list_event_years.append(event_happened)

# add the list of results as a DF column
# (this assumes the rows of each ID are contiguous, as in the sample data)
df["event-year"] = list_event_years

### OUTPUT
>>> df
    id  year  event  event-year
0    1  2013      1           1
1    1  2014      0           1
2    1  2015      0           1
3    1  2016      0           1
4    1  2017      0           1
5    2  2014      0           0
6    2  2015      0           0
7    2  2016      1           1
8    2  2017      0           1
9    3  2016      1           1
10   3  2017      0           1
11   4  2013      0           0
12   4  2014      1           1
13   4  2015      0           1
14   5  2014      0           0
15   5  2015      0           0
16   5  2016      0           0
17   5  2017      1           1

If you need the rows indexed as in the example, set the index afterwards:

df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)

### OUTPUT
>>> df
         event  event-year
id year                   
1  2013      1           1
   2014      0           1
   2015      0           1
   2016      0           1
   2017      0           1
2  2014      0           0
   2015      0           0
   2016      1           1
   2017      0           1
3  2016      1           1
   2017      0           1
4  2013      0           0
   2014      1           1
   2015      0           1
5  2014      0           0
   2015      0           0
   2016      0           0
   2017      1           1
