I have an unbalanced pandas DataFrame with a MultiIndex, where each row stores a firm-year observation. The sample period (variable year) runs from 2013 to 2017. The dataset includes a variable event, which is set to 1 if an event happens in a given year.

Sample dataset:

#Create dataset
import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})

df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)

I would like to create a new column status based on the existing column event: for each id, status should be 0 until the event happens for the first time, and 1 from that year onwards (including the year in which the event happens).

DataFrame with expected variable status:

            event   status 
id   year
1    2013     1       1
     2014     0       1
     2015     0       1
     2016     0       1
     2017     0       1

2    2014     0       0
     2015     0       0
     2016     1       1
     2017     0       1

3    2016     1       1
     2017     0       1

4    2013     0       0
     2014     1       1
     2015     0       1

5    2014     0       0
     2015     0       0
     2016     0       0
     2017     1       1

I haven't found any useful solutions so far, so any advice would be much appreciated. Thanks!

3 Answers

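We can group by the first level of your index (id), mark the rows where event equals 1, and then take a cumulative sum within each group; cumsum treats True as 1 and False as 0. This gives the desired 0/1 status as long as each id has at most one event, as in your sample data:

df['status'] = df.groupby(level=0)['event'].transform(lambda x: x.eq(1).cumsum())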

Output

         event  status
id year               
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
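
If an id can have more than one event, a cumulative sum keeps counting and status would go above 1. A minimal sketch of one way around that, using a grouped cumulative maximum instead of cumsum; the small dataset here is made up for illustration (id 1 has two events) and is not the data from the question:

```python
import pandas as pd

# Made-up data for illustration: id 1 has events in both 2014 and 2016
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'year': [2013, 2014, 2015, 2016, 2014, 2015, 2016],
                   'event': [0, 1, 0, 1, 0, 0, 1]})
df = df.set_index(['id', 'year']).sort_index()

# Cumulative sum counts events, so it reaches 2 after the second event
count = df.groupby(level='id')['event'].cumsum()

# Cumulative maximum stays at 1 once the first event has occurred
df['status'] = df.groupby(level='id')['event'].cummax()
```

With exactly one event per id, as in the question, both versions give the same column.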

The key is to use cumsum within a groupby:

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})


(df.assign(status = lambda x: x.event.eq(1).mul(1).groupby(x['id']).cumsum())
   .set_index(['id','year']))

Output

        event   status
id  year        
1   2013    1   1
    2014    0   1
    2015    0   1
    2016    0   1
    2017    0   1
2   2014    0   0
    2015    0   0
    2016    1   1
    2017    0   1
3   2016    1   1
    2017    0   1
4   2013    0   0
    2014    1   1
    2015    0   1
5   2014    0   0
    2015    0   0
    2016    0   0
    2017    1   1

4 Comments

The idea is correct, but notice that OP does not have the column id, since it is set as the index.
@Erfan I don't see any reason for setting those as the index in the first place.
That's the DataFrame that OP provides. Just trying to help you out so you have a correct answer, since I wanted to upvote it. Right now it's wrong.
What bothers me is that without .mul(1), cumsum does not cast booleans to integers. Very weird.
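
To tie the comments together, here is a sketch of the same cumsum idea applied to the DataFrame as OP provides it, with id and year already in the index. It groups on the id index level rather than on a column. The explicit astype(int) is my substitution for .mul(1), so that the grouped cumsum starts from integers; it is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5],
                   'year': [2013, 2014, 2015, 2016, 2017, 2014, 2015, 2016, 2017,
                            2016, 2017, 2013, 2014, 2015, 2014, 2015, 2016, 2017],
                   'event': [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# Same layout as in the question: id and year form the MultiIndex
df = df.set_index(['id', 'year']).sort_index()

# Flag event rows, cast the booleans to integers explicitly,
# then accumulate within each id using the index level as the group key
df['status'] = df['event'].eq(1).astype(int).groupby(level='id').cumsum()
```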

A basic, loop-based answer with each step explained:

import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})


# extract unique IDs in order of first appearance
# (a set would not guarantee any particular order)
ids = df["id"].unique()

# initialize a dict to keep the results, keyed by row label
event_years = {}
# open a loop on IDs
for firm_id in ids:
    # set happened to 0
    event_happened = 0
    # open a loop on the rows of the DF pertaining to the current ID
    for index, row in df[df["id"] == firm_id].iterrows():
        # if the event happened, set the variable to 1
        if row["event"] == 1:
            event_happened = 1
        # store the variable under the row's label
        event_years[index] = event_happened

# add the results as a DF column; the Series aligns on the row labels
df["event-year"] = pd.Series(event_years)

### OUTPUT
>>> df
    id  year  event  event-year
0    1  2013      1           1
1    1  2014      0           1
2    1  2015      0           1
3    1  2016      0           1
4    1  2017      0           1
5    2  2014      0           0
6    2  2015      0           0
7    2  2016      1           1
8    2  2017      0           1
9    3  2016      1           1
10   3  2017      0           1
11   4  2013      0           0
12   4  2014      1           1
13   4  2015      0           1
14   5  2014      0           0
15   5  2015      0           0
16   5  2016      0           0
17   5  2017      1           1

If you need the result indexed as in the question, set the index afterwards:

df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)

### OUTPUT
>>> df
         event  event-year
id year                   
1  2013      1           1
   2014      0           1
   2015      0           1
   2016      0           1
   2017      0           1
2  2014      0           0
   2015      0           0
   2016      1           1
   2017      0           1
3  2016      1           1
   2017      0           1
4  2013      0           0
   2014      1           1
   2015      0           1
5  2014      0           0
   2015      0           0
   2016      0           0
   2017      1           1
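
One caveat that applies to the loop above and to grouped cumulative operations alike: they read rows in their stored order, so the years must be in chronological order within each id. A small sketch of what goes wrong otherwise, using a made-up, deliberately unsorted frame (not the data from the question) and a grouped cummax as a stand-in for the loop:

```python
import pandas as pd

# Made-up, deliberately unsorted rows for a single firm
df = pd.DataFrame({'id': [1, 1, 1],
                   'year': [2015, 2013, 2014],
                   'event': [0, 0, 1]})

# In stored row order, 2015 comes before the 2014 event and is left at 0
unsorted_status = df.groupby('id')['event'].cummax()

# Sorting by id and year first puts the years in chronological order
df_sorted = df.sort_values(['id', 'year'])
df_sorted['status'] = df_sorted.groupby('id')['event'].cummax()
```

In the question this is already handled by df.sort_index() on the (id, year) MultiIndex.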
