I have a dataset like this:
  user_id lapsed_date start_date   end_date
0    A123  2020-01-02 2019-01-02 2019-02-02
1    A123  2020-01-02 2019-02-02 2019-03-02
2    B456  2019-10-01 2019-08-01 2019-09-01
3    B456  2019-10-01 2019-09-01 2019-10-01
generated by this code:
import pandas as pd

sample = {'user_id': ['A123', 'A123', 'B456', 'B456'],
          'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
          'start_date': ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
          'end_date': ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
          }
df = pd.DataFrame(sample, columns=['user_id', 'lapsed_date', 'start_date', 'end_date'])
# Convert the date strings to datetime so they can be compared and offset
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
I'm trying to write a function to achieve this:
   user_id lapsed_date start_date   end_date
0     A123  2020-01-02 2019-01-02 2019-02-02
1     A123  2020-01-02 2019-02-02 2019-03-02
2     A123  2020-01-02 2019-03-02 2019-04-02
3     A123  2020-01-02 2019-04-02 2019-05-02
4     A123  2020-01-02 2019-05-02 2019-06-02
5     A123  2020-01-02 2019-06-02 2019-07-02
6     A123  2020-01-02 2019-07-02 2019-08-02
7     A123  2020-01-02 2019-08-02 2019-09-02
8     A123  2020-01-02 2019-09-02 2019-10-02
9     A123  2020-01-02 2019-10-02 2019-11-02
10    A123  2020-01-02 2019-11-02 2019-12-02
11    A123  2020-01-02 2019-12-02 2020-01-02
12    B456  2019-10-01 2019-08-01 2019-09-01
13    B456  2019-10-01 2019-09-01 2019-10-01
Essentially, the function should keep adding rows for each user_id while max(end_date) is less than lapsed_date. Each newly added row takes the previous row's end_date as its start_date, and the previous row's end_date + 1 month as its end_date.
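To make the rule concrete, here is a minimal sketch of a single step for user A123, assuming pd.DateOffset(months=1) is the intended meaning of "+ 1 month". The names prev_end, lapsed, new_start and new_end are only for this illustration.

```python
import pandas as pd

# Last existing end_date and the lapsed_date for user A123 (from the sample data)
prev_end = pd.Timestamp('2019-03-02')
lapsed = pd.Timestamp('2020-01-02')

# One step of the rule: only add a row while the last end_date is before lapsed_date
if prev_end < lapsed:
    new_start = prev_end                              # previous end_date becomes start_date
    new_end = prev_end + pd.DateOffset(months=1)      # previous end_date + 1 month

print(new_start.date(), new_end.date())  # 2019-03-02 2019-04-02
```

Repeating this step until new_end reaches lapsed_date should give rows 2 to 11 of the desired output.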
I have written the function below.
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
The logic for building the new row works, but the while loop does not repeat: each call adds only one row. So I have to run this apply command manually 10 times:
df = df.groupby('user_id').apply(add_row).reset_index(drop = True)
I'm not really sure what I did wrong with the while loop there. Any advice would be highly appreciated!
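One likely cause, going by the function above: the return x.append(last_row) statement appears to sit inside the while body, so the function exits on the first pass and the loop never gets a second iteration. Separately, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is used here instead. The following is a sketch of one possible fix, not the only way to do it; add_rows and the variable names inside it are hypothetical, and it iterates over the groups explicitly instead of using apply so that every group keeps its user_id column.

```python
import pandas as pd

sample = {'user_id': ['A123', 'A123', 'B456', 'B456'],
          'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
          'start_date': ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
          'end_date': ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']}
df = pd.DataFrame(sample)
for col in ['lapsed_date', 'start_date', 'end_date']:
    df[col] = pd.to_datetime(df[col])


def add_rows(group):
    """Append monthly rows until the latest end_date reaches lapsed_date."""
    lapsed = group['lapsed_date'].max()
    end = group['end_date'].max()
    template = group.iloc[-1].to_dict()   # copy of the last row's values
    new_rows = []
    while end < lapsed:
        next_end = end + pd.DateOffset(months=1)
        # New row: previous end_date becomes start_date, end_date moves one month on
        new_rows.append({**template, 'start_date': end, 'end_date': next_end})
        end = next_end                     # advance so the loop can terminate
    if not new_rows:
        return group
    # Return only after the loop has finished, not inside it
    return pd.concat([group, pd.DataFrame(new_rows)], ignore_index=True)


# Process each user's rows and stitch the results back together
result = pd.concat(
    [add_rows(g) for _, g in df.groupby('user_id')],
    ignore_index=True,
)
print(result)
```

With the sample data this should add ten rows for A123 and none for B456, so a single pass replaces the ten manual apply calls.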
x is meant to be the dataframe. Hence I tried doing df.groupby('user_id').apply(add_row). I'm still fairly new to Python :)