I have a dataset like this:

    user_id lapsed_date start_date  end_date
0   A123    2020-01-02  2019-01-02  2019-02-02
1   A123    2020-01-02  2019-02-02  2019-03-02
2   B456    2019-10-01  2019-08-01  2019-09-01
3   B456    2019-10-01  2019-09-01  2019-10-01

generated by this code:

import pandas as pd

sample = {'user_id': ['A123','A123','B456','B456'],
        'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
        'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
        'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
        }

df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])

df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date']) 

I'm trying to write a function to achieve this:

    user_id lapsed_date start_date  end_date
0   A123    2020-01-02  2019-01-02  2019-02-02
1   A123    2020-01-02  2019-02-02  2019-03-02
2   A123    2020-01-02  2019-03-02  2019-04-02
3   A123    2020-01-02  2019-04-02  2019-05-02
4   A123    2020-01-02  2019-05-02  2019-06-02
5   A123    2020-01-02  2019-06-02  2019-07-02
6   A123    2020-01-02  2019-07-02  2019-08-02
7   A123    2020-01-02  2019-08-02  2019-09-02
8   A123    2020-01-02  2019-09-02  2019-10-02
9   A123    2020-01-02  2019-10-02  2019-11-02
10  A123    2020-01-02  2019-11-02  2019-12-02
11  A123    2020-01-02  2019-12-02  2020-01-02
12  B456    2019-10-01  2019-08-01  2019-09-01
13  B456    2019-10-01  2019-09-01  2019-10-01

Essentially, the function should keep adding rows for each user_id while max(end_date) is less than lapsed_date. Each new row takes the previous row's end_date as its start_date, and the previous row's end_date + 1 month as its end_date.

I wrote the function below.

def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x 

The logic for adding a single row works, but the while loop never repeats, so I have to run the following apply call manually 10 times:

df = df.groupby('user_id').apply(add_row).reset_index(drop = True)

I'm not really sure what I did wrong with the while loop there. Any advice would be highly appreciated!

  • What are you passing to add_row as x? Commented Dec 19, 2019 at 23:46
  • x is meant to be the dataframe. Hence I tried doing df.groupby('user_id').apply(add_row). I'm still fairly new to Python :) Commented Dec 19, 2019 at 23:51

1 Answer

There are a few reasons your loop did not work; I will explain them as we go.

def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x 

In the code above, return sits inside the while loop. A return hands the result back to the caller immediately, so the loop exits during its first iteration and you only ever get the result of the first append.
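The effect of a return inside a loop can be seen without pandas at all. The sketch below is only an illustration: first_only and until_done are hypothetical names, and the list of integers stands in for the growing dataframe.

```python
def first_only(limit):
    """Mimics the original add_row: `return` sits inside the loop body."""
    values = [0]
    while values[-1] < limit:
        values.append(values[-1] + 1)
        return values  # leaves the function during the first pass
    return values


def until_done(limit):
    """Same loop, but `return` runs only once the condition is false."""
    values = [0]
    while values[-1] < limit:
        values.append(values[-1] + 1)
    return values


print(first_only(3))  # [0, 1]
print(until_done(3))  # [0, 1, 2, 3]
```

This is also why running the original apply call repeatedly eventually gave the right answer: each call added exactly one row per group.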

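Another caveat with return x.append(last_row) is that DataFrame.append() does not modify the dataframe in place; it returns a new dataframe. To keep the appended row inside a loop, you need to assign the result back: x = x.append(last_row). (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions pd.concat plays the same role.)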

Pandas Append
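The "returns a new frame" behaviour can be checked directly. Since DataFrame.append no longer exists in pandas 2.0 and later, this small sketch assumes pd.concat as the stand-in; it is non-mutating in the same way. The variable names are illustrative only.

```python
import pandas as pd

x = pd.DataFrame({"end_date": pd.to_datetime(["2019-02-02"])})
extra = pd.DataFrame({"end_date": pd.to_datetime(["2019-03-02"])})

# Result is discarded: x itself is left untouched.
pd.concat([x, extra], ignore_index=True)
rows_after_discard = len(x)

# Result is assigned back: x now includes the extra row.
x = pd.concat([x, extra], ignore_index=True)
rows_after_assign = len(x)

print(rows_after_discard, rows_after_assign)
```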

Secondly, the rows have to be added separately for each unique user_id. For that reason, the code below splits the dataframe into one frame per unique user_id.

Here is how you can get this to work:

import pandas as pd

def add_row(df):

    while df['end_date'].max() < df['lapsed_date'].max():

        new_row = {'user_id': df['user_id'].iloc[0],  # .iloc: a group's index need not contain the label 0
                   'lapsed_date': df['lapsed_date'].max(),
                   'start_date': df['end_date'].max(),
                   'end_date': df['end_date'].max() + pd.DateOffset(months=1),
                   }

        df = df.append(new_row, ignore_index = True)

    return df ## Note the return is called OUTSIDE of the while loop, ensuring only the final result is returned.


sample = {'user_id': ['A123','A123','B456','B456'],
        'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
        'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
        'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
        }

df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])

df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date']) 


ids = df['user_id'].unique()

g = df.groupby(['user_id'])

result = pd.DataFrame(columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])

for i in ids:
    group = g.get_group(i)
    result = result.append(add_row(group), ignore_index=True)


print(result)

  1. Split the frame by unique user_id
  2. Create an empty dataframe, result, to hold the output
  3. Iterate over all user_id values
  4. Run the while loop on each group, reassigning df so that it keeps the appended rows
  5. Append each extended group to result and print it
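On pandas 2.0 and later, DataFrame.append is no longer available, so the steps above can be sketched with pd.concat instead. This is a variant under stated assumptions, not the code from the answer: add_rows is a hypothetical name, the new rows are collected in a plain list and concatenated once per group, and the groups are taken directly from iterating over groupby.

```python
import pandas as pd


def add_rows(group):
    """Return `group` extended with monthly rows up to its lapsed_date."""
    lapsed = group["lapsed_date"].max()
    start = group["end_date"].max()
    new_rows = []
    while start < lapsed:
        end = start + pd.DateOffset(months=1)
        new_rows.append({
            "user_id": group["user_id"].iloc[0],
            "lapsed_date": lapsed,
            "start_date": start,
            "end_date": end,
        })
        start = end  # the next row starts where this one ends
    if not new_rows:
        return group  # nothing to add for this user
    return pd.concat([group, pd.DataFrame(new_rows)], ignore_index=True)


sample = {
    "user_id": ["A123", "A123", "B456", "B456"],
    "lapsed_date": ["2020-01-02", "2020-01-02", "2019-10-01", "2019-10-01"],
    "start_date": ["2019-01-02", "2019-02-02", "2019-08-01", "2019-09-01"],
    "end_date": ["2019-02-02", "2019-03-02", "2019-09-01", "2019-10-01"],
}
df = pd.DataFrame(sample)
for col in ["lapsed_date", "start_date", "end_date"]:
    df[col] = pd.to_datetime(df[col])

# Iterating over a groupby yields (key, sub-frame) pairs.
pieces = [add_rows(group) for _, group in df.groupby("user_id")]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Building a list of rows and concatenating once avoids copying the whole frame on every iteration, which is what repeated appends do.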

Hope this helps!

3 Comments

Thank you so much for thorough explanation! I'm going to try this!
It may be better to concat in order to create the result DataFrame than repeatedly append.
Also that for loop is strange. I’m pretty sure you can just iterate over the result of groupby and get the ids and groups that way.
