
I have a dataframe like this:

data = pd.DataFrame({'ID': [1,2,3], 'Dep':[4,5,6], 'Start Date':['2020-01-01', '2020-01-01', '2020-01-01'], 'End Date':['2020-01-03', '2020-01-01', '2020-01-04']})

   ID  Dep  Start Date  End Date
0   1   4   2020-01-01  2020-01-03
1   2   5   2020-01-01  2020-01-01
2   3   6   2020-01-01  2020-01-04

I would like to expand each row into one row per day between Start Date and End Date (inclusive), with that day stored in a new column. Something like below:

    ID  Dep Start Date  End Date    New Date  
0   1   4   2020-01-01  2020-01-03  2020-01-01   
1   1   4   2020-01-01  2020-01-03  2020-01-02 
2   1   4   2020-01-01  2020-01-03  2020-01-03   
3   2   5   2020-01-01  2020-01-01  2020-01-01    
4   3   6   2020-01-01  2020-01-04  2020-01-01    
5   3   6   2020-01-01  2020-01-04  2020-01-02       
6   3   6   2020-01-01  2020-01-04  2020-01-03
7   3   6   2020-01-01  2020-01-04  2020-01-04

Thank you.

  • can't understand your logic? Commented Mar 25, 2021 at 4:56

2 Answers


Use pd.date_range with df.explode:

In [392]: data['New date'] = data.apply(lambda x: pd.date_range(x['Start Date'], x['End Date']), axis=1)

In [395]: data = data.explode('New date')

In [396]: data
Out[396]: 
   ID  Dep  Start Date    End Date   New date
0   1    4  2020-01-01  2020-01-03 2020-01-01
0   1    4  2020-01-01  2020-01-03 2020-01-02
0   1    4  2020-01-01  2020-01-03 2020-01-03
1   2    5  2020-01-01  2020-01-01 2020-01-01
2   3    6  2020-01-01  2020-01-04 2020-01-01
2   3    6  2020-01-01  2020-01-04 2020-01-02
2   3    6  2020-01-01  2020-01-04 2020-01-03
2   3    6  2020-01-01  2020-01-04 2020-01-04
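
The output above keeps the repeated original index (0, 0, 0, 1, ...), while the question shows a fresh 0..7 index. A minimal self-contained sketch of the same `date_range` plus `explode` idea with the index reset afterwards; the variable name `out` and the final `pd.to_datetime` conversion are additions for illustration, not part of the answer above:

```python
import pandas as pd

data = pd.DataFrame({
    "ID": [1, 2, 3],
    "Dep": [4, 5, 6],
    "Start Date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "End Date": ["2020-01-03", "2020-01-01", "2020-01-04"],
})

# One DatetimeIndex per row, covering Start Date..End Date inclusive
data["New Date"] = data.apply(
    lambda row: pd.date_range(row["Start Date"], row["End Date"]), axis=1
)

# explode gives one row per day; reset_index replaces the repeated labels with 0..n-1
out = data.explode("New Date").reset_index(drop=True)

# explode leaves an object column, so convert it back to a datetime dtype
out["New Date"] = pd.to_datetime(out["New Date"])

print(out)
```

Whether to reset the index depends on what comes next: the repeated labels are handy if you still need to map rows back to the original frame.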

8 Comments

I think explode is the bottleneck here.
Oh. What would be better?
Can you also please add timings for a smaller dataset?
Maybe 1000 rows.
190 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If performance is important, you can use this faster solution:

#convert columns to datetimes
data["Start Date"] = pd.to_datetime(data["Start Date"])
data["End Date"] = pd.to_datetime(data["End Date"])

#subtract values and convert to days
s = data["End Date"].sub(data["Start Date"]).dt.days + 1

#repeat index
df = data.loc[data.index.repeat(s)].copy()

#add days by timedeltas
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['New Date'] = df["Start Date"].add(add)

print(df)
   ID  Dep Start Date   End Date   New Date
0   1    4 2020-01-01 2020-01-03 2020-01-01
0   1    4 2020-01-01 2020-01-03 2020-01-02
0   1    4 2020-01-01 2020-01-03 2020-01-03
1   2    5 2020-01-01 2020-01-01 2020-01-01
2   3    6 2020-01-01 2020-01-04 2020-01-01
2   3    6 2020-01-01 2020-01-04 2020-01-02
2   3    6 2020-01-01 2020-01-04 2020-01-03
2   3    6 2020-01-01 2020-01-04 2020-01-04
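
The repeat-based solution above relies on a unique index (both `data.loc[...]` and `groupby(level=0)` use the labels) and on every End Date being on or after its Start Date. A hedged sketch of a position-based variant that does not depend on the index labels; the function name `expand_date_ranges` and its parameters are hypothetical, and it assumes there are no missing dates:

```python
import numpy as np
import pandas as pd


def expand_date_ranges(frame, start="Start Date", end="End Date", new="New Date"):
    """Return one row per day in [start, end] for each row of `frame` (hypothetical helper)."""
    start_dates = pd.to_datetime(frame[start])
    end_dates = pd.to_datetime(frame[end])

    # Number of days per row, counting both endpoints
    n_days = ((end_dates - start_dates).dt.days + 1).to_numpy()
    if (n_days < 1).any():
        raise ValueError("every End Date must be on or after its Start Date")

    # Repeat row positions (not labels), so a non-unique index is fine
    positions = np.repeat(np.arange(len(frame)), n_days)
    out = frame.iloc[positions].reset_index(drop=True)

    # Offset within each original row: 0, 1, ..., n_days - 1
    first_pos = np.repeat(n_days.cumsum() - n_days, n_days)
    offsets = (np.arange(len(out)) - first_pos).astype("timedelta64[D]")

    out[new] = start_dates.to_numpy()[positions] + offsets
    return out


data = pd.DataFrame({
    "ID": [1, 2, 3],
    "Dep": [4, 5, 6],
    "Start Date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "End Date": ["2020-01-03", "2020-01-01", "2020-01-04"],
})
out = expand_date_ranges(data)
print(out)
```

Unlike the code above, this sketch leaves the input frame untouched and returns a result with a fresh 0..n-1 index.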

Timings for 3k rows:

data = pd.concat([data] * 1000, ignore_index=True)

In [12]: %%timeit
    ...: data["Start Date"] = pd.to_datetime(data["Start Date"])
    ...: data["End Date"] = pd.to_datetime(data["End Date"])
    ...: 
    ...: s = data["End Date"].sub(data["Start Date"]).dt.days + 1
    ...: 
    ...: df = data.loc[data.index.repeat(s)].copy()
    ...: 
    ...: df['New Date'] = df["Start Date"].add(pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d'))
    ...: 
10.4 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#Mayank Porwal's answer is about 56 times slower on this sample data
In [13]: %%timeit
    ...: data['New date'] = data.apply(lambda x: pd.date_range(x['Start Date'], x['End Date']), 1)
    ...: 
    ...: data.explode('New date')
    ...: 
    ...: 
590 ms ± 67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
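
The timings above come from IPython's `%%timeit` and will differ by machine and pandas version. A rough sketch for repeating the comparison in a plain script with the standard `timeit` module; the function names `with_explode` and `with_repeat` are made up for this example, and the frame is kept at 300 rows so the run stays short:

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "ID": [1, 2, 3],
    "Dep": [4, 5, 6],
    "Start Date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "End Date": ["2020-01-03", "2020-01-01", "2020-01-04"],
})
big = pd.concat([base] * 100, ignore_index=True)  # 300 rows
for col in ("Start Date", "End Date"):
    big[col] = pd.to_datetime(big[col])


def with_explode(frame):
    # date_range per row, then explode (first answer)
    out = frame.copy()
    out["New Date"] = out.apply(
        lambda row: pd.date_range(row["Start Date"], row["End Date"]), axis=1
    )
    return out.explode("New Date")


def with_repeat(frame):
    # repeat the index, then add a per-row day counter (second answer)
    n_days = frame["End Date"].sub(frame["Start Date"]).dt.days + 1
    out = frame.loc[frame.index.repeat(n_days)].copy()
    day_offset = pd.to_timedelta(out.groupby(level=0).cumcount(), unit="D")
    out["New Date"] = out["Start Date"].add(day_offset)
    return out


# Best of 3 repeats, 3 calls each, reported per call
t_explode = min(timeit.repeat(lambda: with_explode(big), number=3, repeat=3)) / 3
t_repeat = min(timeit.repeat(lambda: with_repeat(big), number=3, repeat=3)) / 3
print(f"explode: {t_explode * 1e3:.2f} ms, repeat: {t_repeat * 1e3:.2f} ms")
```

Both functions copy their input, so each timed call starts from the same data.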
