
I have a collection of transactions with a date and a price column:

+---------------------------+-------+
|           Date            | Price |
+---------------------------+-------+
| 2016-05-27 10:02:24+00:00 |  2.90 |
| 2016-05-27 10:02:24+00:00 | 14.90 |
| 2016-05-29 07:47:09+00:00 | 12.90 |
| 2016-05-29 11:56:32+00:00 | 16.90 |
| 2016-05-29 22:10:08+00:00 | 11.92 |
+---------------------------+-------+

As the table shows, transactions did not happen every day, and on some days several transactions happened.

My question is: how can I create a DataFrame with dates from the oldest transaction to the newest, adding the missing dates with price 0 while keeping multiple rows for transactions that happened on the same day? The desired result is shown in the following table:

+---------------------------+-------+
|           Date            | Price |
+---------------------------+-------+
| 2016-05-27 10:02:24+00:00 |  2.90 |
| 2016-05-27 10:02:24+00:00 | 14.90 |
| 2016-05-28 00:00:00+00:00 |  0.00 |
| 2016-05-29 07:47:09+00:00 | 12.90 |
| 2016-05-29 11:56:32+00:00 | 16.90 |
| 2016-05-29 22:10:08+00:00 | 11.92 |
+---------------------------+-------+ 

I have tried to create a series with pd.date_range from the oldest date to the newest and then assign the series to the DataFrame, but since the series length does not match the number of rows, this leaves missing values:

d2 = pd.Series(pd.date_range(min(df.Date), max(df.Date)))

df['dates'] = d2 

2 Answers


You can find which dates are missing, then concatenate the missing dates back:

import pandas as pd

missings = [x for x in pd.date_range(df.Date.min().date(), df.Date.max().date(), freq='1D').date
            if x not in df.Date.dt.date.unique()]

df = (pd.concat([df, pd.DataFrame({'Date': pd.to_datetime(missings).tz_localize('UTC'), 'Price': 0})])
        .sort_values('Date'))

Output:

                       Date  Price
0 2016-05-27 10:02:24+00:00   2.90
1 2016-05-27 10:02:24+00:00  14.90
0 2016-05-28 00:00:00+00:00   0.00
2 2016-05-29 07:47:09+00:00  12.90
3 2016-05-29 11:56:32+00:00  16.90
4 2016-05-29 22:10:08+00:00  11.92

It is also possible to find the missing dates with sets, which should be a bit faster:

missings = list(set(pd.date_range(df.Date.min().date(), df.Date.max().date(), freq='1D', tz='UTC').values) 
                 - set(df.Date.dt.normalize().values))
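For reference, here is a self-contained sketch of the set-based variant applied to the sample data from the question (the column names `Date` and `Price` are taken from the question; everything else follows the answer's approach):

```python
import pandas as pd

# Sample data matching the question (timestamps are tz-aware, UTC)
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2016-05-27 10:02:24+00:00', '2016-05-27 10:02:24+00:00',
        '2016-05-29 07:47:09+00:00', '2016-05-29 11:56:32+00:00',
        '2016-05-29 22:10:08+00:00',
    ]),
    'Price': [2.90, 14.90, 12.90, 16.90, 11.92],
})

# Days in the full min-max range that have no transaction
missings = list(set(pd.date_range(df.Date.min().date(), df.Date.max().date(),
                                  freq='1D', tz='UTC'))
                - set(df.Date.dt.normalize()))

# Append the missing days with price 0 and restore chronological order
out = (pd.concat([df, pd.DataFrame({'Date': missings, 'Price': 0.0})])
         .sort_values('Date')
         .reset_index(drop=True))
print(out)
```

The only missing day in this sample is 2016-05-28, so `out` has six rows, with a single zero-price row at midnight of that day.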



You can create a Series over that min-max date range, outer-merge, and fill the resulting NaNs with 0:

df.Date = pd.to_datetime(df.Date)
rng = pd.date_range(start=df.Date.min(), end=df.Date.max(), freq='D')
df = df.set_index('Date')
pd.merge(df, pd.Series(index=rng, name='rng', dtype=float), how='outer', left_index=True, right_index=True).drop(columns='rng').fillna(0)

Output:

    Price
2016-05-27 10:02:24     2.900
2016-05-27 10:02:24     14.900
2016-05-28 10:02:24     0.000
2016-05-29 07:47:09     12.900
2016-05-29 10:02:24     0.000
2016-05-29 11:56:32     16.900
2016-05-29 22:10:08     11.920

Note that I ignored the UTC offsets for convenience; I don't think it affects the solution. Also note that the times for the interpolated days will match the time of day of your minimum date.
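A self-contained sketch of this merge approach, run on the question's sample data (offsets dropped as noted above; `rng` is the helper column name from the answer):

```python
import pandas as pd

# Sample data from the question, without the UTC offsets for brevity
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2016-05-27 10:02:24', '2016-05-27 10:02:24',
        '2016-05-29 07:47:09', '2016-05-29 11:56:32',
        '2016-05-29 22:10:08',
    ]),
    'Price': [2.90, 14.90, 12.90, 16.90, 11.92],
})

# Daily range anchored at the earliest timestamp
rng = pd.date_range(start=df.Date.min(), end=df.Date.max(), freq='D')

# Outer-merge on the index: days with no transaction appear as NaN rows,
# which fillna(0) then turns into zero-price rows
out = (df.set_index('Date')
         .merge(pd.Series(0.0, index=rng, name='rng'),
                how='outer', left_index=True, right_index=True)
         .drop(columns='rng')
         .fillna(0))
print(out)
```

Because `rng` starts at the minimum timestamp (10:02:24), both 2016-05-28 10:02:24 and 2016-05-29 10:02:24 are added as zero-price rows, giving seven rows in total, which matches the output shown above.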

