
I have a DataFrame that looks like this:

df = pd.DataFrame.from_dict({'id':     [1, 2, 1, 1, 2, 3],
                             'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                             'time':   ['10:00:00', '12:00:00', '10:00:05', '10:00:07', '12:00:03', '15:00:00']})

What I want to get is:

out = pd.DataFrame.from_dict({'id':     [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
                              'reward': [0.1, 0, 0, 0, 0, 0.15, 0.0, 0.05, 0.25, 0.0, 0.0, 0.4, 0.45],
                              'time':   ['10:00:00', '10:00:01', '10:00:02', '10:00:03', '10:00:04', '10:00:05', '10:00:06', '10:00:07',
                                         '12:00:00', '12:00:01', '12:00:02', '12:00:03', '15:00:00']})

In short: for each id, add the missing per-second time rows with reward 0. How do I do this? I wrote something with a loop, but it will be prohibitively slow for my use case, which has several million rows.

2 Answers


Here's one way using groupby.apply where we use date_range to add the missing times. Then merge it back to df and fill in the missing values of the other columns:

df['time'] = pd.to_datetime(df['time'])
# build each id's full per-second range (min/max rather than iat[0]/iat[-1],
# so the data doesn't need to be pre-sorted), then explode to one row per second
full = (df.groupby('id')['time']
          .apply(lambda x: pd.date_range(x.min(), x.max(), freq='s'))
          .explode())
out = df.merge(full, how='right')
out['id'] = out['id'].ffill().astype(int)
out['reward'] = out['reward'].fillna(0)

Output:

    id  reward                time
0    1    0.10 2022-04-23 10:00:00
1    1    0.00 2022-04-23 10:00:01
2    1    0.00 2022-04-23 10:00:02
3    1    0.00 2022-04-23 10:00:03
4    1    0.00 2022-04-23 10:00:04
5    1    0.15 2022-04-23 10:00:05
6    1    0.00 2022-04-23 10:00:06
7    1    0.05 2022-04-23 10:00:07
8    2    0.25 2022-04-23 12:00:00
9    2    0.00 2022-04-23 12:00:01
10   2    0.00 2022-04-23 12:00:02
11   2    0.40 2022-04-23 12:00:03
12   3    0.45 2022-04-23 15:00:00
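
As a side note (my addition, not part of the original answer): if no id ever has two rewards in the same second, a `groupby` + `resample` sketch reaches the same result without an explicit merge, since resample only fills between each group's first and last timestamp:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 1, 2, 3],
                   'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                   'time': pd.to_datetime(['10:00:00', '12:00:00', '10:00:05',
                                           '10:00:07', '12:00:03', '15:00:00'])})

# Resample each id's slice at one-second frequency; empty bins sum to 0.
# Caveat: .sum() would add up rewards that land in the same second.
out = (df.set_index('time')
         .groupby('id')['reward']
         .resample('1s')
         .sum()
         .reset_index())
```

This produces the same 13 rows as the merge-based approach, with `reward` already 0 for the filled-in seconds.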

3 Comments

Thanks, this works great for the example, but on the real-world dataset I get some duplicate rows. In particular, when there's just one row in the original dataset, I get two in the result. I've isolated it to the merge step, but I can't find a reproducible example. Any idea what's going on?
@user623949 that’s strange. Do you have any other columns that could cause such duplication?
Yes, that's what makes it hard to reproduce: I have 133 columns, so I don't know if one of them is causing the issue. What should I watch for?
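
For anyone hitting the duplicate-row issue from the comments, one likely cause (an assumption, since the real dataset isn't reproducible here) is that the merge key ends up being `time` alone, so two ids that share a timestamp cross-match. Keeping `id` as a column in the expanded frame and merging on both keys avoids that:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 1, 2, 3],
                   'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                   'time': pd.to_datetime(['10:00:00', '12:00:00', '10:00:05',
                                           '10:00:07', '12:00:03', '15:00:00'])})

# One (id, time) row for every second between each id's first and last event
spans = df.groupby('id')['time'].agg(['min', 'max'])
expanded = pd.DataFrame(
    [(i, t) for i, (lo, hi) in spans.iterrows()
            for t in pd.date_range(lo, hi, freq='s')],
    columns=['id', 'time'])

# Merging on both keys prevents cross-matching ids that share a timestamp
out = df.merge(expanded, on=['id', 'time'], how='right')
out['reward'] = out['reward'].fillna(0)
```

With `on=['id', 'time']` explicit, any extra columns in the real dataset are carried along but never used as join keys, so they cannot introduce duplicates.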

One option is with complete from pyjanitor to abstract the process:

# dev version has some performance improvements
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

df = df.astype({'time': 'datetime64[ns]'})

# create mapping for expanded time
new_time = {'time' : lambda df: pd.date_range(df.min(), df.max(), freq='1S')}

# generate expanded rows
df.complete(new_time, by = 'id', fill_value = 0)

    id  reward                time
0    1    0.10 2022-04-24 10:00:00
1    1    0.00 2022-04-24 10:00:01
2    1    0.00 2022-04-24 10:00:02
3    1    0.00 2022-04-24 10:00:03
4    1    0.00 2022-04-24 10:00:04
5    1    0.15 2022-04-24 10:00:05
6    1    0.00 2022-04-24 10:00:06
7    1    0.05 2022-04-24 10:00:07
8    2    0.25 2022-04-24 12:00:00
9    2    0.00 2022-04-24 12:00:01
10   2    0.00 2022-04-24 12:00:02
11   2    0.40 2022-04-24 12:00:03
12   3    0.45 2022-04-24 15:00:00

Another option, which could be faster, is a combination of groupby, explode and merge:

# get the min and max dates
temp = df.groupby('id').time.agg(['min', 'max'])

# generate list of dates
outcome = [pd.date_range(start, end, freq='1S') 
           for start, end in 
           zip(temp['min'], temp['max'])]

outcome = pd.Series(outcome, index = temp.index).rename('time').explode()

# merge back to original df
(pd
.merge(outcome, df, on = ['id', 'time'], how = 'outer')
.fillna({'reward':0})
.loc[:, df.columns]
)

    id  reward                time
0    1    0.10 2022-04-24 10:00:00
1    1    0.00 2022-04-24 10:00:01
2    1    0.00 2022-04-24 10:00:02
3    1    0.00 2022-04-24 10:00:03
4    1    0.00 2022-04-24 10:00:04
5    1    0.15 2022-04-24 10:00:05
6    1    0.00 2022-04-24 10:00:06
7    1    0.05 2022-04-24 10:00:07
8    2    0.25 2022-04-24 12:00:00
9    2    0.00 2022-04-24 12:00:01
10   2    0.00 2022-04-24 12:00:02
11   2    0.40 2022-04-24 12:00:03
12   3    0.45 2022-04-24 15:00:00
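
A third sketch (my addition, not part of the answer above) builds the full per-id grid as a MultiIndex and uses `reindex`, which sidesteps the merge entirely and fills the new rows in one step via `fill_value`:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 1, 2, 3],
                   'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                   'time': pd.to_datetime(['10:00:00', '12:00:00', '10:00:05',
                                           '10:00:07', '12:00:03', '15:00:00'])})

# Full (id, second) grid between each id's first and last timestamp
full = pd.MultiIndex.from_tuples(
    [(i, t) for i, g in df.groupby('id')['time']
            for t in pd.date_range(g.min(), g.max(), freq='s')],
    names=['id', 'time'])

# Missing (id, time) pairs get reward 0 via fill_value
out = (df.set_index(['id', 'time'])
         .reindex(full, fill_value=0)
         .reset_index())
```

Because every original `(id, time)` pair falls on an exact second inside its group's span, `reindex` keeps the existing rewards and only the newly created pairs receive 0.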
