1

I have a DataFrame consisting of dates, other columns and a numerical value, where some value combinations in "other columns" could be missing, and I want to populate them from previous dates.

Example. Say the DataFrame is like below. You can see on 2016-01-01, we have data for (LN, A), (LN, B), (NY, A) and (NY, B) on columns (location, band).

        date  location  band  value
0 2016-01-01        LN     A   10.0
1 2016-01-01        LN     B    5.0
2 2016-01-01        NY     A    9.0
3 2016-01-01        NY     B    6.0
4 2016-01-02        LN     A   11.0
5 2016-01-02        NY     B    7.0
6 2016-01-03        NY     A   10.0

Then you notice on 2016-01-02, we only have (LN, A) and (NY, B), but (LN, B) and (NY, A) are missing. Again, on 2016-01-03, only (NY, A) is available; all other three combinations are missing.

What I want to do is to populate the missing combinations of each date from its predecessor. Say for 2016-01-02, I would like to add two more rows, "rolled over" from 2016-01-01: (LN, B, 5.0) and (NY, A, 9.0) for columns (location, band, value). Same for 2016-01-03. So as to make the whole thing like below:

        date  location  band  value
 0 2016-01-01        LN     A   10.0
 1 2016-01-01        LN     B    5.0
 2 2016-01-01        NY     A    9.0
 3 2016-01-01        NY     B    6.0
 4 2016-01-02        LN     A   11.0
 5 2016-01-02        NY     B    7.0
 6 2016-01-03        NY     A   10.0
 7 2016-01-02        LN     B    5.0
 8 2016-01-02        NY     A    9.0
 9 2016-01-03        LN     A   11.0
10 2016-01-03        LN     B    5.0
11 2016-01-03        NY     B    7.0

Note rows 7-11 are populated from rows 1, 2, 4, 7 and 5, respectively. The order is not really important as I can always sort afterwards if all the data I need is present.

Anyone to help? Thanks a lot!

2 Answers 2

2

You can use a unstack/stack method to get all missing values, followed by a forward fill:

# Use unstack/stack to add missing locations.
df = df.set_index(['date', 'location', 'band']) \
       .unstack(level=['location', 'band']) \
       .stack(level=['location', 'band'], dropna=False)

# Forward fill NaN values within ['location', 'band'] groups.
df = df.groupby(level=['location', 'band']).ffill().reset_index()

Or you can directly build a MultiIndex containing all combinations:

# Build the full MultiIndex, set the partial MultiIndex, and reindex.
levels = ['date', 'location', 'band']
full_idx = pd.MultiIndex.from_product([df[col].unique() for col in levels], names=levels)
df = df.set_index(levels).reindex(full_idx)

# Forward fill NaN values within ['location', 'band'] groups.
df = df.groupby(level=['location', 'band']).ffill().reset_index()

The resulting output for either method:

         date location band  value
0  2016-01-01       LN    A   10.0
1  2016-01-01       LN    B    5.0
2  2016-01-01       NY    A    9.0
3  2016-01-01       NY    B    6.0
4  2016-01-02       LN    A   11.0
5  2016-01-02       LN    B    5.0
6  2016-01-02       NY    A    9.0
7  2016-01-02       NY    B    7.0
8  2016-01-03       LN    A   11.0
9  2016-01-03       LN    B    5.0
10 2016-01-03       NY    A   10.0
11 2016-01-03       NY    B    7.0
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the answer! I actually prefer your first solution to the second one - see my reply to @Dickster whose answer is essentially the same as your second one.
This question is old, but it's exactly what I'm looking for at the moment - the only difference is that I want it to be filled with NaN. How can I do that? If I remove the ffill, it does not work to reset the index (which is needed)...
0

My solution, in summary using the product operation to get all the combinations in a multi index, then some stacking and ffill().

df =pd.DataFrame({'date': {0: '2016-01-01', 1: '2016-01-01', 2: '2016-01-01', 3: '2016-01-01', 4: '2016-01-02', 5: '2016-01-02', 6: '2016-01-03'}, 'band': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A'}, 'location': {0: 'LN', 1: 'LN', 2: 'NY', 3: 'NY', 4: 'LN', 5: 'NY', 6: 'NY'}, 'value': {0: 10, 1: 5, 2: 9, 3: 6, 4: 11, 5: 7, 6: 10}})
unique_dates = df['date'].unique()
df.set_index(['date','location','band'],inplace=True)
idx = pd.MultiIndex.from_product([unique_dates,['LN','NY'],['A','B']])
df  = df.reindex(idx)
df = df.unstack(level=[2,1])

which produces:

             value                      
                 A      B       A      B
                LN     LN      NY     NY
2016-01-01 10.0000 5.0000  9.0000 6.0000
2016-01-02 11.0000    nan     nan 7.0000
2016-01-03     nan    nan 10.0000    nan

and finally:

df = df.ffill()
df = df.stack().stack()
print df


                  value
2016-01-01 LN A 10.0000
              B  5.0000
           NY A  9.0000
              B  6.0000
2016-01-02 LN A 11.0000
              B  5.0000
           NY A  9.0000
              B  7.0000
2016-01-03 LN A 11.0000
              B  5.0000
           NY A 10.0000
              B  7.0000

2 Comments

Thanks a lot! Afraid I might wanna take the other answer as the solution to this problem (the unstack/stack one), because (partially of my fault of giving a bad example in my question) I might not need the full combination of dates, ['LN', 'NY'] and ['A', 'B'].Say if the input
Thanks a lot! Afraid I might wanna take the other answer as the solution to this problem (the unstack/stack one), because (partially of my fault of giving a bad example in my question) I might not need the full combination of dates, ['LN', 'NY'] and ['A', 'B']. Say if the input in my original example did not have rows 0 and 4, then I wouldn't expect row 9 in the output either, which can be achieved via the solution from @root. Thanks again for the answer anyway : )

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.