generating missing rows in running index in pandas

Question

I have a dataframe generated by this code:

import numpy as np
import pandas as pd

lcust = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
lmonth = [3, 4, 5, 9, 3, 5, 99, 101, 102, 105]
lval1 = np.random.randint(2, 100, len(lmonth)).tolist()
lval2 = np.random.rand(len(lmonth)).tolist()
index_ = pd.MultiIndex.from_arrays([lcust, lmonth], names=('number','month'))
df_ = pd.DataFrame(data=np.array([lval1, lval2]).T, columns = ['val1', 'val2'], index = index_)

It looks as follows:

              val1      val2
number month                
1      3       8.0  0.306048
       4      45.0  0.151272
       5      91.0  0.695793
       9      50.0  0.927028
2      3      68.0  0.925622
       5      49.0  0.402069
3      99     58.0  0.704662
       101    93.0  0.759338
       102    10.0  0.555434
       105    39.0  0.030003

My question is whether there is a convenient way to get it to look like this:

              val1_y    val2_y
number month                  
1      3         8.0  0.306048
       4        45.0  0.151272
       5        91.0  0.695793
       6         0.0  0.000000
       7         0.0  0.000000
       8         0.0  0.000000
       9        50.0  0.927028
2      3        68.0  0.925622
       4         0.0  0.000000
       5        49.0  0.402069
3      99       58.0  0.704662
       100       0.0  0.000000
       101      93.0  0.759338
       102      10.0  0.555434
       103       0.0  0.000000
       104       0.0  0.000000
       105      39.0  0.030003

That is, I am looking for some code to fill in the missing months. In my database these values are simply missing, but they should actually be zero, and I need them for further calculations. You can think of number as a customer ID and month as the number of months the customer has been a member. val1 and val2 are some values of interest.

Please let me know in case you need further information.

Many thanks c

Tai · Accepted Answer · 2018-01-25 20:00:20Z

2

def fill_missing(x):
    """x is one group after grouping by `number`; its months are assumed to be sorted."""
    return x.reindex(
               list((x.name, v) for v in range(x.index[0][1], x.index[-1][1]+1))
           ).fillna(0)
ret = df_.groupby("number", as_index=False)[["val1", "val2"]].apply(fill_missing)

The method is modified from the example in the documentation.

Basically, the function uses reindex to add the index entries that are missing. The new index is built by the following line:

list((x.name, v) for v in range(x.index[0][1], x.index[-1][1]+1))

This expression takes the group's first month, x.index[0][1], and its last month, x.index[-1][1] (plus 1, because range excludes its end point), and generates every month in between. It relies on the months being sorted within each group.

For example, when number is 1, the first month is 3 and the last is 9, so the expression creates [(1,3), (1,4), (1,5), (1,6), (1,7), (1,8), (1,9)]. (x.name holds the group key, so here x.name is 1.) This list is the new index that we pass to reindex.

For the entries that did not exist before, {(1,6), (1,7), (1,8)}, reindex fills in NaN. We then replace those NaN values with 0 via fillna(0).
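The reindex and fillna steps described above can be tried in isolation. The following is only a minimal sketch on a hand-built frame for customer 1 (the names group, full_index and filled are made up for illustration, and the values are taken from the question's sample output), not the answer's exact code:

```python
import pandas as pd

# Hypothetical single-customer group: customer 1 has months 3, 4, 5 and 9.
index = pd.MultiIndex.from_tuples(
    [(1, 3), (1, 4), (1, 5), (1, 9)], names=("number", "month")
)
group = pd.DataFrame({"val1": [8.0, 45.0, 91.0, 50.0]}, index=index)

first_month = group.index[0][1]   # 3
last_month = group.index[-1][1]   # 9

# range() excludes its stop value, hence the +1 to include the last month.
full_index = pd.MultiIndex.from_tuples(
    [(1, month) for month in range(first_month, last_month + 1)],
    names=group.index.names,
)

filled = group.reindex(full_index)  # months 6, 7 and 8 are added with NaN
filled = filled.fillna(0)           # the NaN values become 0
print(filled)
```

Building the target as an explicit MultiIndex with names is a choice made here so that the level names are stated rather than inferred; the answer passes a plain list of tuples instead.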

edited Jan 25, 2018 at 20:00

answered Jan 24, 2018 at 19:57

Tai

5 Comments

clog14 Over a year ago

hi, thanks. do you know why the additional index is added?

Tai Over a year ago

@clog14 I added more lines. Please see if it makes more sense now.

clog14 Over a year ago

Hi, thanks for the more detailed explanation. However, do you see that an unnamed index level has been added on the very left of your output? Do you know where that is coming from?

Tai Over a year ago

@clog14 that is a byproduct from the groupby function.

Tai Over a year ago

@clog14 You can use ret.reset_index(level=0, drop=True) to get rid of that (without drop=True the level is moved into a column instead). Note this is not done in place.
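To make the point from the comments concrete, here is a small sketch on a hand-built frame. The frame ret below is hypothetical: its unnamed outer level merely stands in for the extra level discussed above, and it is not the actual groupby output.

```python
import pandas as pd

# Hypothetical result with an extra unnamed outer level (values 0 and 1).
index = pd.MultiIndex.from_tuples(
    [(0, 1, 3), (0, 1, 4), (1, 2, 3)], names=(None, "number", "month")
)
ret = pd.DataFrame({"val1": [8.0, 45.0, 68.0]}, index=index)

# drop=True discards the level; a new frame is returned, ret is unchanged.
cleaned = ret.reset_index(level=0, drop=True)

# Without drop=True the level is kept as a regular column instead.
as_column = ret.reset_index(level=0)

# droplevel(0) is an equivalent way to discard the outer level.
same = ret.droplevel(0)
print(cleaned)
```

Because neither call works in place, the result has to be assigned to a name, as done with cleaned here.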

BENY · Accepted Answer · 2018-01-24 19:30:27Z

1

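I'll break it down into steps :-)

df=df_.reset_index()
# List every month from min to max per customer; expanding with pd.Series pads with NaN, so months become floats
idx=(df.groupby('number').month.apply(lambda x : list(range(x.min(),x.max()+1)))
       .apply(pd.Series).stack().reset_index(level=1,drop=True))
# idx maps each customer number (its index) to one month (its value); use both as the new MultiIndex
df_.reindex(pd.MultiIndex.from_arrays([idx.index.tolist(),idx.tolist()])).fillna(0)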
Out[646]: 
         val1      val2
1 3.0    62.0  0.315113
  4.0    55.0  0.145617
  5.0    96.0  0.945375
  6.0     0.0  0.000000
  7.0     0.0  0.000000
  8.0     0.0  0.000000
  9.0    22.0  0.566370
2 3.0    77.0  0.299537
  4.0     0.0  0.000000
  5.0    25.0  0.316074
3 99.0   66.0  0.346118
  100.0   0.0  0.000000
  101.0  40.0  0.838624
  102.0  33.0  0.123600
  103.0   0.0  0.000000
  104.0   0.0  0.000000
  105.0  10.0  0.052360
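The output above shows the months as floats (3.0, 4.0, ...). As a possible variant, and not the method used in this answer, the sketch below builds the full index from per-customer minimum and maximum months so that the month level stays integer. The names months, bounds, full_index and filled are made up for illustration, and val1 uses fixed stand-in values rather than the random ones from the question.

```python
import pandas as pd

lcust = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
lmonth = [3, 4, 5, 9, 3, 5, 99, 101, 102, 105]
index_ = pd.MultiIndex.from_arrays([lcust, lmonth], names=("number", "month"))
# Deterministic stand-in values (assumption) instead of random data.
df_ = pd.DataFrame({"val1": list(range(1, 11))}, index=index_)

# First and last observed month per customer.
months = df_.index.to_frame(index=False)
bounds = months.groupby("number")["month"].agg(first="min", last="max")

# Every (number, month) pair between the first and last month, inclusive.
full_index = pd.MultiIndex.from_tuples(
    [
        (number, month)
        for number, first, last in bounds.itertuples()
        for month in range(first, last + 1)
    ],
    names=df_.index.names,
)

# fill_value=0 fills only the rows that reindex adds.
filled = df_.reindex(full_index, fill_value=0)
print(filled)
```

One design difference worth noting: fill_value=0 only affects rows created by reindex, whereas a trailing fillna(0) would also overwrite NaN values that were already present in the data.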

answered Jan 24, 2018 at 19:30

BENY
