I have a dataframe generated by this code

import numpy as np
import pandas as pd

lcust = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
lmonth = [3, 4, 5, 9, 3, 5, 99, 101, 102, 105]
lval1 = np.random.randint(2, 100, len(lmonth)).tolist()
lval2 = np.random.rand(len(lmonth)).tolist()
index_ = pd.MultiIndex.from_arrays([lcust, lmonth], names=('number','month'))
df_ = pd.DataFrame(data=np.array([lval1, lval2]).T, columns = ['val1', 'val2'], index = index_)

It looks as follows:

              val1      val2
number month                
1      3       8.0  0.306048
       4      45.0  0.151272
       5      91.0  0.695793
       9      50.0  0.927028
2      3      68.0  0.925622
       5      49.0  0.402069
3      99     58.0  0.704662
       101    93.0  0.759338
       102    10.0  0.555434
       105    39.0  0.030003

My question is whether there is a convenient way to get it to look like this:

              val1_y    val2_y
number month                  
1      3         8.0  0.306048
       4        45.0  0.151272
       5        91.0  0.695793
       6         0.0  0.000000
       7         0.0  0.000000
       8         0.0  0.000000
       9        50.0  0.927028
2      3        68.0  0.925622
       4         0.0  0.000000
       5        49.0  0.402069
3      99       58.0  0.704662
       100       0.0  0.000000
       101      93.0  0.759338
       102      10.0  0.555434
       103       0.0  0.000000
       104       0.0  0.000000
       105      39.0  0.030003

That is, I am looking for some code to fill in the missing months. In my database these values are simply missing, but they should actually be zero, and I need them for further calculations. You can think of number as a customer ID and month as the number of months the customer has been a member. val1 and val2 are some values of interest.

Please let me know in case you need further information.

Many thanks c

2 Answers

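def fill_missing(x):
    """x is one group after grouping by `number`"""
    # Build every (number, month) pair from the group's first to last month,
    # then reindex onto it; months that were absent become NaN and are set to 0.
    return x.reindex(
               list((x.name, v) for v in range(x.index[0][1], x.index[-1][1]+1))
           ).fillna(0)
ret = df_.groupby("number", as_index=False)[["val1", "val2"]].apply(fill_missing)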

The method is modified from the example in the documentation.

The function uses reindex to add the index entries that are missing. The new index is built with the following expression:

list((x.name, v) for v in range(x.index[0][1], x.index[-1][1]+1))

This expression runs from the group's first month, x.index[0][1], to its last month, x.index[-1][1], and generates every month in between. The +1 is needed because range excludes its end point. It assumes that each group is sorted by month.

For example, when number is 1, the first month is 3 and the last is 9, so the expression produces [(1,3), (1,4), (1,5), (1,6), (1,7), (1,8), (1,9)]. (x.name holds the group key, which is 1 here.) This list is the new index that is passed to reindex.

For the entries that did not exist before, (1,6), (1,7) and (1,8), reindex inserts rows filled with NaN. fillna(0) then replaces those NaN values with 0.
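To make the two steps visible, here is a minimal sketch that applies the same expression to a single group by hand. The values are made up and only the index layout follows df_ from the question; the literal 1 stands in for x.name.

```python
import pandas as pd

# Made-up values; only the index layout mirrors df_ from the question.
index_ = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2], [3, 4, 5, 9, 3, 5]], names=("number", "month")
)
df_ = pd.DataFrame(
    {"val1": [8.0, 45.0, 91.0, 50.0, 68.0, 49.0],
     "val2": [0.31, 0.15, 0.70, 0.93, 0.93, 0.40]},
    index=index_,
)

# Select the group for number == 1 by hand, keeping both index levels.
group = df_[df_.index.get_level_values("number") == 1]

# Same expression as in fill_missing, with x.name replaced by the literal 1.
new_index = list((1, v) for v in range(group.index[0][1], group.index[-1][1] + 1))

with_gaps = group.reindex(new_index)   # rows for months 6, 7, 8 appear as NaN
filled = with_gaps.fillna(0)           # NaN replaced by 0
```

Printing with_gaps and filled side by side shows which rows reindex added and what fillna(0) did to them.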


5 Comments

hi, thanks. do you know why the additional index is added?
@clog14 I added more lines. Please see if it makes more sense now.
Hi, thanks for the more detailed explanation. However, do you see the unnamed index level that has been added on the far left of your output? Do you know where it comes from?
@clog14 that is a byproduct from the groupby function.
@clog14 You can use ret.reset_index(level=0, drop=True) to get rid of it. Note that this returns a new object rather than modifying ret in place.
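As an alternative to dropping the extra level afterwards, the sketch below avoids creating it in the first place by looping over the groups and concatenating the results. This is a variation on the answer, not the answer's own code; the helper name fill_missing_months and the sample values are made up for illustration.

```python
import pandas as pd

def fill_missing_months(df):
    """Hypothetical helper: add zero rows for every month missing between
    each number's first and last month."""
    pieces = []
    for number, group in df.groupby(level="number"):
        months = group.index.get_level_values("month")
        # All (number, month) pairs from the first to the last month.
        full = pd.MultiIndex.from_product(
            [[number], range(months.min(), months.max() + 1)],
            names=df.index.names,
        )
        pieces.append(group.reindex(full).fillna(0))
    # Every piece carries the same two index levels, so no extra level appears.
    return pd.concat(pieces)

# Same index as in the question, with made-up values.
lcust = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
lmonth = [3, 4, 5, 9, 3, 5, 99, 101, 102, 105]
index_ = pd.MultiIndex.from_arrays([lcust, lmonth], names=("number", "month"))
df_ = pd.DataFrame(
    {"val1": [float(v) for v in range(10, 110, 10)],
     "val2": [v / 10 for v in range(1, 11)]},
    index=index_,
)

ret = fill_missing_months(df_)
print(ret)
```

The result keeps the number and month level names, which makes later lookups such as ret.loc[(3, 100)] straightforward.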

I broke it down into steps :-)

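# Move 'number' and 'month' out of the index into ordinary columns
df=df_.reset_index()
# For each number, list every month from its first to its last, then flatten to one entry per (number, month)
idx=df.groupby('number').month.apply(lambda x : list(range(x.min(),x.max()+1))).apply(pd.Series).stack().reset_index(level=1,drop=True)
# Reindex the original frame on the full set of pairs and fill the new rows with 0
df_.reindex(pd.MultiIndex.from_arrays([idx.index.tolist(),idx.tolist()])).fillna(0)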
Out[646]: 
         val1      val2
1 3.0    62.0  0.315113
  4.0    55.0  0.145617
  5.0    96.0  0.945375
  6.0     0.0  0.000000
  7.0     0.0  0.000000
  8.0     0.0  0.000000
  9.0    22.0  0.566370
2 3.0    77.0  0.299537
  4.0     0.0  0.000000
  5.0    25.0  0.316074
3 99.0   66.0  0.346118
  100.0   0.0  0.000000
  101.0  40.0  0.838624
  102.0  33.0  0.123600
  103.0   0.0  0.000000
  104.0   0.0  0.000000
  105.0  10.0  0.052360
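In the output above the months are shown as floats (3.0, 4.0, ...) and the index level names are gone, most likely because apply(pd.Series) pads the shorter lists with NaN before stack is called. The following is a sketch of the same idea, one reindex over a complete index, that keeps integer months and the level names. The sample values are made up, and the names bounds and full are only illustrative.

```python
import pandas as pd

# Same index as in the question, with made-up values.
lcust = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
lmonth = [3, 4, 5, 9, 3, 5, 99, 101, 102, 105]
index_ = pd.MultiIndex.from_arrays([lcust, lmonth], names=("number", "month"))
df_ = pd.DataFrame(
    {"val1": [float(v) for v in range(10, 110, 10)],
     "val2": [v / 10 for v in range(1, 11)]},
    index=index_,
)

# First and last month per number.
bounds = df_.reset_index().groupby("number")["month"].agg(["min", "max"])

# One tuple per (number, month) between those bounds, built from plain ints.
full = pd.MultiIndex.from_tuples(
    [(n, m) for n, lo, hi in bounds.itertuples() for m in range(lo, hi + 1)],
    names=df_.index.names,
)

result = df_.reindex(full).fillna(0)
print(result)
```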
