How to fill the missing record of Pandas dataframe in pythonic way?

Question

I have a Pandas dataframe 'df' like this:

         X   Y  
IX1 IX2
A   A1  20  30
    A2  20  30
    A5  20  30
B   B2  20  30
    B4  20  30

Some rows are missing, and I want to fill in the gaps like this:

         X   Y  
IX1 IX2
A   A1  20  30
    A2  20  30
    A3  NaN NaN
    A4  NaN NaN
    A5  20  30
B   B2  20  30
    B3  NaN NaN
    B4  20  30

Is there a Pythonic way to do this?

For reference, in numpy there's something called a masked array to handle cases like this.
– Mu Mind, Commented Sep 12, 2012 at 15:58
I plan to use 'df.reindex(index=index_mask)', but I haven't figured out how to build 'index_mask' efficiently.
– bigbug, Commented Sep 13, 2012 at 13:26

Paul H · Accepted Answer · 2016-12-08 21:31:34Z


You need to construct your full index, and then use the reindex method of the dataframe. Like so...

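import pandas
from io import StringIO  # Python 3; the standalone StringIO module was Python 2 only

# Sample data with gaps: A3, A4, B1 and B3 are absent.
datastring = StringIO("""\
C1,C2,C3,C4
A,A1,20,30
A,A2,20,30
A,A5,20,30
B,B2,20,30
B,B4,20,30""")

# Use the first two columns as a two-level index.
dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])

# Every (C1, C2) pair the result should contain, whether present or not.
full_index = [('A', 'A1'), ('A', 'A2'), ('A', 'A3'),
              ('A', 'A4'), ('A', 'A5'), ('B', 'B1'),
              ('B', 'B2'), ('B', 'B3'), ('B', 'B4')]

# Pairs that were absent from the original come back as NaN rows.
new_df = dataframe.reindex(full_index)
new_df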
        C3    C4
C1 C2
A  A1  20.0  30.0
   A2  20.0  30.0
   A3   NaN   NaN
   A4   NaN   NaN
   A5  20.0  30.0
B  B1   NaN   NaN
   B2  20.0  30.0
   B3   NaN   NaN
   B4  20.0  30.0
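
Typing the tuples by hand gets tedious for larger indexes. A minimal sketch of generating the same list, assuming the second-level labels always follow the "group letter plus running number" pattern seen here (the `last_suffix` mapping is a hypothetical helper, not part of the original data):

```python
import pandas as pd

# Hypothetical helper data: highest numeric suffix wanted for each group.
last_suffix = {"A": 5, "B": 4}

# Build ('A', 'A1') ... ('A', 'A5'), ('B', 'B1') ... ('B', 'B4').
full_index = pd.MultiIndex.from_tuples(
    [
        (group, f"{group}{number}")
        for group, last in last_suffix.items()
        for number in range(1, last + 1)
    ],
    names=["C1", "C2"],
)
```

The result can be passed to reindex in place of the hand-written list.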

And then you can use the fillna method to set the NaNs to whatever you want.
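
As a rough illustration of that last step, the sketch below uses a small made-up frame with one empty row and shows a few common fill strategies. Which one is appropriate depends on what the missing rows mean in your data.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the reindexed result, with a gap at A3.
index = pd.MultiIndex.from_tuples(
    [("A", "A1"), ("A", "A2"), ("A", "A3")], names=["C1", "C2"]
)
new_df = pd.DataFrame(
    {"C3": [20.0, 20.0, np.nan], "C4": [30.0, 30.0, np.nan]}, index=index
)

# Replace every NaN with one constant...
filled_zero = new_df.fillna(0)

# ...or give each column its own fill value.
filled_per_column = new_df.fillna({"C3": -1, "C4": new_df["C4"].mean()})

# Forward-fill copies the last valid row down into the gap.
filled_forward = new_df.ffill()
```

If a single constant is all you need, reindex also accepts a fill_value argument, for example dataframe.reindex(full_index, fill_value=0), which avoids the intermediate NaNs.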

update (June 2014)

Just had to revisit this myself... Recent versions of pandas provide MultiIndex.from_product, which builds a MultiIndex from the Cartesian product of iterables. So the above solution could become:

import pandas
from io import StringIO

datastring = StringIO("""\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30""")

dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])

# Cartesian product of both levels; range(1, 6) yields C2 = 1..5.
full_index = pandas.MultiIndex.from_product([('A', 'B'), range(1, 6)], names=['C1', 'C2'])
new_df = dataframe.reindex(full_index)
new_df
        C3    C4
C1 C2
A  1   20.0  30.0
   2   20.0  30.0
   3    NaN   NaN
   4    NaN   NaN
   5   20.0  30.0
B  1    NaN   NaN
   2   20.0  30.0
   3    NaN   NaN
   4   20.0  30.0
   5    NaN   NaN

Pretty elegant, in my opinion.
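
One caveat: the Cartesian product also creates rows outside each group's observed range (B 1 and B 5 here), whereas the question only asks for the gaps in the middle. A possible sketch for that case, assuming integer C2 values as in the second example, builds the index group by group from each group's own minimum and maximum (the names `bounds` and `per_group_index` are just for illustration):

```python
from io import StringIO

import pandas as pd

datastring = StringIO("""\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30""")
dataframe = pd.read_csv(datastring, index_col=["C1", "C2"])

# Smallest and largest C2 value observed within each C1 group.
bounds = dataframe.reset_index().groupby("C1")["C2"].agg(["min", "max"])

# Enumerate every integer between those bounds, one group at a time.
per_group_index = pd.MultiIndex.from_tuples(
    [
        (group, value)
        for group, low, high in bounds.itertuples()
        for value in range(int(low), int(high) + 1)
    ],
    names=["C1", "C2"],
)

# Only the interior gaps (A 3, A 4, B 3) are added as NaN rows.
new_df = dataframe.reindex(per_group_index)
```

This trades the one-liner for a groupby pass, so it is mainly worth it when the groups cover different ranges.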

edited Dec 8, 2016 at 21:31

answered Nov 8, 2012 at 20:42

Paul H


3 Comments

john_fries Over a year ago

I think you might want something more like this (sorry, comment does not work with newlines): ` import StringIO datastring = StringIO.StringIO("""\ C1,C2,C3,C4 A,1,20,30 A,2,20,30 A,5,20,30 B,2,20,30 B,4,20,30""") df = pd.read_csv(datastring, index_col=['C1', 'C2']) display(df) full_index = pd.MultiIndex.from_product([('A', 'B'), range(6)], names=['C1', 'C2']) display(full_index) new_df = df.reindex(full_index) display(new_df)`

MarredCheese Over a year ago

Awesome answer. (range(6) should be changed to range(1, 6), though.)

MarredCheese Over a year ago

One more thing: In your second example, you should change the C2 column values in the original dataframe to [1, 2, 5...] instead of [A1, A2, A5...]. As is, reindexing leads to a df full of NaNs.
