I have a pandas DataFrame 'df' like this:

         X   Y  
IX1 IX2
A   A1  20  30
    A2  20  30
    A5  20  30
B   B2  20  30
    B4  20  30

Some rows are missing, and I want to fill in the gaps in the middle, like this:

         X   Y  
IX1 IX2
A   A1  20  30
    A2  20  30
    A3  NaN NaN
    A4  NaN NaN
    A5  20  30
B   B2  20  30
    B3  NaN NaN
    B4  20  30

Is there a Pythonic way to do this?

  • How would you do it your way? Commented Sep 12, 2012 at 14:28
  • For reference, in numpy there's something called a masked array to handle cases like this. Commented Sep 12, 2012 at 15:58
  • I plan to use 'df.reindex(index=index_mask)', but I have not figured out how to build 'index_mask' efficiently. Commented Sep 13, 2012 at 13:26

1 Answer

You need to construct the full index you want, and then pass it to the dataframe's reindex method. Like so:

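import io
import pandas

# Sample data with gaps: rows A3, A4, B1 and B3 are absent.
datastring = io.StringIO("""\
C1,C2,C3,C4
A,A1,20,30
A,A2,20,30
A,A5,20,30
B,B2,20,30
B,B4,20,30""")

# Use the first two columns as a two-level row index.
dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])

# Spell out every (C1, C2) pair the result should contain.
full_index = [('A', 'A1'), ('A', 'A2'), ('A', 'A3'),
              ('A', 'A4'), ('A', 'A5'), ('B', 'B1'),
              ('B', 'B2'), ('B', 'B3'), ('B', 'B4')]

# Pairs that are not in the original data come back as NaN rows.
new_df = dataframe.reindex(full_index)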
new_df
      C3  C4
A A1  20  30
  A2  20  30
  A3 NaN NaN
  A4 NaN NaN
  A5  20  30
B B1 NaN NaN
  B2  20  30
  B3 NaN NaN
  B4  20  30

And then you can use the fillna method to set the NaNs to whatever you want.
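As a rough sketch of that last step, the block below rebuilds the same sample frame and fills the gaps three ways. The fill values (0 and -1) and the variable names are arbitrary choices made for illustration, not something the answer prescribes.

```python
from io import StringIO

import pandas as pd

csv_text = """\
C1,C2,C3,C4
A,A1,20,30
A,A2,20,30
A,A5,20,30
B,B2,20,30
B,B4,20,30"""

df = pd.read_csv(StringIO(csv_text), index_col=["C1", "C2"])

# Same full index as in the answer, built as a named MultiIndex.
full_index = pd.MultiIndex.from_tuples(
    [("A", "A1"), ("A", "A2"), ("A", "A3"), ("A", "A4"), ("A", "A5"),
     ("B", "B1"), ("B", "B2"), ("B", "B3"), ("B", "B4")],
    names=["C1", "C2"],
)
reindexed = df.reindex(full_index)

# Replace every NaN with one placeholder value (0 is an arbitrary choice).
filled_zero = reindexed.fillna(0)

# Or pass a dict to give each column its own fill value.
filled_per_column = reindexed.fillna({"C3": -1, "C4": 0})

# reindex also accepts fill_value, which skips the intermediate NaN step.
filled_direct = df.reindex(full_index, fill_value=0)
```

The fill_value form can be handy when you already know the placeholder up front, while fillna is more flexible if different columns need different defaults.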

Update (June 2014)

Just had to revisit this myself. Current versions of pandas provide MultiIndex.from_product, which builds a MultiIndex from the Cartesian product of iterables. With the second index level switched to integers, the solution above could become:

import io
import pandas

# Same data as before, but C2 now holds plain integers.
datastring = io.StringIO("""\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30""")

dataframe = pandas.read_csv(datastring, index_col=['C1', 'C2'])
# Every combination of C1 in ('A', 'B') and C2 in 1..5.
full_index = pandas.MultiIndex.from_product([('A', 'B'), range(1, 6)], names=['C1', 'C2'])
new_df = dataframe.reindex(full_index)
new_df
      C3  C4
C1 C2
 A  1  20  30
    2  20  30
    3 NaN NaN
    4 NaN NaN
    5  20  30
 B  1 NaN NaN
    2  20  30
    3 NaN NaN
    4  20  30
    5 NaN NaN

Pretty elegant, in my opinion.
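Note that the Cartesian product also adds leading and trailing rows such as (B, 1) and (B, 5), whereas the question only asked for the gaps in the middle of each group. One possible way to get just those, sketched under the assumption that C2 holds integers as in the update, is to build the index from each group's own minimum and maximum. The names pairs, gap_index and gap_filled are made up for this example.

```python
from io import StringIO

import pandas as pd

csv_text = """\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30"""

df = pd.read_csv(StringIO(csv_text), index_col=["C1", "C2"])

# Turn the index levels into ordinary columns so they can be grouped.
levels = df.index.to_frame(index=False)

# For each C1 group, list every integer between the smallest and the
# largest C2 value that actually occurs in that group.
pairs = [
    (group, value)
    for group, seen in levels.groupby("C1")["C2"]
    for value in range(int(seen.min()), int(seen.max()) + 1)
]

gap_index = pd.MultiIndex.from_tuples(pairs, names=["C1", "C2"])

# Only the interior gaps (A3, A4 and B3) show up as NaN rows.
gap_filled = df.reindex(gap_index)
```

This keeps the reindex approach from the answer and only changes how the target index is built, so fillna can still be applied afterwards in the same way.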

3 Comments

I think you might want something more like this:
import StringIO
datastring = StringIO.StringIO("""\
C1,C2,C3,C4
A,1,20,30
A,2,20,30
A,5,20,30
B,2,20,30
B,4,20,30""")
df = pd.read_csv(datastring, index_col=['C1', 'C2'])
display(df)
full_index = pd.MultiIndex.from_product([('A', 'B'), range(6)], names=['C1', 'C2'])
display(full_index)
new_df = df.reindex(full_index)
display(new_df)
Awesome answer. (range(6) should be changed to range(1, 6) though.)
One more thing: In your second example, you should change the C2 column values in the original dataframe to [1, 2, 5...] instead of [A1, A2, A5...]. As is, reindexing leads to a df full of NaNs.
