Operations in multi index dataframe pandas

Question

I need to process geographic and statistical data from a large CSV file. It contains data at both the geographical administrative and geostatistical levels. Municipality, location, geostatistical basic division, and block make up the hierarchical index levels.

I have to create a new column ['data2']: for every block, take the max value of data1 within its geo index, and divide each block value by that max. This applies at each index level, but only where the index level value is different from 0, because rows with a 0 index level value hold other types of info not used in the calculation.

                       data1  data2
mun  loc  geo  block
1    0    0    0       20     20
1    1    0    0       10     10
1    1    1    0       10     10   
1    1    1    1       3      3/4
1    1    1    2       4      4/4
1    1    2    0       30     30   
1    1    2    1       1      1/3
1    1    2    2       3      3/3
1    2    1    1       10     10/12
1    2    1    2       12     12/12
2    1    1    1       123    123/123
2    1    1    2       7      7/123
2    1    2    1       6      6/6
2    1    2    2       1      1/6

Any ideas? I have tried for loops, converting the indexes to columns with reset_index() and iterating over column and row values, but the computation takes forever, and I don't think that is the correct way to do this kind of operation.

Also, what if I want masks like the following, so I can run my calculations at every level?

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     False       
1    1    1    0     True          
1    1    1    1     False        
1    1    1    2     False        
1    1    2    0     True          
1    1    2    1     False        
1    1    2    2     False        

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     True       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
1    2    0    0     True
1    2    2    0     False          
1    2    2    1     False        

mun  loc  geo  block
1    0    0    0     True       
1    1    0    0     False       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
2    0    0    0     True
2    1    1    0     False          
2    1    2    1     False
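
One possible reading of the three masks above is that each marks the totals rows of one level: a row counts as a "geo total" when geo is nonzero and block is 0, and likewise one level up. The sketch below builds such masks under that assumption; the helper name totals_mask is hypothetical, and the index is just the first eight rows of the sample.

```python
import pandas as pd

# First eight index rows of the sample, used only for illustration.
index = pd.MultiIndex.from_tuples(
    [(1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 1, 1),
     (1, 1, 1, 2), (1, 1, 2, 0), (1, 1, 2, 1), (1, 1, 2, 2)],
    names=['mun', 'loc', 'geo', 'block'],
)

def totals_mask(index, level, next_level):
    """Hypothetical helper: True where `level` is nonzero and `next_level` is 0."""
    is_total = (index.get_level_values(level) != 0) & (index.get_level_values(next_level) == 0)
    return pd.Series(is_total, index=index)

geo_totals = totals_mask(index, 'geo', 'block')  # totals per geo
loc_totals = totals_mask(index, 'loc', 'geo')    # totals per loc
mun_totals = totals_mask(index, 'mun', 'loc')    # totals per mun
print(geo_totals)
```

Each mask has the same index as the data, so it can be used directly for boolean indexing.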

So you need to remove the first 4 rows of the DataFrame, because the hierarchical indexes contain 0? And in the first row of df2 it is (0 / max(0,0,7.15,9.85))? And in the second (0 / ???)? Can you add numbers for the second and third rows in the output? Thanks. I think it is a bit unclear.
– jezrael, Commented Oct 13, 2016 at 5:59
Edited for clarity. I don't need to remove those rows, I just don't need to run the operations on them. Also, 0 appears not only at the top but also at the end of each index value, so you have all the blocks of each geo, all the geos of each loc, and all the locs of each municipality, with the 0 indexes referring to the totals by index. I need to apply the max operator to all the blocks of each geo of each loc of each mun, and then divide the data1 value for each block by that max, respecting the hierarchical order.
– marco, Commented Oct 13, 2016 at 6:11
Thank you for the edit. But I think that instead of value, value.. it would be better to give sample data, e.g. 1,2,3,4,5, and then formulas with numbers, such as (1 / 4) for the first row, then (2 / 2)? Can you extend the sample with numbers and some more rows (if necessary) for clarity? Thank you.
– jezrael, Commented Oct 13, 2016 at 6:15
Thank you for the help. More editing has been done. I added some examples of indexes with value 0. Also, the dataframe contains 80 000+ rows of hierarchical combinations. Each index has more elements, but I just put a few for example purposes.
– marco, Commented Oct 13, 2016 at 6:24

jezrael · Accepted Answer · 2016-10-13 07:27:43Z

You can first create a mask from the MultiIndex: compare each level with 0, then use any to check whether at least one comparison is True (at least one level is 0):

mask = (pd.DataFrame(df.index.values.tolist(), index=df.index) == 0).any(axis=1)
print (mask)
mun  loc  geo  block
1    0    0    0         True
     1    0    0         True
          1    0         True
               1        False
               2        False
          2    0         True
               1        False
               2        False
     2    1    1        False
               2        False
2    1    1    1        False
               2        False
          2    1        False
               2        False
dtype: bool
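
For anyone who wants to reproduce this step, here is a minimal sketch that rebuilds the sample frame from the question and computes the same mask. It uses MultiIndex.to_frame() rather than the tolist() round trip shown above; that substitution is my own and not part of the original answer.

```python
import pandas as pd

# Sample rows from the question: (mun, loc, geo, block, data1)
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10),
    (1, 1, 1, 1, 3), (1, 1, 1, 2, 4), (1, 1, 2, 0, 30),
    (1, 1, 2, 1, 1), (1, 1, 2, 2, 3), (1, 2, 1, 1, 10),
    (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
df = (pd.DataFrame(rows, columns=['mun', 'loc', 'geo', 'block', 'data1'])
        .set_index(['mun', 'loc', 'geo', 'block']))

# One column per index level, indexed by the MultiIndex itself;
# a row is flagged when any of its level values is 0.
mask = (df.index.to_frame() == 0).any(axis=1)
print(mask)
```

The result is a boolean Series aligned with df, so ~mask selects only the block rows.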

Then get the max values with groupby on the first, second and third index levels, but first filter with boolean indexing to keep only the rows where the mask is not True:

df1 = df.loc[~mask, 'data1'].groupby(level=['mun','loc','geo']).max()
print (df1)
mun  loc  geo
1    1    1        4
          2        3
     2    1       12
2    1    1      123
          2        6

Then reindex df1 by df.index with its last level removed (reset_index with drop=True), mask the values that should stay unchanged (the mask also needs its last level removed), and fillna with 1, because dividing by 1 returns the same value.

df1 = (df1.reindex(df.reset_index(level=3, drop=True).index)
          .mask(mask.reset_index(level=3, drop=True)).fillna(1))
print (df1)
mun  loc  geo
1    0    0        1.0
     1    0        1.0
          1        1.0
          1        4.0
          1        4.0
          2        1.0
          2        3.0
          2        3.0
     2    1       12.0
          1       12.0
2    1    1      123.0
          1      123.0
          2        6.0
          2        6.0
Name: data1, dtype: float64

Finally, divide with div:

print (df['data1'].div(df1.values,axis=0))
mun  loc  geo  block
1    0    0    0        20.000000
     1    0    0        10.000000
          1    0        10.000000
               1         0.750000
               2         1.000000
          2    0        30.000000
               1         0.333333
               2         1.000000
     2    1    1         0.833333
               2         1.000000
2    1    1    1         1.000000
               2         0.056911
          2    1         1.000000
               2         0.166667
dtype: float64
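
As a hedged alternative to the reindex step, the group maximum can also be broadcast back to every row with groupby(...).transform('max'). This is a sketch under the same assumptions as above (rows with any 0 in the index are left unchanged), not the method used in this answer; the names is_total, group_max and divisor are my own.

```python
import pandas as pd

# Sample rows from the question: (mun, loc, geo, block, data1)
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10),
    (1, 1, 1, 1, 3), (1, 1, 1, 2, 4), (1, 1, 2, 0, 30),
    (1, 1, 2, 1, 1), (1, 1, 2, 2, 3), (1, 2, 1, 1, 10),
    (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
df = (pd.DataFrame(rows, columns=['mun', 'loc', 'geo', 'block', 'data1'])
        .set_index(['mun', 'loc', 'geo', 'block']))

# Rows with a 0 in any index level are skipped by the calculation.
is_total = (df.index.to_frame() == 0).any(axis=1).to_numpy()

# Blank out those rows so they cannot influence the group maximum,
# then broadcast each (mun, loc, geo) maximum back to every row.
blocks = df['data1'].where(~is_total)
group_max = blocks.groupby(level=['mun', 'loc', 'geo']).transform('max')

# Skipped rows divide by 1, which leaves their value unchanged.
divisor = group_max.where(~is_total, 1)
df['data2'] = df['data1'] / divisor
print(df)
```

Because transform keeps the original index, no reindexing on a duplicated index is needed, which may be easier to follow on a frame with 80 000+ rows.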

edited Oct 13, 2016 at 7:27

answered Oct 13, 2016 at 7:20

jezrael


10 Comments

jezrael Over a year ago

Working with MultiIndex is not easy, I hope it works very well.

marco Over a year ago

Very useful, thank you very much. I just need one more answer. What if I want to create a mask that goes: mun loc geo block 1 0 0 0 True 1 0 0 True 1 0 True 1 False 2 False

marco Over a year ago

I screwed the comment sorry. What if I want a mask that goes: mun loc geo block 1 0 0 0 True 1 0 0 True 1 0 True 1 False 2 False 2 0 True 1 False 2 False

jezrael Over a year ago

I am not sure I understand you. You need a mask that has the same size as the DataFrame and the same index if you need boolean indexing. Can you explain more?

jezrael Over a year ago

@marco - Can you edit the question? Formatting in comments is problematic ;)
