
I need to process geographic and statistical data from a large CSV file. It contains data at administrative and geostatistical levels. Municipality, location, basic geostatistical division (geo) and block make up the hierarchical index.

I have to create a new column ['data2'] that holds, for every block, its data1 value divided by the max data1 value among the blocks of the same geo index. The grouping must respect every index level, and only rows whose index level values are all different from 0 take part, because a 0 index level value marks other kinds of info (totals) not used in the calculation.

                       data1  data2
mun  loc  geo  block
1    0    0    0       20     20
1    1    0    0       10     10
1    1    1    0       10     10   
1    1    1    1       3      3/4
1    1    1    2       4      4/4
1    1    2    0       30     30   
1    1    2    1       1      1/3
1    1    2    2       3      3/3
1    2    1    1       10     10/12
1    2    1    2       12     12/12
2    1    1    1       123    123/123
2    1    1    2       7      7/123
2    1    2    1       6      6/6
2    1    2    2       1      1/6

Any ideas? I have tried for loops, converting the indexes into columns with reset_index() and iterating over column and row values, but the computation takes forever and I don't think that is the right way to do this kind of operation.

Also, what if I want to get masks like these, so I can run my calculations at every level?

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     False       
1    1    1    0     True          
1    1    1    1     False        
1    1    1    2     False        
1    1    2    0     True          
1    1    2    1     False        
1    1    2    2     False        

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     True       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
1    2    0    0     True
1    2    2    0     False          
1    2    2    1     False        

mun  loc  geo  block
1    0    0    0     True       
1    1    0    0     False       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
2    0    0    0     True
2    1    1    0     False          
2    1    2    1     False   
  • So you need to remove the first 4 rows of the DataFrame, because their hierarchical index values are 0? And the first row of df2 is (0 / max(0,0,7.15,9.85))? And the second is (0 / ???)? Can you add numbers for the second and third rows of the output? Thanks. I think it is a bit unclear. Commented Oct 13, 2016 at 5:59
  • Edited for clarity. I don't need to remove those rows, I just don't need to run the operations on them. Also, 0 appears not only at the top but at the end of each index value too, so you have all the blocks of a geo, all geos of a loc, and all locs of a municipality, with the 0 indexes referring to the totals by index. I need to run the max operator over all the blocks of each geo of each loc of each mun and then divide the data1 value for that block by the max, respecting the hierarchical order. Commented Oct 13, 2016 at 6:11
  • Thank you for the edit. But I think that instead of value, value.. it would be better to give sample data, e.g. 1,2,3,4,5, and then formulas with numbers, such as (1 / 4) for the first row, then (2 / 2)? Can you extend the sample with numbers and some rows (if necessary) for clarity? Thank you. Commented Oct 13, 2016 at 6:15
  • Thank you for the help. More editing has been done. I added some examples of indexes with value 0. Also, the dataframe contains 80 000+ rows of hierarchical combinations. Each index has more elements, but I just put a few for example purposes. Commented Oct 13, 2016 at 6:24

1 Answer


You can first create a mask from the MultiIndex: compare every level with 0 and use any to check whether at least one level is 0 in each row:

mask = (pd.DataFrame(df.index.values.tolist(), index=df.index) == 0).any(axis=1)
print (mask)
mun  loc  geo  block
1    0    0    0         True
     1    0    0         True
          1    0         True
               1        False
               2        False
          2    0         True
               1        False
               2        False
     2    1    1        False
               2        False
2    1    1    1        False
               2        False
          2    1        False
               2        False
dtype: bool
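
For the per-level masks asked about at the end of the question, one possible sketch (not part of the original answer) is to turn the index levels into columns with MultiIndex.to_frame and combine comparisons. The small df below is rebuilt from the first eight sample rows, and the names levels, geo_total, loc_total and mun_total are illustrative assumptions. It also assumes that a total row at a given level is one whose deeper levels are all 0:

```python
import pandas as pd

# First eight rows of the sample data from the question.
idx = pd.MultiIndex.from_tuples(
    [(1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 1, 1),
     (1, 1, 1, 2), (1, 1, 2, 0), (1, 1, 2, 1), (1, 1, 2, 2)],
    names=['mun', 'loc', 'geo', 'block'])
df = pd.DataFrame({'data1': [20, 10, 10, 3, 4, 30, 1, 3]}, index=idx)

# One column per index level, in the same row order as df.
levels = df.index.to_frame(index=False)

# geo totals: geo is set, block is 0.
geo_total = ((levels['geo'] != 0) & (levels['block'] == 0)).to_numpy()
# loc totals: loc is set, geo is 0.
loc_total = ((levels['loc'] != 0) & (levels['geo'] == 0)).to_numpy()
# mun totals: loc is 0.
mun_total = (levels['loc'] == 0).to_numpy()

print(df[geo_total])
```

Each mask is a plain boolean array with one entry per row, so it can be used directly in df[...] or df.loc[...].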

Then get the max values with groupby on the first, second and third index levels, but first filter with boolean indexing to keep only the rows where mask is not True:

df1 = df.loc[~mask, 'data1'].groupby(level=['mun','loc','geo']).max()
print (df1)
mun  loc  geo
1    1    1        4
          2        3
     2    1       12
2    1    1      123
          2        6

Then reindex df1 by df.index with the last level of the MultiIndex removed via reset_index, mask the values at the rows flagged by mask (its last level has to be removed too) and fillna with 1, because dividing by 1 returns the same value.

df1 = (df1.reindex(df.reset_index(level=3, drop=True).index)
          .mask(mask.reset_index(level=3, drop=True))
          .fillna(1))
print (df1)
mun  loc  geo
1    0    0        1.0
     1    0        1.0
          1        1.0
          1        4.0
          1        4.0
          2        1.0
          2        3.0
          2        3.0
     2    1       12.0
          1       12.0
2    1    1      123.0
          1      123.0
          2        6.0
          2        6.0
Name: data1, dtype: float64

Finally, divide with div:

print (df['data1'].div(df1.values,axis=0))
mun  loc  geo  block
1    0    0    0        20.000000
     1    0    0        10.000000
          1    0        10.000000
               1         0.750000
               2         1.000000
          2    0        30.000000
               1         0.333333
               2         1.000000
     2    1    1         0.833333
               2         1.000000
2    1    1    1         1.000000
               2         0.056911
          2    1         1.000000
               2         0.166667
dtype: float64
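
As an alternative to the reindex step, the same result can probably be reached with groupby(...).transform('max'), which returns the group max already aligned to each block row. This is a minimal sketch under the assumption that the frame looks like the sample in the question; the names rows, is_total, blocks and group_max are illustrative and not from the original answer:

```python
import pandas as pd

# Sample data from the question: (mun, loc, geo, block, data1).
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10), (1, 1, 1, 1, 3),
    (1, 1, 1, 2, 4), (1, 1, 2, 0, 30), (1, 1, 2, 1, 1), (1, 1, 2, 2, 3),
    (1, 2, 1, 1, 10), (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
df = (pd.DataFrame(rows, columns=['mun', 'loc', 'geo', 'block', 'data1'])
        .set_index(['mun', 'loc', 'geo', 'block']))

# True where any index level is 0 (total rows excluded from the max).
is_total = (df.index.to_frame(index=False) == 0).any(axis=1).to_numpy()

# Block rows only, and the max of their (mun, loc, geo) group, row by row.
blocks = df.loc[~is_total, 'data1']
group_max = blocks.groupby(level=['mun', 'loc', 'geo']).transform('max')

# Total rows keep data1 unchanged; block rows get data1 / group max.
df['data2'] = df['data1'].astype(float)
df.loc[~is_total, 'data2'] = (blocks / group_max).to_numpy()

print(df)
```

Because transform keeps the shape of its input, no reindex or fillna is needed, and the work stays vectorized for a frame with 80 000+ rows.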

10 Comments

Working with MultiIndex is not easy, I hope it works very well.
Very useful thank you very much. I just need one more answer. What if I want to create a mask that goes: mun loc geo block 1 0 0 0 True1 0 0 True 1 0 True 1 False 2 False
I screwed the comment sorry. What if I want a mask that goes: mun loc geo block 1 0 0 0 True 1 0 0 True 1 0 True 1 False 2 False 2 0 True 1 False 2 False
I am not sure I understand you. For boolean indexing you need a mask that has the same size and the same index as the DataFrame. Can you explain more?
@marco - Can you edit the question? Formatting in comments is problematic ;)