pandas dataframe drop rows by multiindex

Question

I'd like to drop rows from a pandas dataframe using the MultiIndex value.

I've tried quite a few things; below is the attempt that I think came closest. (I'll explain the full problem, since there may be an alternative solution using a completely different approach.) From a correlation matrix, I'd like to get the pairs of columns that correlate most strongly. I use unstack and put the results in a dataframe:

In [263]: corr_df = pd.DataFrame(total.corr().unstack())

Then I get the high correlations (I should really include the strongly negative ones as well).

In [264]: high = corr_df[(corr_df[0] > 0.5) & (corr_df[0] < 1.0)]

In [236]: print high
                                                  0
residual sugar       density               0.552517
free sulfur dioxide  total sulfur dioxide  0.720934
total sulfur dioxide free sulfur dioxide   0.720934
                     wine                  0.700357
density              residual sugar        0.552517
wine                 total sulfur dioxide  0.700357

Close enough, but there are duplicates: a correlation matrix is symmetric, so each pair appears twice. To clean them up, my idea is to iterate over the high values and remove the duplicates:

In [267]:
for row in high.iterrows():
    print row[0][0], ",", row[0][1]
    print high.loc[row[0][1]].loc[row[0][0]].index
    high.drop(high.loc[row[0][1]].loc[row[0][0]].index)
residual sugar , density
Int64Index([0], dtype='int64')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-267-1258da2a4772> in <module>()
      2     print row[0][0], ",", row[0][1]
      3     print high.loc[row[0][1]].loc[row[0][0]].index
----> 4     high.drop(high.loc[row[0][1]].loc[row[0][0]].index)

...
[huge stack of errors]
...
KeyError: 0

The drop method works perfectly when the index is a regular one (see drop), but how do I build the label when I have a MultiIndex?
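
For reference, the KeyError arises because chaining .loc twice returns the row itself, whose index is the column label 0 rather than a row label. On a MultiIndex, a row label is the full key tuple, one element per level. The sketch below is a minimal illustration, assuming a small hypothetical stand-in for `high` rather than the real wine data:

```python
import pandas as pd

# Hypothetical stand-in for `high`: one column named 0, two-level row index.
index = pd.MultiIndex.from_tuples([
    ("residual sugar", "density"),
    ("density", "residual sugar"),
    ("free sulfur dioxide", "total sulfur dioxide"),
    ("total sulfur dioxide", "free sulfur dioxide"),
])
high = pd.DataFrame({0: [0.552517, 0.552517, 0.720934, 0.720934]}, index=index)

# Each label passed to drop is a tuple holding one value per index level.
# drop returns a new frame; `high` itself is left unchanged.
deduped = high.drop([
    ("density", "residual sugar"),
    ("total sulfur dioxide", "free sulfur dioxide"),
])
print(deduped)
```

Because drop returns a new object, the result has to be assigned (or the call repeated with inplace=True) for the rows to actually disappear.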

Alexander · Accepted Answer · 2015-04-08 17:37:25Z

import pandas as pd

corr_df = pd.DataFrame(
    {'residual sugar': [1, 0, 0, 0.552517, 0],
     'free sulfur dioxide': [0, 1, 0.720934, 0, 0],
     'total sulfur dioxide': [0, 0.720934, 1, 0, 0.700357],
     'density': [0.552517, 0, 0, 1, 0],
     'wine': [0, 0, 0.700357, 0, 1]},
    index=['residual sugar', 'free sulfur dioxide', 'total sulfur dioxide',
           'density', 'wine']).unstack()

# Notice the slight modification to the original: corr_df is a Series here,
# so the filter compares it directly instead of selecting column 0.
high = corr_df[(corr_df > 0.5) & (corr_df < 1.0)]

# Sort by index, then by value. Both methods return a new Series, so the
# result must be assigned; the stable sort keeps ties in index order.
high = high.sort_index().sort_values(kind='mergesort')

# Drop every other value (keep positions 0, 2, 4, ...). This assumes each
# correlation value appears exactly twice, once per mirrored pair.
result = high.iloc[::2]
>>> result
density               residual sugar          0.552517
total sulfur dioxide  wine                    0.700357
free sulfur dioxide   total sulfur dioxide    0.720934
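
Taking every other row depends on each value occurring exactly twice. As an alternative sketch, not part of the approach above, one row per pair can be kept by comparing the two index levels directly, and the absolute value brings in strong negative correlations as well. The small matrix and the names `corr`, `pairs`, `strong` and `unique_pairs` are made up for illustration:

```python
import pandas as pd

# Hypothetical symmetric correlation matrix with one strong negative pair.
corr = pd.DataFrame(
    {"a": [1.0, 0.8, -0.6],
     "b": [0.8, 1.0, 0.1],
     "c": [-0.6, 0.1, 1.0]},
    index=["a", "b", "c"],
)

pairs = corr.unstack()  # Series indexed by (column, row)

# Keep strong correlations of either sign, excluding the diagonal of 1.0.
strong = pairs[(pairs.abs() > 0.5) & (pairs.abs() < 1.0)]

# Of each mirrored pair, keep only the row whose first label sorts
# before its second label.
first = strong.index.get_level_values(0)
second = strong.index.get_level_values(1)
unique_pairs = strong[first < second]
print(unique_pairs)
```

This does not rely on sort order or on how many pairs share the same correlation value.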

edited Apr 8, 2015 at 17:37

answered Apr 8, 2015 at 17:31
