
I'd like to drop rows from a pandas DataFrame by their MultiIndex values.

I've tried quite a few things; below is the attempt I think came closest. (I'll explain the full problem, since there may be an alternative solution using a completely different approach.) From a correlation matrix, I'd like to get the pairs of columns that are most strongly correlated. I use unstack and put the results in a DataFrame:

In [263]: corr_df = pd.DataFrame(total.corr().unstack())

Then I select the high correlations (I should really include the strongly negative ones as well).

In [264]: high = corr_df[(corr_df[0] > 0.5) & (corr_df[0] < 1.0)]

In [236]: print high
                                                  0
residual sugar       density               0.552517
free sulfur dioxide  total sulfur dioxide  0.720934
total sulfur dioxide free sulfur dioxide   0.720934
                     wine                  0.700357
density              residual sugar        0.552517
wine                 total sulfur dioxide  0.700357

Close enough, but there are duplicates, which is inherent to a correlation matrix because it is symmetric. To clean them up, my idea is to iterate over the high values and remove the duplicates:

In [267]:
for row in high.iterrows():
    print row[0][0], ",", row[0][1]
    print high.loc[row[0][1]].loc[row[0][0]].index
    high.drop(high.loc[row[0][1]].loc[row[0][0]].index)
residual sugar , density
Int64Index([0], dtype='int64')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-267-1258da2a4772> in <module>()
      2     print row[0][0], ",", row[0][1]
      3     print high.loc[row[0][1]].loc[row[0][0]].index
----> 4     high.drop(high.loc[row[0][1]].loc[row[0][0]].index)

...
[huge stack of errors]
...
KeyError: 0

The drop method works perfectly when the index is a regular one (see drop), but how do I build the label when I have a MultiIndex?

1 Answer

import pandas as pd

corr_df = pd.DataFrame(
    {'residual sugar': [1, 0, 0, 0.552517, 0],
     'free sulfur dioxide': [0, 1, 0.720934, 0, 0],
     'total sulfur dioxide': [0, 0.720934, 1, 0, 0.700357],
     'density': [0.552517, 0, 0, 1, 0],
     'wine': [0, 0, 0.700357, 0, 1]},
    index=['residual sugar', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'wine']).unstack()

# Notice the slight modification to the original: corr_df is a Series here
high = corr_df[(corr_df > 0.5) & (corr_df < 1.0)]

# Sort by index, then by value. Neither call sorts in place, so reassign.
# A stable sort keeps each mirrored pair adjacent and in index order.
high = high.sort_index().sort_values(kind='mergesort')

# Drop every other value (i.e. keep only the even positions)
result = high.iloc[::2]
>>> result
density               residual sugar          0.552517
total sulfur dioxide  wine                    0.700357
free sulfur dioxide   total sulfur dioxide    0.720934
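
To address the literal question as well: each row label of a MultiIndex is a tuple, so drop expects a tuple (or a list of tuples) rather than the result of chained .loc calls. In the question's loop, high.loc[...].loc[...].index holds the column label 0, not a row label, which is why drop raises KeyError: 0. The following is only a sketch of the loop-based idea under that reading; the names seen, to_drop and deduped are mine, and the frame is rebuilt from the values printed in the question.

```python
import pandas as pd

# Rebuild the filtered frame from the question (values copied from its output).
index = pd.MultiIndex.from_tuples([
    ("residual sugar", "density"),
    ("free sulfur dioxide", "total sulfur dioxide"),
    ("total sulfur dioxide", "free sulfur dioxide"),
    ("total sulfur dioxide", "wine"),
    ("density", "residual sugar"),
    ("wine", "total sulfur dioxide"),
])
high = pd.DataFrame(
    {0: [0.552517, 0.720934, 0.720934, 0.700357, 0.552517, 0.700357]},
    index=index,
)

# Mark a label for removal when its mirrored pair has already been seen.
seen = set()
to_drop = []
for first, second in high.index:
    if (second, first) in seen:
        to_drop.append((first, second))
    else:
        seen.add((first, second))

# Row labels are tuples, so pass a list of tuples to drop.
deduped = high.drop(index=to_drop)
print(deduped)
```

Unlike the "take every other row" approach, this does not depend on sort order, and it keeps the first occurrence of each pair in the original row order.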