I have a dataframe with a large MultiIndex, sourced from a vast number of csv files. Some of those files have errors in the various labels, e.g. "window" is misspelled as "winZZw", which then causes problems when I select all windows with df.xs('window', level='middle', axis=1).

So I need a way to simply replace winZZw with window.

Here's a very minimal sample df (let's assume the data and the 'roof', 'window'… strings come from some convoluted text reader):

header = pd.MultiIndex.from_product(['roof', 'window', 'basement'], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product(['roof', 'winZZw', 'basement'], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product(['roof', 'door', 'basement'], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)

Now I want to xs a new dataframe for all the houses that have a window at their middle level: windf = df.xs('window', level='middle', axis=1)

But this obviously misses the misspelled winZZw.

So, how do I replace winZZw with window?

The only way I found was to use set_levels, but if I understood that correctly, I need to feed it the whole level, i.e.

df.columns.set_levels([u'window',u'window', u'door'], level='middle',inplace=True)

but this has two issues:

  • I need to pass it the whole level, which is easy in this sample, but impractical for a thousand-column df with hundreds of labels.
  • It seems to need the list backwards (now the first entry in my df has door in the middle, instead of the window it had). That can probably be fixed, but it seems weird.
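The "backwards" order in the second point most likely comes from how a MultiIndex stores its labels: each level keeps every distinct label once, in sorted order, and the columns refer to it by position. The sketch below illustrates this on a hypothetical three-column header built with from_tuples (the variable names are made up for illustration), without calling set_levels itself.

```python
import pandas as pd

# Hypothetical three-column header mirroring the sample above.
cols = pd.MultiIndex.from_tuples(
    [
        ("roof", "window", "basement"),
        ("roof", "winZZw", "basement"),
        ("roof", "door", "basement"),
    ],
    names=["top", "middle", "bottom"],
)

# The level stores each distinct label once, sorted
# (uppercase "Z" sorts before lowercase "d", so "winZZw" < "window").
middle_level = list(cols.levels[1])

# The per-column labels, in column order.
middle_labels = list(cols.get_level_values("middle"))

print(middle_level)
print(middle_labels)
```

A list passed to set_levels is matched against the sorted level, not against the column order, so ['window', 'window', 'door'] would map door to window, winZZw to window, and window to door.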

I can work around these issues by using xs to extract a new df containing only the winZZw columns, setting its levels with set_levels(df.shape[1]*[u'window'], level='middle'), and then concatenating everything back together. But I'd like something more straightforward, analogous to str.replace('winZZw', 'window'), and I can't figure out how.

  • It seems the code contains an error; please check it first. MultiIndex.from_product needs a list of lists as input. Commented Jul 11, 2018 at 8:40
  • In @jezrael's answer, he changed ['roof', 'window', 'basement'] to [['roof'],[ 'window'], ['basement']] to make it work. So perhaps you are using a pandas version that is too old. Commented Jul 11, 2018 at 8:57
  • Yep, that probably is the issue. Commented Jul 11, 2018 at 8:58

2 Answers

Use rename and specify the level:

import numpy as np
import pandas as pd

header = pd.MultiIndex.from_product([['roof'],[ 'window'], ['basement']], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)

df = df.rename(columns={'winZZw':'window'}, level='middle')
print(df.head())

top             roof                    
middle        window                door
bottom      basement  basement  basement
2000-01-01 -0.131052 -1.189049  1.310137
2000-02-01 -0.200646  1.893930  2.124765
2000-03-01 -1.690123 -2.128965  1.639439
2000-04-01 -0.794418  0.605021 -2.810978
2000-05-01  1.528002 -0.286614  0.736445
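
As a rough check that the rename achieves the original goal, the sketch below builds a small three-column frame (the data values and the four-row date range are made up for illustration), renames the misspelled label, and then takes the cross-section. It assumes a pandas version whose rename accepts the level keyword.

```python
import numpy as np
import pandas as pd

# Made-up data: four monthly rows, three columns.
dates = pd.date_range("2000-01-01", periods=4, freq="MS")
cols = pd.MultiIndex.from_tuples(
    [
        ("roof", "window", "basement"),
        ("roof", "winZZw", "basement"),
        ("roof", "door", "basement"),
    ],
    names=["top", "middle", "bottom"],
)
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), index=dates, columns=cols)

# Before the fix, only the correctly spelled column is selected.
before = df.xs("window", level="middle", axis=1)

# Rename only within the 'middle' level, then select again.
fixed = df.rename(columns={"winZZw": "window"}, level="middle")
windf = fixed.xs("window", level="middle", axis=1)

print(before.shape[1], windf.shape[1])
```

Both renamed columns now share the same key, so xs returns them together.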

3 Comments

I admit, I forgot to test it with the sample, since it failed with TypeError: rename() got an unexpected keyword argument "level" on my real data, which seemed to indicate that rename simply can't work on the index.
Damn, still on 0.18. I really should not have started working with something that is not stable…

A more general solution to replace a string within a MultiIndex is the following:

df.columns = pd.MultiIndex.from_tuples([tuple(x.replace("to_replace", "new_str") for x in tuple_index) for tuple_index in df.columns], names=df.columns.names)
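
The one-liner above applies str.replace to every level and assumes every label is a string. A minimal sketch of a variant that touches only one level, skips non-string labels, and passes the level names through explicitly could look like this. The helper name replace_in_level is hypothetical, not a pandas function, and the approach does not rely on the level keyword of rename.

```python
import pandas as pd


def replace_in_level(index, level, old, new):
    """Return a new MultiIndex with `old` replaced by `new` in the string
    labels of one named level (hypothetical helper, not part of pandas)."""
    pos = list(index.names).index(level)
    tuples = [
        tuple(
            label.replace(old, new) if i == pos and isinstance(label, str) else label
            for i, label in enumerate(tup)
        )
        for tup in index
    ]
    # Pass the names through so later lookups by level name still work.
    return pd.MultiIndex.from_tuples(tuples, names=index.names)


cols = pd.MultiIndex.from_tuples(
    [
        ("roof", "window", "basement"),
        ("roof", "winZZw", "basement"),
        ("roof", "door", "basement"),
    ],
    names=["top", "middle", "bottom"],
)
fixed_cols = replace_in_level(cols, "middle", "winZZw", "window")
print(list(fixed_cols.get_level_values("middle")))
```

On a dataframe this would be used as df.columns = replace_in_level(df.columns, 'middle', 'winZZw', 'window').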
