
I am trying to identify the index of a column name based on a matching regex condition. However, df.columns produces an object of type pandas.core.indexes.base.Index, which does not have an .index method. I want the index so that I can slice the df to get rid of columns that I don't need.

Here is a worked example:

import pandas as pd

# create a df with column names
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df.columns
> Index(['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'], dtype='object')

I want to get rid of all the metadata columns.

On a series I would try something like:

df.columns[df.columns.str.contains('meta')].index[0]
> ValueError: Cannot mask with non-boolean array containing NA / NaN values

So I try converting with .astype('str'):

df.columns.astype('str')[df.columns.astype('str').str.contains('meta')].index[0]
> AttributeError: 'Index' object has no attribute 'index'

So my Index has no .index method, and I am left to convert to a list and enumerate with an re condition:

import re

[i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)]
> [4, 5]

This works so I can do the following:

cutoff = [i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)][0]
df = df.iloc[:,:cutoff]

This, however, seems extraordinarily laborious for such a menial task. In R this would be as simple as:

cutoff <- min(grep('meta', colnames(df))) - 1 #-1 to address non-zero indexing
df <- df[, seq(1, cutoff)]

Is there no easier way to do this in pandas, other than to 1) convert to string, 2) convert to list, 3) enumerate list? Essentially I would have thought there was an equivalent of the min(grep('meta', colnames(df))) - 1 line.
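For reference, a close pandas equivalent of the R one-liner is possible; this is a minimal sketch (assuming numpy is available) in which np.flatnonzero plays the role of grep() and astype(str) sidesteps the NaN issue caused by the integer column labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# positions of all labels containing 'meta'; the first one is the cutoff
cutoff = np.flatnonzero(df.columns.astype(str).str.contains('meta'))[0]

# keep everything before the first 'meta' column
df = df.iloc[:, :cutoff]
```

Here cutoff is 4, and df retains only the Country and year columns.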

  • What is the expected output, please? It will be easier to get to the answer if the expected output is clearly articulated. Commented Mar 19, 2021 at 1:01
  • Sorry, I do have the output there, since I get the answer, just in a laborious manner. The ideal output is: 4 ([4, 5][0] or min([4, 5])) Commented Mar 19, 2021 at 1:07
  • For the string contains, set na to False; that should return the index for you: df.columns[df.columns.str.contains('meta', na = False)]. To get rid of the meta, you can use a ~. df.loc[:, ~df.columns.str.contains('meta', na = False)] Commented Mar 19, 2021 at 1:22
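A runnable sketch of the suggestion in the comment above; na=False tells str.contains to treat the non-string (integer) labels as non-matches instead of producing NaN:

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# columns whose label contains 'meta'
meta_cols = df.columns[df.columns.str.contains('meta', na=False)]

# drop them by negating the same mask
cleaned = df.loc[:, ~df.columns.str.contains('meta', na=False)]
```

meta_cols holds the two metadata labels, and cleaned keeps only the remaining columns.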

2 Answers


You can combine .drop() and .filter()

>>> df.filter(like='meta')
  metadata1 metadat2
0         e        f
1         k        l
>>> df.drop(columns=df.filter(like='meta'))
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j

You can also use regex= to find all columns without meta

>>> df.filter(regex='^(?:(?!meta).)+$')
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j

1 Comment

Thanks for this. It works on the example I provided, but imagine that I am importing a variety of datasets and cleaning them, and that there are all sorts of columns after the first one containing 'meta'. Imagine it's so bad that using regex or like in a way that doesn't exclude 'Country' becomes extremely challenging. For this reason, I just want to find the first index matching 'meta' (which is always the first thing I want to get rid of) and slice so as to get rid of everything after that. Nonetheless, I can use the approach you use here in other cases.
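For the use case described in this comment (slice off everything from the first 'meta' column onward), a minimal sketch is to take argmax of the boolean mask, which returns the position of the first True; the mask.any() guard is needed because argmax on an all-False mask would misleadingly return 0:

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

mask = df.columns.astype(str).str.contains('meta')
if mask.any():
    # position of the first True in the mask = first 'meta' column
    first_meta = mask.argmax()
    df = df.iloc[:, :first_meta]
```

With the example columns, first_meta is 4 and df keeps the first four columns.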

Here is a solution that may fulfill your needs:

import pandas as pd

df = pd.DataFrame(columns = ['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df_ = df.columns.to_frame(index=False, name='index')
matched = df_.loc[
    df_['index'].str.contains(r'metadata\d+|metadat\d+', na=False)
].index.values

print(matched)

Output:

array([4, 5])

You could also use the get_indexer method, to get the index positions:

df.columns.get_indexer(df.columns[df.columns.str.contains('meta', na = False)])
array([4, 5])
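Since get_indexer returns positional indices, its first element can serve directly as the OP's cutoff; a sketch combining it with the slice:

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# positional indices of all columns whose label contains 'meta'
positions = df.columns.get_indexer(
    df.columns[df.columns.str.contains('meta', na=False)]
)

# keep everything before the first match
df = df.iloc[:, :positions[0]]
```

This mirrors the R min(grep(...)) idiom: positions[0] is the first matching position.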

If you just want to filter out the meta columns, you can use boolean indexing in loc:

df.loc[:, ~df.columns.str.contains('meta', na = False)]
 
Empty DataFrame
Columns: [Country, 2010, 2011, 2012]
Index: []

9 Comments

set na to False in string contains; it removes the need for the replace function
@sammywemmy Yes you're right. I've updated my answer. Thanks.
I think ultimately the user wants to filter out the meta, so maybe reconstruct your answer to fit that. Also, your code could be reworked, without converting to a frame and using loc; you can use pandas index methods : df.columns.get_indexer(df.columns[df.columns.str.contains('meta', na = False)]). Of course this is unnecessary, since you can just select the columns with boolean indexing: df.loc[:, ~df.columns.str.contains('meta', na = False)]
@sammywemmy Thanks for your comment. Your answer is much shorter than mine! However, what I understood from the OP is that he/she is searching for column indexes in the first place, and I don't know what he/she will do after that.
@ChihebNexus, I chose instead to add it to your answer; hope that is ok with you
