
I am trying to identify the index of a column name based on a matching regex condition. However, df.columns produces an object of type pandas.core.indexes.base.Index, which does not have an .index method. I want the index so that I can slice the df to get rid of columns that I don't need.

Here is a worked example:

import pandas as pd

# create a df with column names
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df.columns
> Index(['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'], dtype='object')

I want to get rid of all the metadata columns.

On a series I would try something like:

df.columns[df.columns.str.contains('meta')].index[0]
> ValueError: Cannot mask with non-boolean array containing NA / NaN values

So I try converting with .astype('str'):

df.columns.astype('str')[df.columns.astype('str').str.contains('meta')].index[0]
> AttributeError: 'Index' object has no attribute 'index'

So my Index has no .index method, and I am left to convert to a list and enumerate with an re condition:

import re

[i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)]
> [4, 5]

This works so I can do the following:

cutoff = [i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)][0]
df = df.iloc[:,:cutoff]

This, however, seems extraordinarily laborious for such a menial task. In R this would be as simple as:

cutoff <- min(grep('meta', colnames(df))) - 1 #-1 to address non-zero indexing
df <- df[, seq(1, cutoff)]

Is there no easier way to do this in pandas, other than to 1) convert to string, 2) convert to list, 3) enumerate list? Essentially I would have thought there was an equivalent of the min(grep('meta', colnames(df))) - 1 line.
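For reference, a close pandas equivalent of the R one-liner is possible; this is a minimal sketch (assuming numpy is available) in which np.flatnonzero plays the role of grep() and astype(str) sidesteps the NaN issue caused by the integer column labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# positions of all labels containing 'meta'; the first one is the cutoff
cutoff = np.flatnonzero(df.columns.astype(str).str.contains('meta'))[0]

# keep everything before the first 'meta' column
df = df.iloc[:, :cutoff]
```

Here cutoff is 4, and df retains only the Country and year columns.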

  • What is the expected output, please? It will be easier to get to the answer if the expected output is clearly articulated. Commented Mar 19, 2021 at 1:01
  • Sorry, I do have the output there, since I get the answer, just in a laborious manner. The ideal output is: 4 ([4, 5][0] or min([4, 5])) Commented Mar 19, 2021 at 1:07
  • For the string contains, set na to False; that should return the index for you: df.columns[df.columns.str.contains('meta', na = False)]. To get rid of the meta, you can use a ~. df.loc[:, ~df.columns.str.contains('meta', na = False)] Commented Mar 19, 2021 at 1:22
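A runnable sketch of the suggestion in the comment above; na=False tells str.contains to treat the non-string (integer) labels as non-matches instead of producing NaN:

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# columns whose label contains 'meta'
meta_cols = df.columns[df.columns.str.contains('meta', na=False)]

# drop them by negating the same mask
cleaned = df.loc[:, ~df.columns.str.contains('meta', na=False)]
```

meta_cols holds the two metadata labels, and cleaned keeps only the remaining columns.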

2 Answers


You can combine .drop() and .filter()

>>> df.filter(like='meta')
  metadata1 metadat2
0         e        f
1         k        l
>>> df.drop(columns=df.filter(like='meta'))
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j

You can also use regex= to find all columns without meta

>>> df.filter(regex='^(?:(?!meta).)+$')
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j

1 Comment

Thanks for this. It works on the example I provided, but imagine that I am importing a variety of datasets and cleaning them, and that there are all sorts of columns after the first one containing 'meta'. Imagine it's so bad that using regex or like in a way that doesn't exclude 'Country' becomes extremely challenging. For this reason, I just want to find the first index matching 'meta' (which is always the first thing I want to get rid of) and slice so as to get rid of everything after that. Nonetheless, I can use the approach you use here in other cases.
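For the use case described in this comment (slice off everything from the first 'meta' column onward), a minimal sketch is to take argmax of the boolean mask, which returns the position of the first True; the mask.any() guard is needed because argmax on an all-False mask would misleadingly return 0:

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

mask = df.columns.astype(str).str.contains('meta')
if mask.any():
    # position of the first True in the mask = first 'meta' column
    first_meta = mask.argmax()
    df = df.iloc[:, :first_meta]
```

With the example columns, first_meta is 4 and df keeps the first four columns.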

Here is a solution that may fulfill your needs:

import pandas as pd

df = pd.DataFrame(columns = ['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df_ = df.columns.to_frame(index=False, name='index')
matched = df_.loc[
    df_['index'].str.contains(r'metadata\d+|metadat\d+', na=False)
].index.values

print(matched)

Output:

array([4, 5])

You could also use the get_indexer method, to get the index positions:

df.columns.get_indexer(df.columns[df.columns.str.contains('meta', na = False)])
array([4, 5])
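Since get_indexer returns positional indices, its first element can serve directly as the OP's cutoff; a sketch combining it with the slice:

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# positional indices of all columns whose label contains 'meta'
positions = df.columns.get_indexer(
    df.columns[df.columns.str.contains('meta', na=False)]
)

# keep everything before the first match
df = df.iloc[:, :positions[0]]
```

This mirrors the R min(grep(...)) idiom: positions[0] is the first matching position.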

If you just want to filter out the meta columns, you can use boolean indexing in loc:

df.loc[:, ~df.columns.str.contains('meta', na = False)]
 
Empty DataFrame
Columns: [Country, 2010, 2011, 2012]
Index: []

9 Comments

set na to False in string contains; it removes the need for the replace function
@sammywemmy Yes you're right. I've updated my answer. Thanks.
I think ultimately the user wants to filter out the meta, so maybe reconstruct your answer to fit that. Also, your code could be reworked, without converting to a frame and using loc; you can use pandas index methods : df.columns.get_indexer(df.columns[df.columns.str.contains('meta', na = False)]). Of course this is unnecessary, since you can just select the columns with boolean indexing: df.loc[:, ~df.columns.str.contains('meta', na = False)]
@sammywemmy Thanks for your comment. Your answer is much shorter than mine! However, what I understood from the OP is that he/she is searching for column indexes in the first place, and I don't know what he/she will do after that.
@ChihebNexus, I chose instead to add it to your answer; hope that is ok with you
