I am trying to identify the positional index of a column name based on a matching regex condition. However, df.columns produces an object of type pandas.core.indexes.base.Index, which does not expose positional indexes. I want the position so that I can slice the df to get rid of columns that I don't need.
Here is a worked example:
import pandas as pd

#create a df with column names
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])
df.columns
> Index(['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'], dtype='object')
I want to get rid of all the metadata columns.
On a Series I would try something like:
df.columns[df.columns.str.contains('meta')].index[0]
> ValueError: Cannot mask with non-boolean array containing NA / NaN values
So I try changing with .astype('str'):
df.columns.astype('str')[df.columns.astype('str').str.contains('meta')].index[0]
> AttributeError: 'Index' object has no attribute 'index'
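For what it's worth, that AttributeError arises because boolean-masking an Index returns another Index of the matching labels, not their positions, and a pandas Index has no .index attribute. A minimal sketch (rebuilding the same example frame) showing that numpy can pull the positions directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])
cols = df.columns.astype('str')

# Masking an Index returns an Index of the matching labels...
labels = cols[cols.str.contains('meta')]

# ...while np.where on the boolean mask returns the positions themselves.
positions = np.where(cols.str.contains('meta'))[0]

print(list(labels))          # ['metadata1', 'metadat2']
print(positions.tolist())    # [4, 5]
```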
So an Index has no .index method, and I am left converting to a list and enumerating with a re condition:
import re

[i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)]
> [4, 5]
This works so I can do the following:
cutoff = [i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)][0]
df = df.iloc[:,:cutoff]
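End to end, the workaround above runs as follows (same names as the example frame; nothing new assumed):

```python
import re
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# Stringify the labels, enumerate, and keep positions where the regex matches.
matches = [i for i, item in enumerate(df.columns.astype('str').to_list())
           if re.findall('meta', item)]

cutoff = matches[0]        # first 'meta' position, here 4
df = df.iloc[:, :cutoff]   # keep every column before it

print(matches)             # [4, 5]
print(list(df.columns))    # ['Country', 2010, 2011, 2012]
```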
This, however, seems extraordinarily laborious for such a menial task. In R this would be as simple as:
cutoff <- min(grep('meta', colnames(df))) - 1 #-1 to address non-zero indexing
df <- df[, seq(1, cutoff)]
Is there no easier way to do this in pandas, other than to 1) convert to string, 2) convert to list, and 3) enumerate the list? Essentially I would have thought there was an equivalent of the min(grep('meta', colnames(df))) - 1 line.
You can take [4, 5][0] or min([4, 5]) for the first position, but simpler: set na to False, so str.contains returns a clean boolean mask despite the non-string labels; that should return the matching columns for you:

df.columns[df.columns.str.contains('meta', na=False)]

To get rid of the meta columns, you can negate the mask with ~:

df.loc[:, ~df.columns.str.contains('meta', na=False)]
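A quick sketch confirming the na=False approach on the question's frame; the np.flatnonzero step is my addition to cover the R-style min(grep(...)) part of the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# Non-string labels make .str.contains emit NaN; na=False maps those to False,
# so the result is a purely boolean mask.
mask = df.columns.str.contains('meta', na=False)

# R-style min(grep(...)) equivalent: first matching position.
first = int(np.flatnonzero(mask)[0])   # 4

# Drop the meta columns by negating the mask.
df = df.loc[:, ~mask]
print(list(df.columns))   # ['Country', 2010, 2011, 2012]
```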