
I have a pandas dataframe with the following column names:

Result1, Test1, Result2, Test2, Result3, Test3, etc...

I want to drop all the columns whose name contains the word "Test". The numbers of such columns is not static but depends on a previous function.

How can I do that?


13 Answers


Here is one way to do this:

df = df[df.columns.drop(list(df.filter(regex='Test')))]
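A minimal sketch of what this line does, using toy data named after the question's column scheme:

```python
import pandas as pd

# Hypothetical frame following the OP's naming pattern.
df = pd.DataFrame({'Result1': [1, 2], 'Test1': [3, 4],
                   'Result2': [5, 6], 'Test2': [7, 8]})

# df.filter(regex='Test') selects the matching columns;
# df.columns.drop(...) removes those labels from the full column Index.
df = df[df.columns.drop(list(df.filter(regex='Test')))]
print(list(df.columns))  # ['Result1', 'Result2']
```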

7 Comments

Or directly in place: df.drop(list(df.filter(regex = 'Test')), axis = 1, inplace = True)
This is a much more elegant solution than the accepted answer. I would break it down a bit more to show why, mainly extracting list(df.filter(regex='Test')) to better show what the line is doing. I would also opt for df.filter(regex='Test').columns over list conversion
I really wonder what the comments calling this answer "elegant" mean. I myself find it quite obfuscated, when Python code should first be readable. It is also twice as slow as the first answer. And it uses the regex keyword when the like keyword seems more adequate.
This is not actually as good an answer as people claim. The problem with filter is that it returns a copy of ALL the data as columns that you want to drop. It is wasteful if you're only passing this result to drop (which again returns a copy)... a better solution would be str.startswith (I've added an answer with that here).
for multiple conditions, this can be done df.drop(df.filter(regex='Test|Rest|Best').columns, axis=1, inplace=True)

Cheaper, Faster, and Idiomatic: str.startswith and str.contains

In recent versions of pandas, you can use string methods directly on the index and columns. For prefix matching, str.startswith is a good fit.

To remove all columns starting with a given substring:

df.columns.str.startswith('Test')
# array([ True, False, False, False])

df.loc[:,~df.columns.str.startswith('Test')]

  toto test2 riri
0    x     x    x
1    x     x    x

For case-insensitive matching, you can use regex-based matching with str.contains with a start-of-line (^) anchor:

df.columns.str.contains('^test', case=False)
# array([ True, False,  True, False])

df.loc[:,~df.columns.str.contains('^test', case=False)] 

  toto riri
0    x    x
1    x    x

If mixed types are a possibility, specify na=False as well.
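A runnable sketch of both variants above (data and column names are toy values):

```python
import pandas as pd

df = pd.DataFrame([['x'] * 4] * 2,
                  columns=['Test1', 'toto', 'test2', 'riri'])

# Case-sensitive prefix match: drops only 'Test1'.
out1 = df.loc[:, ~df.columns.str.startswith('Test')]
# Case-insensitive anchored regex: drops 'Test1' and 'test2'.
out2 = df.loc[:, ~df.columns.str.contains('^test', case=False)]

print(list(out1.columns))  # ['toto', 'test2', 'riri']
print(list(out2.columns))  # ['toto', 'riri']
```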

3 Comments

Hi cs95, can you explain the syntax / thought behind the syntax a bit more? Why do we need to use the colon and comma? Thus why df.loc[:,df....] vs df.loc[df....]?
Where the accepted answer does not work properly for columns ending in _drop in my test data, this solution does work. This should be the accepted answer.
If you want to combine this with the drop method, you can do: df.drop(columns = df.columns[df.columns.str.startswith('Test')], inplace = True)
import pandas as pd
import numpy as np

array = np.random.random((2, 4))
df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print(df)

      Test1      toto     test2      riri
0  0.923249  0.572528  0.845464  0.144891
1  0.020438  0.332540  0.144455  0.741412

cols = [c for c in df.columns if c.lower()[:4] != 'test']
df = df[cols]
print(df)

       toto      riri
0  0.572528  0.144891
1  0.332540  0.741412

1 Comment

The OP didn't specify that the removal should be case insensitive.

This can be done neatly in one line with:

df = df.drop(df.filter(regex='Test').columns, axis=1)

4 Comments

Similarly (and faster): df.drop(df.filter(regex='Test').columns, axis=1, inplace=True)
for multiple conditions, this can be done df.drop(df.filter(regex='Test|Rest|Best').columns, axis=1, inplace=True)
Awesome adaptation of the above solution to filter for multiple conditions! Thank you for posting this :)
@MaxGhenis I don't think doing anything with inplace = True can be considered fast these days, given that the developers are considering removing this parameter altogether.

You can select only the columns you DO want using filter:

import pandas as pd
import numpy as np

data2 = [{'test2': 1, 'result1': 2}, {'test': 5, 'result34': 10, 'c': 20}]

df = pd.DataFrame(data2)

df

      c  result1  result34  test  test2
0   NaN      2.0       NaN   NaN    1.0
1  20.0      NaN      10.0   5.0    NaN

Now filter

df.filter(like='result',axis=1)

Get:

   result1  result34
0      2.0       NaN
1      NaN      10.0

2 Comments

Best answer! Thanks. How do you filter for the opposite, i.e. not like='result'?
then do this: df=df.drop(df.filter(like='result',axis=1).columns,axis=1)

Using a regex to match all columns not containing the unwanted word:

df = df.filter(regex='^((?!badword).)*$')
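A sketch applying this to the question's columns, substituting 'Test' for badword (the negative lookahead matches only names that contain 'Test' nowhere):

```python
import pandas as pd

df = pd.DataFrame(columns=['Result1', 'Test1', 'Result2', 'Test2'])

# '^((?!Test).)*$' succeeds only if no position in the name starts 'Test'.
kept = df.filter(regex='^((?!Test).)*$')
print(list(kept.columns))  # ['Result1', 'Result2']
```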

Comments


Use the DataFrame.select method (note: select has since been deprecated and removed in newer pandas versions; see the comments below):

In [38]: df = DataFrame({'Test1': randn(10), 'Test2': randn(10), 'awesome': randn(10)})

In [39]: df.select(lambda x: not re.search(r'Test\d+', x), axis=1)
Out[39]:
   awesome
0    1.215
1    1.247
2    0.142
3    0.169
4    0.137
5   -0.971
6    0.736
7    0.214
8    0.111
9   -0.214
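Since DataFrame.select was later removed, here is a sketch of an equivalent without the deprecated method, using the same (randomly generated) columns as above:

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({'Test1': np.random.randn(10),
                   'Test2': np.random.randn(10),
                   'awesome': np.random.randn(10)})

# Equivalent of df.select(lambda x: not re.search(r'Test\d+', x), axis=1):
# build the keep-list from the column labels and index with it.
out = df[[c for c in df.columns if not re.search(r'Test\d+', c)]]
print(list(out.columns))  # ['awesome']
```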

4 Comments

And the op did not specify that a number had to follow 'Test': I want to drop all the columns whose name contains the word "Test".
The assumption that a number follows Test is perfectly reasonable. Reread the question.
now seeing: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement
Remember to import re beforehand.

This method does everything in place. Many of the other answers create copies and are not as efficient:

df.drop(df.columns[df.columns.str.contains('Test')], axis=1, inplace=True)
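A minimal sketch (toy column names) showing that only the column Index is consulted before the drop:

```python
import pandas as pd

df = pd.DataFrame(columns=['Result1', 'Test1', 'Result2', 'Test2'])

# df.columns[...] indexes the label Index with a boolean mask;
# no copy of the data is filtered first.
df.drop(df.columns[df.columns.str.contains('Test')], axis=1, inplace=True)
print(list(df.columns))  # ['Result1', 'Result2']
```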

Comments


Question states 'I want to drop all the columns whose name contains the word "Test".'

test_columns = [col for col in df if 'Test' in col]
df.drop(columns=test_columns, inplace=True)

Comments


You can use df.filter to get the list of columns that match your string and then use df.drop

resdf = df.drop(df.filter(like='Test',axis=1).columns.to_list(), axis=1)

2 Comments

This was already covered by this answer.
While the answer linked in the above comment is similar, it is not the same. In fact, it's nearly the opposite.

I do not recommend using the filter method, because it returns a copy of the matching columns, which is not good for larger datasets.

Instead, pandas provides regex filtering of columns using str.match:

df.columns.str.match('.*Test.*')
# array([ True, False, False, False])

(This returns a boolean array for 'Test' anywhere in the column names, not just at the start.)

Use .loc to select the columns via the boolean array. Note that ~ inverts the boolean array, since we want to drop (not keep) all the columns that contain 'Test':

df = df.loc[:, ~df.columns.str.match('.*Test.*')]

In this way, only the column names are needed for the filtering, and we never need to return a copy of the filtered data. Note there are other str methods that can be used on the column names, like startswith and endswith, but match provides the power of regex, so it is the most universal.
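A runnable sketch with toy column names:

```python
import pandas as pd

df = pd.DataFrame(columns=['Test1', 'toto', 'test2', 'riri'])

# str.match is case-sensitive, so 'test2' is not flagged here.
mask = df.columns.str.match('.*Test.*')
print(mask.tolist())       # [True, False, False, False]

df = df.loc[:, ~mask]
print(list(df.columns))    # ['toto', 'test2', 'riri']
```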

Comments


A solution for dropping columns that match a list of name patterns (regex). I prefer this approach because I'm frequently editing the drop list. It builds a negative-lookahead keep regex from the drop list.

import re

drop_column_names = ['A', 'B.+', 'C.*']
# Keep-pattern: matches any name that does not fully match one of the drop patterns.
drop_columns_regex = '^(?!(?:' + '|'.join(drop_column_names) + ')$)'
print('Dropping columns:', ', '.join(c for c in df.columns if not re.search(drop_columns_regex, c)))
df = df.filter(regex=drop_columns_regex, axis=1)

Comments


Building on my preferred answer by @cs95, combining loc with a lambda function enables a nice clean pipe chain like this:

output_df = (
    input_df
    .stuff
    .more_stuff
    .yet_more_stuff
    .loc[:, lambda x: ~x.columns.str.startswith('Test')]
)

This way you can refer to columns of the dataframe produced by pd.DataFrame.yet_more_stuff, rather than the original dataframe input_df itself, as the columns may have changed (depending, of course, on all the stuff).
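The placeholder methods above aren't runnable as-is; a minimal concrete sketch of the same pattern (the assign step and column names are stand-ins for whatever your pipeline does):

```python
import pandas as pd

input_df = pd.DataFrame({'Test1': [1, 2], 'keep': [3, 4]})

# The .loc lambda sees the columns as they exist at that point in the
# chain, so it also drops 'Test_new', which input_df never had.
output_df = (
    input_df
    .assign(Test_new=lambda x: x['Test1'] * 2)
    .loc[:, lambda x: ~x.columns.str.startswith('Test')]
)
print(list(output_df.columns))  # ['keep']
```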

Comments
