How to select columns from dataframe by regex

Question

I have a dataframe in python pandas. The structure of the dataframe is as the following:

   a    b    c    d1   d2   d3
   10   14   12   44   45   78

I would like to select the columns which begin with d. Is there a simple way to achieve this in Python?

Eric Leung · Accepted Answer · 2020-05-08 00:31:53Z

215

You can use DataFrame.filter this way:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[2,4,4],[4,3,3],[5,9,1]]),columns=['d','t','didi'])
>>
   d  t  didi
0  2  4     4
1  4  3     3
2  5  9     1

df.filter(regex=("d.*"))

>>
   d  didi
0  2     4
1  4     3
2  5     1

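The idea is to select columns by regex. Note that filter applies the pattern with re.search, so it can match anywhere in the column name; use "^d" to keep only names that begin with d.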

edited May 8, 2020 at 0:31

Eric Leung

2,6521 gold badge18 silver badges25 bronze badges

answered Jun 12, 2015 at 17:04

farhawa

10.5k16 gold badges58 silver badges93 bronze badges


3 Comments

BSalita Over a year ago

To get a filtered list of just column names df.filter(regex=("d.*")).columns.to_list().

Faraz Masroor Over a year ago

This seems to be very incorrect - if you replace the column name 't' with 'td', then the regex picks up all three columns. It's as if the regex doesn't start at the beginning of the column name. How can this be fixed?

Rimov Over a year ago

@FarazMasroor search for ^d.*; ^ anchors the regex that follows it to the start of the string. But I think a more accurate regex in this case -- assuming numbers follow "d" -- is ^d\d+ where \d means a digit and + means one or more of the preceding character
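
A minimal sketch of the anchoring issue raised in these comments. The frame is hypothetical: it reuses the answer's data with 't' renamed to 'td', as Faraz Masroor suggests.

```python
import pandas as pd

# Hypothetical frame: the answer's data with column 't' renamed to 'td'.
df = pd.DataFrame([[2, 4, 4], [4, 3, 3], [5, 9, 1]], columns=['d', 'td', 'didi'])

# Unanchored pattern: matches a 'd' anywhere in the column name.
unanchored = df.filter(regex='d.*')

# Anchored pattern: matches only names that start with 'd'.
anchored = df.filter(regex='^d')

print(list(unanchored.columns))  # ['d', 'td', 'didi']
print(list(anchored.columns))    # ['d', 'didi']
```

The only difference between the two calls is the leading ^, which is what restricts the match to the beginning of each name.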

Greg · Accepted Answer · 2023-06-28 23:38:41Z

30

Update

Use .str.startswith on df.columns:

import pandas as pd

df = pd.DataFrame([[10, 14, 12, 44, 45, 78]], columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])

df[df.columns[df.columns.str.startswith('d')]]

Result:

   d1  d2  d3
0  44  45  78

This is a nice solution if you're not comfortable with regular expressions.

Old answer, for pandas pre-v0.21.0

Use select:

df.select(lambda col: col.startswith('d'), axis=1)

Note: select was deprecated as of pandas v0.21.0 - thanks to Venkat for pointing this out in the comments.

edited Jun 28, 2023 at 23:38

answered Jun 12, 2015 at 17:12

Greg

5,9452 gold badges20 silver badges20 bronze badges

1 Comment

Venkat Ramana Over a year ago

Beware that select is now getting deprecated

devinbost · Accepted Answer · 2018-09-03 21:01:39Z

15

On a larger dataset especially, a vectorized approach is actually MUCH FASTER (by more than two orders of magnitude) and is MUCH more readable. I'm providing a screenshot as proof. (Note: Except for the last few lines I wrote at the bottom to make my point clear with a vectorized approach, the other code was derived from the answer by @Alexander.)

Here's that code for reference:

import pandas as pd
import numpy as np
n = 10000
cols = ['{0}_{1}'.format(letters, number) 
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(30000, n * 3), columns=cols)

%timeit df[[c for c in df if c[0] == 'd']]

%timeit df[[c for c in df if c.startswith('d')]]

%timeit df.select(lambda col: col.startswith('d'), axis=1)

%timeit df.filter(regex=("d.*"))

%timeit df.filter(like='d')

%timeit df.filter(like='d', axis=1)

%timeit df.filter(regex=("d.*"), axis=1)

%timeit df.columns.map(lambda x: x.startswith("d"))

columnVals = df.columns.map(lambda x: x.startswith("d"))

%timeit df.filter(columnVals, axis=1)

answered Sep 3, 2018 at 21:01

devinbost

5,1443 gold badges52 silver badges64 bronze badges

4 Comments

Rach Odwyer Over a year ago

I couldn't get your approach to filter my dataframe, using the last 2 lines my result is empty... no columns... does this method still work?

devinbost Over a year ago

@RachOdwyer I'd think it should work unless perhaps they rolled out a breaking change. If that's the case, please let me know.

innovatism Over a year ago

a little bit late: you can use df.loc[:, columnVals] instead

Alexander Over a year ago

This comparison is very misleading. It is so misleading, in fact, that this method is just plain wrong. It is fast because filter is returning an empty dataframe. x.startswith("d") results in True or False, neither of which are column names and hence why the returned dataframe is empty. The correct way to implement your idea is columnVals = df.columns.map(lambda x: x if x.startswith("d") else None). Then you will see after filtering that the time is the same as the other approaches.
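
A small sketch of the point made in the comments above, on a hypothetical six-column frame rather than the 30,000-column benchmark. It assumes df.columns.str.startswith for building the mask (instead of the map/lambda in the answer) and uses df.loc as innovatism suggests.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same naming scheme as the benchmark.
cols = ['d_0', 't_0', 'didi_0', 'd_1', 't_1', 'didi_1']
df = pd.DataFrame(np.zeros((2, len(cols))), columns=cols)

# Boolean mask with one entry per column.
mask = df.columns.str.startswith('d')

# filter(items=...) expects column labels; True/False are not labels here,
# so nothing is selected.
not_selected = df.filter(items=mask.tolist(), axis=1)

# .loc accepts a boolean mask on the column axis and does the selection.
selected = df.loc[:, mask]

print(not_selected.shape)
print(list(selected.columns))  # ['d_0', 'didi_0', 'd_1', 'didi_1']
```

The first call returns a frame with no columns, which is why it appears so fast in the timings; the second returns the four columns whose names start with 'd'.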

Alexander · Accepted Answer · 2018-02-09 02:44:49Z

7

You can use a list comprehension to iterate over all of the column names in your DataFrame df and then only select those that begin with 'd'.

import pandas as pd

df = pd.DataFrame({'a': {0: 10}, 'b': {0: 14}, 'c': {0: 12},
                   'd1': {0: 44}, 'd2': {0: 45}, 'd3': {0: 78}})

Use list comprehension to iterate over the columns in the dataframe and return their names (c below is a local variable representing the column name).

>>> [c for c in df]
['a', 'b', 'c', 'd1', 'd2', 'd3']

Then only select those beginning with 'd'.

>>> [c for c in df if c[0] == 'd']  # As an alternative to c[0], use c.startswith(...)
['d1', 'd2', 'd3']

Finally, pass this list of columns to the DataFrame.

>>> df[[c for c in df if c.startswith('d')]]
   d1  d2  d3
0  44  45  78

===========================================================================

TIMINGS (added Feb 2018 per comments from devinbost claiming that this method is slow...)

First, let's create a dataframe with 30k columns:

import numpy as np

n = 10000
cols = ['{0}_{1}'.format(letters, number) 
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)
>>> df.shape
(3, 30000)

>>> %timeit df[[c for c in df if c[0] == 'd']]  # Simple list comprehension.
# 10 loops, best of 3: 16.4 ms per loop

>>> %timeit df[[c for c in df if c.startswith('d')]]  # More 'pythonic'?
# 10 loops, best of 3: 29.2 ms per loop

>>> %timeit df.select(lambda col: col.startswith('d'), axis=1)  # Solution of gbrener.
# 10 loops, best of 3: 21.4 ms per loop

>>> %timeit df.filter(regex=("d.*"))  # Accepted solution.
# 10 loops, best of 3: 40 ms per loop

edited Feb 9, 2018 at 2:44

answered Jun 12, 2015 at 16:59

Alexander

111k32 gold badges212 silver badges208 bronze badges

10 Comments

Yan Song Over a year ago

I don't get the code. what is the c in there. and did you test the code, please offer some explanations.

Adam Smith Over a year ago

c.startswith('d') is probably more pythonic. Either way I like this!

devinbost Over a year ago

This is extremely slow. A vectorized approach would be greatly preferred.

Alexander Over a year ago

@devinbost Your request is a pathetic cheap shot and comes nearly two years after the OP's question. The OP asked "Is there a simple way to achieve this in python", to which my reply would work in the majority of situations. If you have a specific requirement that calls on dataframes with a large number of columns or with many dataframes, then I suggest you ask a question more specific to your needs.

Louis R Over a year ago

@devinbost, the links you posted refer to optimization row-wise, and this post explicitly asked about selection column-wise, so your ranting about community best practices are really out of place. For common data analysis, columns will rarely be more than a hundred, and there is no need for vectorization.


Mykola Zotko · Accepted Answer · 2021-10-25 18:09:45Z

5

You can use the method startswith with index (columns in this case):

df.loc[:, df.columns.str.startswith('d')]

or match with regex:

df.loc[:, df.columns.str.match('^d')]
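
A short sketch of both forms on the frame from the question. Since str.match anchors at the start of each name, the ^ is optional there; the pattern d\d+ below is an illustrative assumption (a 'd' followed by digits), not part of the answer.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame([[10, 14, 12, 44, 45, 78]],
                  columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])

# Plain prefix test, no regex involved.
by_prefix = df.loc[:, df.columns.str.startswith('d')]

# Regex test; str.match only matches at the start of each name.
by_regex = df.loc[:, df.columns.str.match(r'd\d+')]

print(list(by_prefix.columns))  # ['d1', 'd2', 'd3']
print(list(by_regex.columns))   # ['d1', 'd2', 'd3']
```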

edited Oct 25, 2021 at 18:09

answered Jun 22, 2021 at 16:57

Mykola Zotko

18.2k6 gold badges88 silver badges90 bronze badges


prafi · Accepted Answer · 2018-02-02 07:04:07Z

3

You can also use

df.filter(regex='^d')

answered Feb 2, 2018 at 7:04

prafi

9909 silver badges11 bronze badges

1 Comment

ah bon Over a year ago

If I want to filter columns endswith d?
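
One possible way to handle the endswith case asked about here, sketched on a hypothetical frame whose column names are chosen so that two of them end with 'd'.

```python
import pandas as pd

# Hypothetical column names; 'id' and 'speed' end with 'd'.
df = pd.DataFrame([[1, 2, 3, 4]], columns=['id', 'day', 'speed', 'd1'])

# Regex form: '$' anchors the pattern at the end of the name.
by_regex = df.filter(regex='d$')

# Non-regex form: boolean mask from str.endswith.
by_suffix = df.loc[:, df.columns.str.endswith('d')]

print(list(by_regex.columns))   # ['id', 'speed']
print(list(by_suffix.columns))  # ['id', 'speed']
```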

BSalita · Accepted Answer · 2021-06-27 18:50:28Z

0

Get any substring of column names starting with a [abc] until '_', drop any non-matches (NA), remove duplicates and sort.

df.columns.str.extract(r'([abc].*_)', expand=False).dropna().drop_duplicates().sort_values()
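
A minimal sketch of what this one-liner returns, using hypothetical column names. Note that it produces an Index of extracted prefixes, not a filtered DataFrame.

```python
import pandas as pd

# Hypothetical column names; 'z_9' contains none of a, b, c.
df = pd.DataFrame([[1, 2, 3, 4]], columns=['b_2_x', 'a_1_x', 'a_1_y', 'z_9'])

prefixes = (
    df.columns.str.extract(r'([abc].*_)', expand=False)  # first match per name, NaN if none
    .dropna()            # drop names without a match
    .drop_duplicates()   # 'a_1_' appears twice
    .sort_values()
)

print(prefixes.tolist())  # ['a_1_', 'b_2_']
```

Because .* is greedy, each extracted piece runs from the first a, b, or c up to the last underscore in the name.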

answered Jun 27, 2021 at 18:50

BSalita

9,07111 gold badges59 silver badges75 bronze badges
