128

I have a pandas DataFrame in Python. Its structure is as follows:

   a    b    c    d1   d2   d3 
   10   14   12   44  45    78

I would like to select the columns whose names begin with d. Is there a simple way to achieve this in Python?

7 Answers

215

You can use DataFrame.filter this way:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[2,4,4],[4,3,3],[5,9,1]]),columns=['d','t','didi'])
>>
   d  t  didi
0  2  4     4
1  4  3     3
2  5  9     1

df.filter(regex=("d.*"))

>>
   d  didi
0  2     4
1  4     3
2  5     1

The idea is to select the columns whose names match a regular expression.


3 Comments

To get a filtered list of just the column names, use df.filter(regex=("d.*")).columns.to_list().
This seems to be very incorrect - if you replace the column name 't' with 'td', then the regex picks up all three columns. It's as if the regex doesn't start at the beginning of the column name. How can this be fixed?
@FarazMasroor Use ^d.* instead; ^ anchors the pattern to the start of the string. But I think a more accurate regex in this case, assuming numbers follow "d", is ^d\d+ where \d means a digit and + means one or more of the preceding character.
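The point raised in these comments can be checked directly. Here is a minimal sketch, assuming a small made-up frame (the column names `d1`, `td`, `didi`, and `a` are illustrative only, not from the question). `DataFrame.filter(regex=...)` searches for the pattern anywhere in the name, so an unanchored `d.*` also matches `td`. The trailing `$` in the last pattern is an addition for this sketch, not part of the comment's suggestion.

```python
import pandas as pd

# Illustrative frame; the column names are made up for this sketch.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["d1", "td", "didi", "a"])

# Unanchored: matches any name that contains a "d".
unanchored = df.filter(regex="d.*").columns.tolist()

# Anchored with ^: matches only names that start with "d".
anchored = df.filter(regex="^d").columns.tolist()

# Stricter: "d" followed by digits only (the $ is an assumption of this sketch).
d_digits = df.filter(regex=r"^d\d+$").columns.tolist()

print(unanchored)  # ['d1', 'td', 'didi']
print(anchored)    # ['d1', 'didi']
print(d_digits)    # ['d1']
```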
30

Update

Use .str.startswith on df.columns:

import pandas as pd

df = pd.DataFrame([[10, 14, 12, 44, 45, 78]], columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])

df[df.columns[df.columns.str.startswith('d')]]

Result:

   d1  d2  d3
0  44  45  78

This is a nice solution if you're not comfortable with regular expressions.

Old answer, for pandas pre-v0.21.0

Use select:

df.select(lambda col: col.startswith('d'), axis=1)

Note: select was deprecated as of pandas v0.21.0 and removed in pandas 1.0 - thanks to Venkat for pointing this out in the comments.

1 Comment

Beware that select is now getting deprecated
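If you liked the predicate style of select, something similar can be written by hand. This is only a sketch: `select_columns` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame([[10, 14, 12, 44, 45, 78]],
                  columns=["a", "b", "c", "d1", "d2", "d3"])


def select_columns(frame, predicate):
    """Hypothetical helper: keep the columns whose label satisfies predicate."""
    keep = [col for col in frame.columns if predicate(col)]
    return frame[keep]


result = select_columns(df, lambda col: col.startswith("d"))
print(result.columns.tolist())  # ['d1', 'd2', 'd3']
```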
15

On a larger dataset especially, a vectorized approach is actually MUCH FASTER (by more than two orders of magnitude) and is MUCH more readable. I'm providing a screenshot as proof. (Note: Except for the last few lines I wrote at the bottom to make my point clear with a vectorized approach, the other code was derived from the answer by @Alexander.)

[Screenshot of timing results; image not available]

Here's that code for reference:

import pandas as pd
import numpy as np
n = 10000
cols = ['{0}_{1}'.format(letters, number) 
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(30000, n * 3), columns=cols)

%timeit df[[c for c in df if c[0] == 'd']]

%timeit df[[c for c in df if c.startswith('d')]]

%timeit df.select(lambda col: col.startswith('d'), axis=1)

%timeit df.filter(regex=("d.*"))

%timeit df.filter(like='d')

%timeit df.filter(like='d', axis=1)

%timeit df.filter(regex=("d.*"), axis=1)

%timeit df.columns.map(lambda x: x.startswith("d"))

columnVals = df.columns.map(lambda x: x.startswith("d"))

%timeit df.filter(columnVals, axis=1)

4 Comments

I couldn't get your approach to filter my dataframe, using the last 2 lines my result is empty... no columns... does this method still work?
@RachOdwyer I'd think it should work unless perhaps they rolled out a breaking change. If that's the case, please let me know.
a little bit late: you can use df.loc[:, columnVals] instead
This comparison is very misleading. It is so misleading, in fact, that this method is just plain wrong. It is fast because filter is returning an empty dataframe. x.startswith("d") results in True or False, neither of which is a column name, which is why the returned dataframe is empty. The correct way to implement your idea is columnVals = df.columns.map(lambda x: x if x.startswith("d") else None). Then you will see after filtering that the time is the same as the other approaches.
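Following up on the comments above, here is a minimal sketch of a boolean-mask selection that does return the expected columns. It assumes a much smaller frame than the benchmark (`n = 100` is an arbitrary choice for this sketch) and builds the mask with `df.columns.str.startswith` rather than `map`, then passes it to `.loc` instead of `filter`.

```python
import numpy as np
import pandas as pd

n = 100  # deliberately small; the benchmark above uses 10000
cols = ["{0}_{1}".format(letters, number)
        for number in range(n) for letters in ("d", "t", "didi")]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)

# Boolean array with one entry per column: True where the name starts with "d".
mask = df.columns.str.startswith("d")

# .loc accepts a boolean mask on the column axis.
selected = df.loc[:, mask]

# Cross-check against a plain list comprehension.
expected = [c for c in df.columns if c.startswith("d")]
print(selected.shape)  # (3, 200): both the "d_*" and "didi_*" columns
```

Checking the shape of the result before timing it is a cheap way to catch a selection that silently returns nothing.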
7

You can use a list comprehension to iterate over all of the column names in your DataFrame df and then only select those that begin with 'd'.

import pandas as pd

df = pd.DataFrame({'a': {0: 10}, 'b': {0: 14}, 'c': {0: 12},
                   'd1': {0: 44}, 'd2': {0: 45}, 'd3': {0: 78}})

Use list comprehension to iterate over the columns in the dataframe and return their names (c below is a local variable representing the column name).

>>> [c for c in df]
['a', 'b', 'c', 'd1', 'd2', 'd3']

Then only select those beginning with 'd'.

>>> [c for c in df if c[0] == 'd']  # As an alternative to c[0], use c.startswith(...)
['d1', 'd2', 'd3']

Finally, pass this list of columns to the DataFrame.

>>> df[[c for c in df if c.startswith('d')]]
   d1  d2  d3
0  44  45  78

===========================================================================

TIMINGS (added Feb 2018 per comments from devinbost claiming that this method is slow...)

First, let's create a dataframe with 30k columns:

n = 10000
cols = ['{0}_{1}'.format(letters, number) 
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)
>>> df.shape
(3, 30000)

>>> %timeit df[[c for c in df if c[0] == 'd']]  # Simple list comprehension.
# 10 loops, best of 3: 16.4 ms per loop

>>> %timeit df[[c for c in df if c.startswith('d')]]  # More 'pythonic'?
# 10 loops, best of 3: 29.2 ms per loop

>>> %timeit df.select(lambda col: col.startswith('d'), axis=1)  # Solution of gbrener.
# 10 loops, best of 3: 21.4 ms per loop

>>> %timeit df.filter(regex=("d.*"))  # Accepted solution.
# 10 loops, best of 3: 40 ms per loop
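The %timeit lines above only run in IPython. To repeat the comparison in a plain script, here is a rough sketch using the standard-library `timeit` module. It assumes a smaller frame (`n = 1000`) and 20 repetitions, both arbitrary choices, and it leaves out `select` since that method no longer exists in current pandas. Timings will vary by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000  # smaller than above so the sketch runs quickly
cols = ["{0}_{1}".format(letters, number)
        for number in range(n) for letters in ("d", "t", "didi")]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)

candidates = {
    "list comprehension": lambda: df[[c for c in df if c.startswith("d")]],
    "filter(regex='^d')": lambda: df.filter(regex="^d"),
    "loc + str.startswith": lambda: df.loc[:, df.columns.str.startswith("d")],
}

# Make sure every candidate selects the same columns before timing it.
reference = candidates["list comprehension"]().columns.tolist()
for label, func in candidates.items():
    assert func().columns.tolist() == reference
    seconds = timeit.timeit(func, number=20) / 20
    print(f"{label}: {seconds * 1e3:.2f} ms per call")
```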

10 Comments

I don't get the code. What is the c in there? Did you test the code? Please offer some explanation.
c.startswith('d') is probably more pythonic. Either way I like this!
This is extremely slow. A vectorized approach would be greatly preferred.
@devinbost Your request is a pathetic cheap shot and comes nearly two years after the OP's question. The OP asked "Is there a simple way to achieve this in python", to which my reply would work in the majority of situations. If you have a specific requirement that calls on dataframes with a large number of columns or with many dataframes, then I suggest you ask a question more specific to your needs.
@devinbost, the links you posted refer to row-wise optimization, and this post explicitly asked about column-wise selection, so your ranting about community best practices is really out of place. For common data analysis, there will rarely be more than a hundred columns, and there is no need for vectorization.
5

You can use the method startswith with index (columns in this case):

df.loc[:, df.columns.str.startswith('d')]

or match with regex:

df.loc[:, df.columns.str.match('^d')]
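A small sketch of how the string accessors differ, assuming made-up column names (`d1`, `td`, `a`): `str.match` only matches at the beginning of each name, so the `^` is optional there, while `str.contains` matches anywhere in the name.

```python
import pandas as pd

# Illustrative frame; the column names are made up for this sketch.
df = pd.DataFrame([[1, 2, 3]], columns=["d1", "td", "a"])

# str.match anchors at the start of each name, with or without "^".
by_match = df.loc[:, df.columns.str.match("d")].columns.tolist()

# str.contains searches the whole name, so "td" is included too.
by_contains = df.loc[:, df.columns.str.contains("d")].columns.tolist()

print(by_match)     # ['d1']
print(by_contains)  # ['d1', 'td']
```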

Comments

3

You can also use

df.filter(regex='^d')

1 Comment

What if I want to filter columns that end with d?
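For the ends-with case, one possible sketch (the column names here are made up) is to anchor the regex at the end with `$`, or to use `str.endswith` on the column index:

```python
import pandas as pd

# Illustrative frame; the column names are made up for this sketch.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["d1", "end", "td", "a"])

# "$" anchors the pattern to the end of the column name.
by_regex = df.filter(regex="d$").columns.tolist()

# Same selection without a regex.
by_endswith = df.loc[:, df.columns.str.endswith("d")].columns.tolist()

print(by_regex)     # ['end', 'td']
print(by_endswith)  # ['end', 'td']
```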
0

For each column name, extract the substring that starts at the first a, b, or c and runs through the last '_'; then drop non-matches (NA), remove duplicates, and sort.

df.columns.str.extract(r'([abc].*_)', expand=False).dropna().drop_duplicates().sort_values()
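To make the behaviour concrete, here is a small sketch on made-up column names (none of them come from the question). Names without a match, or without an underscore after the first a, b, or c, come back as NA and are dropped.

```python
import pandas as pd

# Illustrative frame; the column names are made up for this sketch.
df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=["a_1_x", "b_2", "zzz", "a_1_y", "c"])

prefixes = (
    df.columns.str.extract(r"([abc].*_)", expand=False)  # greedy up to the last "_"
    .dropna()            # "zzz" and "c" have no match
    .drop_duplicates()   # "a_1_" appears twice
    .sort_values()
)

print(prefixes.tolist())  # ['a_1_', 'b_']
```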

Comments
