I have a dataframe, df, and a list of strings, cols_needed, which indicate the columns I want to retain in df. The column names in df do not exactly match the strings in cols_needed, so I cannot directly use something like intersection. But the column names do contain the strings in cols_needed. I tried playing around with str.contains but couldn't get it to work. How can I subset df based on cols_needed?

import pandas as pd
df = pd.DataFrame({
    'sim-prod1': [1,2],
    'sim-prod2': [3,4],
    'sim-prod3': [5,6],
    'sim_prod4': [7,8]
})

cols_needed = ['prod1', 'prod2']

# What I want to obtain:
#    sim-prod1  sim-prod2
# 0          1          3
# 1          2          4

3 Answers

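Use the regex option of DataFrame.filter, joining the needed names with | so that a column is kept if its name contains any of them: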

df.filter(regex='|'.join(cols_needed))

   sim-prod1  sim-prod2
0          1          3
1          2          4
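
One caveat worth noting: the joined string is interpreted as a regular expression, so a name containing a character such as `.` would act as a pattern rather than literal text. A minimal sketch, assuming the entries in cols_needed are meant to be matched literally (the dotted column names below are made up for illustration and are not from the question):

```python
import re

import pandas as pd

# Hypothetical frame: the names contain '.', a regex metacharacter.
df = pd.DataFrame({
    'sim-prod.1': [1, 2],
    'sim-prodx1': [3, 4],
    'sim-prod.2': [5, 6],
})
cols_needed = ['prod.1', 'prod.2']

# Escape each name so it is matched as literal text, then join with '|'.
pattern = '|'.join(re.escape(name) for name in cols_needed)
result = df.filter(regex=pattern)
print(result.columns.tolist())  # ['sim-prod.1', 'sim-prod.2']
```

Without re.escape, the unescaped `prod.1` would also match `sim-prodx1`, because `.` matches any character.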

You can use str.contains on the column names with a joined pattern, for example:

df.loc[:, df.columns.str.contains('|'.join(cols_needed))]

Output:

   sim-prod1  sim-prod2
0          1          3
1          2          4
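
To see why this works, it may help to look at the intermediate boolean mask. A small sketch using the question's data (the names `mask` and `subset` are introduced here only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'sim-prod1': [1, 2],
    'sim-prod2': [3, 4],
    'sim-prod3': [5, 6],
    'sim_prod4': [7, 8],
})
cols_needed = ['prod1', 'prod2']

# One boolean per column: True where the name contains any needed string.
mask = df.columns.str.contains('|'.join(cols_needed))
print(mask.tolist())  # [True, True, False, False]

# Selecting with the mask keeps only the columns marked True.
subset = df.loc[:, mask]
print(subset.columns.tolist())  # ['sim-prod1', 'sim-prod2']
```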

A list comprehension could work as well:

# Keep each column once if its name contains any of the needed strings
columns = [col for col in df
           if any(needed in col for needed in cols_needed)]

['sim-prod1', 'sim-prod2']

df.loc[:, columns]

   sim-prod1  sim-prod2
0          1          3
1          2          4
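
One thing to watch with a plain substring test: if a column name contains more than one of the needed strings, a nested loop over both lists would list that column twice. A short sketch of the difference, assuming an overlapping list (`overlapping` is a made-up example, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'sim-prod1': [1, 2],
    'sim-prod2': [3, 4],
    'sim-prod3': [5, 6],
    'sim_prod4': [7, 8],
})
# Assumed for illustration: 'sim-prod1' contains both 'prod1' and 'od1'.
overlapping = ['prod1', 'od1', 'prod2']

# Nested loops emit a column once per matching string.
nested = [c for c in df for needed in overlapping if needed in c]
# any() tests each column once, so it appears at most once.
once = [c for c in df if any(needed in c for needed in overlapping)]

print(nested)  # ['sim-prod1', 'sim-prod1', 'sim-prod2']
print(once)    # ['sim-prod1', 'sim-prod2']
```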

1 Comment

Nice, or just df[columns] in this case
