
I have a pandas DataFrame called df_A which, in practice, has more than 100 columns.

I also have another DataFrame, df_b, in which two columns tell me which columns I need from df_A.

A reproducible example is given below:

import pandas as pd

d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555],'foo2':[110, 101, 222], 
     'bar2':[333, 444, 555],'foo3':[100, 111, 222], 
     'bar3':[333, 444, 555]}

df_A = pd.DataFrame(d)

d = {'ReqCol_A':['foo','foo2'], 
     'bar':[333, 444],'foo2':[100, 111], 
     'bar2':[333, 444],'ReqCol_B':['bar3', ''], 
     'bar3':[333, 444]}

df_b = pd.DataFrame(d)

As can be seen in df_b above, the values under ReqCol_A and ReqCol_B are the columns I am trying to get from df_A.

So my expected output will have three columns from df_A: foo, foo2 and bar3.

df_C will be the expected output and it will look like

df_C
foo foo2 bar3
100 110  333
111 101  444
222 222  555

Please help me with this. I am struggling to get this.

Comments:
  • I'm a little lost on what you are after. Can you provide your expected output? Commented Jan 17, 2019 at 19:49
  • @busybear added to the question now. Thank you. Commented Jan 17, 2019 at 19:50

2 Answers


Try using filter to get only those columns whose name contains 'ReqCol', then stack to get a list of the wanted column names and select them from the df_A dataframe (note this needs numpy imported for np.nan):

import numpy as np

df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]

Output:

   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
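Broken into steps with the example data from the question, this approach works roughly like this (dropna() is added here to make the NaN removal explicit across pandas versions):

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

# 1. keep only the df_b columns whose name contains 'ReqCol'
req = df_b.filter(like='ReqCol')

# 2. turn empty strings into NaN so they can be dropped
req = req.replace('', np.nan)

# 3. stack() flattens the frame into a Series of wanted column names,
#    row by row; dropna() removes the placeholder for the empty cell
wanted = req.stack().dropna().tolist()   # ['foo', 'bar3', 'foo2']

# 4. select those columns from df_A
df_C = df_A[wanted]
```

Because stack() walks the frame row by row, the resulting column order is foo, bar3, foo2 rather than the order they appear in df_A.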



Solution:

# collect all the unique values from the df_b columns ReqCol_A and ReqCol_B
# (this may include '' or NaN; the intersection below filters those out)
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# intersect with df_A's column names to keep only the columns that exist
target_features = set(df_A.columns) & features

# get the output (.loc needs a list, not a set, in recent pandas versions)
df_A.loc[:, list(target_features)]
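One caveat with sets: they do not preserve order, so the selected columns may come out in any order. If the original df_A column ordering matters, a sketch using Index.intersection (which keeps the caller's order) on the question's data:

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# Index.intersection keeps df_A's column order; '' and anything
# that is not a real df_A column simply drops out
target = df_A.columns.intersection(list(features))

df_C = df_A.loc[:, target]
```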

Performance comparison

This answer's method:

%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:, list(target_features)]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Second answer (using filter):

%%timeit 
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Clearly, this method is much faster than the other.

2 Comments

What if you had more than 10 'ReqCol_x' columns — would you type them out manually or use filter?
@ScottBoston In that case, I would use filter; but with the given constraint I don't think filter is needed.
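Following up on the comment thread: with many ReqCol_* columns, filter could collect them and the set logic would stay the same. A sketch, assuming all the requirement columns share the 'ReqCol' prefix:

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

# gather every value appearing in any 'ReqCol*' column, dropping empties
features = set(df_b.filter(like='ReqCol').to_numpy().ravel()) - {''}

# keep only names that really are columns of df_A, preserving df_A's order
target = [c for c in df_A.columns if c in features]

df_C = df_A[target]
```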
