
I have a pandas DataFrame called df_A which, in practice, has more than 100 columns.

I also have another DataFrame, df_b, in which two columns tell me which columns I need from df_A.

A reproducible example is given below:

import pandas as pd

d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555],'foo2':[110, 101, 222], 
     'bar2':[333, 444, 555],'foo3':[100, 111, 222], 
     'bar3':[333, 444, 555]}

df_A = pd.DataFrame(d)

d = {'ReqCol_A':['foo','foo2'], 
     'bar':[333, 444],'foo2':[100, 111], 
     'bar2':[333, 444],'ReqCol_B':['bar3', ''], 
     'bar3':[333, 444]}

df_b = pd.DataFrame(d)

As can be seen in df_b above, the values under ReqCol_A and ReqCol_B are the columns I am trying to get from df_A.

So my expected output will have three columns from df_A: foo, foo2 and bar3.

df_C will be the expected output and it will look like

df_C
foo foo2 bar3
100 110  333
111 101  444
222 222  555

Please help me with this. I am struggling to get this.

Comments:
  • I'm a little lost on what you are after. Can you provide your expected output? Commented Jan 17, 2019 at 19:49
  • @busybear added to the question now. Thank you. Commented Jan 17, 2019 at 19:50

2 Answers


Try using filter to get only those columns whose name contains 'ReqCol', then stack to get a list of the wanted column names and select them from the df_A dataframe (note this needs numpy imported for np.nan):

import numpy as np

df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]

Output:

   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
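Broken into steps with the example data from the question, this approach works roughly like this (dropna() is added here to make the NaN removal explicit across pandas versions):

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

# 1. keep only the df_b columns whose name contains 'ReqCol'
req = df_b.filter(like='ReqCol')

# 2. turn empty strings into NaN so they can be dropped
req = req.replace('', np.nan)

# 3. stack() flattens the frame into a Series of wanted column names,
#    row by row; dropna() removes the placeholder for the empty cell
wanted = req.stack().dropna().tolist()   # ['foo', 'bar3', 'foo2']

# 4. select those columns from df_A
df_C = df_A[wanted]
```

Because stack() walks the frame row by row, the resulting column order is foo, bar3, foo2 rather than the order they appear in df_A.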



Solution:

# collect all the unique values from the df_b columns ReqCol_A and ReqCol_B
# (this may include '' or NaN; the intersection below filters those out)
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# intersect with df_A's column names to keep only the columns that exist
target_features = set(df_A.columns) & features

# get the output (.loc needs a list, not a set, in recent pandas versions)
df_A.loc[:, list(target_features)]
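One caveat with sets: they do not preserve order, so the selected columns may come out in any order. If the original df_A column ordering matters, a sketch using Index.intersection (which keeps the caller's order) on the question's data:

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# Index.intersection keeps df_A's column order; '' and anything
# that is not a real df_A column simply drops out
target = df_A.columns.intersection(list(features))

df_C = df_A.loc[:, target]
```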

Performance comparison

This answer's method:

%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:, list(target_features)]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Second answer (using filter):

%%timeit 
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Clearly, this method is much faster than the other.

2 Comments

What if you had more than 10 'ReqCol_x' columns — would you type them out manually or use filter?
@ScottBoston In that case, I would use filter; but with the given constraint I don't think filter is needed.
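Following up on the comment thread: with many ReqCol_* columns, filter could collect them and the set logic would stay the same. A sketch, assuming all the requirement columns share the 'ReqCol' prefix:

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'bar': [333, 444],
                     'foo2': [100, 111], 'bar2': [333, 444],
                     'ReqCol_B': ['bar3', ''], 'bar3': [333, 444]})

# gather every value appearing in any 'ReqCol*' column, dropping empties
features = set(df_b.filter(like='ReqCol').to_numpy().ravel()) - {''}

# keep only names that really are columns of df_A, preserving df_A's order
target = [c for c in df_A.columns if c in features]

df_C = df_A[target]
```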
