import pandas as pd

data = {'a':['a','b','c','d','e','f','g'],
        'b':['Y','N','Y','Y','Y','N','Y'],
        'c':['Qualified','Unqualified','Qualified','Unqualified','Qualified','Unqualified','Qualified']}
df = pd.DataFrame(data)

df_para = {'Y/N':['','y','n'],
        'Q/U':['unqualified','','unqualified']}
df_para = pd.DataFrame(df_para)

I would like to filter df using df_para. My code is:

df_output = pd.DataFrame()

for para in df_para.iterrows():
    df_result = df
     # filter Q/U
    if '' not in df_para['Q/U']:
        mask_qu = df_result['c'].str.lower().isin(df_para['Q/U'])
        df_result = df_result.loc[(mask_qu)]
        
    # filter Y/N
    if '' not in df_para['Y/N']:
        mask_yn = df_result['b'].str.lower().isin(df_para['Y/N'])
        df_result = df_result.loc[(mask_yn)]

    df_output = df_output.append(df_result)

When I run my code, it returns all rows of df three times. However, df_output should look like this:

   a   b   c
1   b   N   Unqualified
3   d   Y   Unqualified
5   f   N   Unqualified
0   a   Y   Qualified
2   c   Y   Qualified
3   d   Y   Unqualified
4   e   Y   Qualified
6   g   Y   Qualified
1   b   N   Unqualified
5   f   N   Unqualified

How could I fix it?

  • What is the output if 'Q/U':['unqualified','unqualified']? Commented Jun 23, 2022 at 5:59
  • So does this do the trick: df.query('c == "Unqualified"')? Commented Jun 23, 2022 at 6:00
  • @jezrael do you mean if df_para = {'Y/N':[np.nan,np.nan], 'Q/U':['unqualified','unqualified']}? Commented Jun 23, 2022 at 6:07
  • @yiyangchen - Yes, exactly. Then is the output the same as for {'Y/N':[np.nan], 'Q/U':['unqualified']}? Commented Jun 23, 2022 at 6:12
  • @jezrael I updated my post, please check. Commented Jun 23, 2022 at 6:20

2 Answers

The reason is that the in operator tests the index, not the values:

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like.
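To make the quoted behavior concrete, here is a minimal sketch using a Series with the same values as the Q/U column from the question. The variable names are made up for illustration.

```python
import pandas as pd

s = pd.Series(['unqualified', '', 'unqualified'])

# `in` on a Series looks at the index labels (0, 1, 2), not the stored strings
empty_in_index = '' in s          # False: '' is not an index label
label_in_index = 0 in s           # True: 0 is an index label

# To test the values instead, check .values or compare element-wise
empty_in_values = '' in s.values  # True: '' is one of the stored values
empty_any = s.eq('').any()        # True as well
```

This is why a check such as '' not in df_para['Q/U'] is always true here, no matter what the column contains.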

# (df column, df_para column) pairs for filtering
cols = [('c','Q/U'), ('b','Y/N')]

# for each unique value in a df_para column, collect the matching rows of df
dfs = [df[df[a].str.lower().eq(x)] for a, b in cols for x in df_para[b].unique()]

# join the sub-DataFrames
df_out = pd.concat(dfs)
print(df_out)
   a  b            c
1  b  N  Unqualified
3  d  Y  Unqualified
5  f  N  Unqualified
0  a  Y    Qualified
2  c  Y    Qualified
3  d  Y  Unqualified
4  e  Y    Qualified
6  g  Y    Qualified
1  b  N  Unqualified
5  f  N  Unqualified
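
If each row of df_para is meant to act as one combined filter, with empty or NaN cells skipped, a row-wise variant might look like the sketch below. This is an illustration rather than part of the answer above. The col_map dictionary is a hypothetical helper that maps each df_para column to the df column it filters, and it would need one entry per real column.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'b': ['Y', 'N', 'Y', 'Y', 'Y', 'N', 'Y'],
    'c': ['Qualified', 'Unqualified', 'Qualified', 'Unqualified',
          'Qualified', 'Unqualified', 'Qualified'],
})
df_para = pd.DataFrame({'Y/N': ['', 'y', 'n'],
                        'Q/U': ['unqualified', '', 'unqualified']})

# Hypothetical mapping: df_para column -> df column it filters
col_map = {'Q/U': 'c', 'Y/N': 'b'}

parts = []
for _, para in df_para.iterrows():
    # Start with every row selected, then narrow down per non-empty condition
    mask = pd.Series(True, index=df.index)
    for para_col, df_col in col_map.items():
        wanted = para[para_col]
        # An empty or NaN cell means "no condition on this column"
        if pd.isna(wanted) or wanted == '':
            continue
        mask &= df[df_col].str.lower().eq(wanted)
    parts.append(df[mask])

# One block of rows per df_para row, in df_para order
df_rowwise = pd.concat(parts)

# Optional: keep each matched row only once (first occurrence wins)
df_unique = df_rowwise[~df_rowwise.index.duplicated()]
```

With the sample data this yields the index order 1, 3, 5, 0, 2, 3, 4, 6, 1, 5 for df_rowwise. Adding more filter columns only requires more entries in col_map.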

6 Comments

I see how it works, but the actual df_para has more than 20 columns, so with boolean indexing the code would be very redundant. Ideally, on every iteration, if a given cell is empty or NaN, it should move on to the next filtering condition.
@yiyangchen - So you need to test 20 columns from df_result against 20 columns from df_para?
I tried using your updated method, but the actual df has more than 100000 rows and df_para has more than 20 columns. I have updated my post, could you please check one more time? I am new to pandas, sorry about that.
@yiyangchen - Is it possible to filter first by the first column, then by the second column, and so on, like in the edited answer?
It is possible, but why are the rows with index 1, 3, 6 returned twice? I only need them once. Other than that, it is perfect :)

Try this:

import pandas as pd
import numpy as np

data = {'a':['a','b','c','d','e','f','g'],
        'b':['Y','N','Y','Y','Y','N','Y'],
        'c':['Qualified','Unqualified','Qualified','Unqualified','Qualified','Qualified','Unqualified']}
df = pd.DataFrame(data)


df_result = df[df["c"] == "Unqualified"]
print(df_result)
print(type(df_result))
