Filter dataframes based on 2 columns of another dataframe in python

Question

I have a DataFrame like this:

import pandas as pd

data = {'Name': ['Tom', 'Jack', 'nick', 'juli', 'Tom', 'nick', 'juli', 'nick', 'juli', 'Tom'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'maths', 'geo', 'science'], 'marks': [99, 98, 95, 90, 99, 98, 97, 95, 96, 98]}
df1 = pd.DataFrame(data)

df1

    Name    subject marks
0   Tom     eng     99
1   Jack    maths   98
2   nick    geo     95
3   juli    maths   90
4   Tom     science 99
5   nick    geo     98
6   juli    maths   97
7   nick    maths   95
8   juli    geo     96
9   Tom     science 98

And another DataFrame:

data2 = {'Name':['Jack', 'nick', 'Tom',  'juli', 'Tom', 'nick','nick', 'juli'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo',  'maths', 'geo']}
df2 = pd.DataFrame(data2)


df2

    Name    subject
0   Jack    eng
1   nick    maths
2   Tom     geo
3   juli    maths
4   Tom     science
5   nick    geo
6   nick    maths
7   juli    geo

I want to filter df2 based on the combinations of 'Name' and 'subject' in df1: if a particular combination of 'Name' and 'subject' appears more than once in df1 and also appears in df2, the matching rows of df2 should be returned as output.

Desired output:

pd.DataFrame({'Name': ['nick', 'juli', 'Tom'], 'subject': ['geo', 'maths', 'science']})

    Name    subject 
0   nick    geo 
1   juli    maths
2   Tom     science

Can anyone help without using the 'merge' option?

jezrael · Accepted Answer · 2020-10-05 08:34:20Z


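I believe you need to keep only the rows with duplicated values: use DataFrame.duplicated with keep=False (which flags every row of a repeated combination), chain it with the same call without this parameter (which skips the first row of each combination), and then use merge for an inner join with df2: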

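# Keep rows of df1 whose (Name, subject) pair repeats an earlier row
df11 = df1[df1.duplicated(subset=['Name','subject'], keep=False) &
           df1.duplicated(subset=['Name','subject'])]

# Inner join on the shared columns, then keep only df2's columns
df3 = df11.merge(df2, suffixes=('_',''))[df2.columns]
print(df3)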
   Name  subject
0  nick      geo
1  juli    maths
2   Tom  science
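
To make the two masks above easier to follow, here is a minimal sketch that prints which rows of df1 each one flags. The variable names (keys, all_dupes, later_dupes) are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Same sample data as df1 in the question.
df1 = pd.DataFrame({
    'Name': ['Tom', 'Jack', 'nick', 'juli', 'Tom', 'nick', 'juli', 'nick', 'juli', 'Tom'],
    'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'maths', 'geo', 'science'],
    'marks': [99, 98, 95, 90, 99, 98, 97, 95, 96, 98],
})

keys = ['Name', 'subject']

# keep=False flags every row whose (Name, subject) pair occurs more than once.
all_dupes = df1.duplicated(subset=keys, keep=False)

# The default keep='first' flags every occurrence except the first one.
later_dupes = df1.duplicated(subset=keys)

rows_all = df1.index[all_dupes].tolist()
rows_later = df1.index[later_dupes].tolist()
print(rows_all)    # [2, 3, 4, 5, 6, 9]
print(rows_later)  # [5, 6, 9]
```

On this data each repeated pair occurs exactly twice, so the combined mask leaves one row per pair. A pair occurring three times would leave two rows, so it may be worth calling drop_duplicates on the key columns before merging in that case.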

Another similar idea is to select only the columns of df2 from df1 before the merge:

cols = df2.columns
df11 = df1.loc[df1[cols].duplicated(keep=False) & df1[cols].duplicated(), cols]
df3 = df11.merge(df2)
print (df3)
   Name  subject
0  nick      geo
1  juli    maths
2   Tom  science
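
Since the question asks for a way to avoid merge, one possible alternative (a sketch, not something from the original answer) is to compare the (Name, subject) pairs directly with MultiIndex.isin. The names keys, repeated, wanted and result are made up for this example.

```python
import pandas as pd

# Same sample data as in the question.
df1 = pd.DataFrame({
    'Name': ['Tom', 'Jack', 'nick', 'juli', 'Tom', 'nick', 'juli', 'nick', 'juli', 'Tom'],
    'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'maths', 'geo', 'science'],
    'marks': [99, 98, 95, 90, 99, 98, 97, 95, 96, 98],
})
df2 = pd.DataFrame({
    'Name': ['Jack', 'nick', 'Tom', 'juli', 'Tom', 'nick', 'nick', 'juli'],
    'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'geo'],
})

keys = ['Name', 'subject']

# Unique (Name, subject) pairs that occur more than once in df1.
repeated = df1.loc[df1.duplicated(subset=keys, keep=False), keys].drop_duplicates()
wanted = list(repeated.itertuples(index=False, name=None))

# Treat each df2 row as a tuple and test membership, instead of joining.
mask = pd.MultiIndex.from_frame(df2[keys]).isin(wanted)

result = df2.loc[mask].reset_index(drop=True)
print(result)
```

Unlike the merge versions, this keeps the rows in df2's order (juli/maths, Tom/science, nick/geo here), and a pair that appears several times in df2 would be returned several times.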

edited Oct 5, 2020 at 8:34

answered Oct 5, 2020 at 7:58


2 Comments

sameer Over a year ago

This works for the particular case asked. I was looking for a generalized solution: for example, what should I do if the combination of 'Name' and 'subject' has to appear in df1 more than twice or three times, instead of more than once?

jezrael Over a year ago

@sak - Then change df1.duplicated(subset=['Name','subject'], keep=False) to df1.groupby(['Name','subject'])['Name'].transform('size').gt(2) to require more than 2 occurrences.
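
A rough sketch of that generalization, with the threshold as a parameter. The helper pairs_seen_more_than and the demo/other frames are hypothetical and use made-up data, because every repeated pair in the question's df1 occurs exactly twice.

```python
import pandas as pd

def pairs_seen_more_than(df, keys, n):
    """Return the unique key combinations that occur more than n times in df."""
    # Per-row count of how often that row's key combination occurs.
    counts = df.groupby(keys)[keys[0]].transform('size')
    return df.loc[counts.gt(n), keys].drop_duplicates().reset_index(drop=True)

# Made-up data: (nick, geo) occurs three times, (juli, maths) twice.
demo = pd.DataFrame({
    'Name': ['nick', 'juli', 'nick', 'juli', 'nick', 'Tom'],
    'subject': ['geo', 'maths', 'geo', 'maths', 'geo', 'eng'],
})
other = pd.DataFrame({
    'Name': ['juli', 'nick', 'Tom'],
    'subject': ['maths', 'geo', 'eng'],
})

keys = ['Name', 'subject']
more_than_once = pairs_seen_more_than(demo, keys, 1).merge(other)
more_than_twice = pairs_seen_more_than(demo, keys, 2).merge(other)
print(more_than_once)
print(more_than_twice)
```

Dropping duplicate pairs before the merge keeps one output row per combination regardless of how many times it occurs in the first frame.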
