1

I have a DataFrame like this:

data = {'Name':['Tom', 'Jack', 'nick', 'juli', 'Tom', 'nick', 'juli','nick', 'juli','Tom'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'maths', 'geo', 'science'], 'marks':[99, 98, 95, 90, 99, 98, 97, 95, 96, 98]}
df1 = pd.DataFrame(data)

df1

 Name    subject    marks
0   Tom     eng     99
1   Jack    maths   98
2   nick    geo     95
3   juli    maths   90
4   Tom     science 99
5   nick    geo     98
6   juli    maths   97
7   nick    maths   95
8   juli    geo     96
9   Tom     science 98

another dataframe as :

data2 = {'Name':['Jack', 'nick', 'Tom',  'juli', 'Tom', 'nick','nick', 'juli'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo',  'maths', 'geo']}
df2 = pd.DataFrame(data2)


df2

    Name    subject
0   Jack    eng
1   nick    maths
2   Tom     geo
3   juli    maths
4   Tom     science
5   nick    geo
6   nick    maths
7   juli    geo

I want to filter df2 based on combination of 'Names' and 'subject' in df1. If a particular combination of 'Name' and 'subject' in df1 appears more than once and then it is matched in df2. If it matches then we get those rows from df2 as output.

Desired output:

pd.DataFrame({'Names':['Tom', 'juli', 'nick'], 'subject': ['science', 'maths', 'geo']})

    Name    subject 
0   nick    geo 
1   juli    maths
2   Tom     science

can anyone help without using 'merge' option?

1 Answer 1

1

I believe you need filter rows with duplicated values only by DataFrame.duplicated with keep=False chained without this parameter and for them first rows and then use merge for inner join:

df11 = df1[df1.duplicated(subset=['Name','subject'], keep=False) & 
           df1.duplicated(subset=['Name','subject'])]

df3 = df11.merge(df2, suffixes=('_',''))[df2.columns]
print (df3)
   Name  subject
0  nick      geo
1  juli    maths
2   Tom  science

Another similar idea is filter columns by df2 in merge:

cols = df2.columns
df11 = df1.loc[df1[cols].duplicated(keep=False) & df1[cols].duplicated(), cols]
df3 = df11.merge(df2)
print (df3)
   Name  subject
0  nick      geo
1  juli    maths
2   Tom  science
    
Sign up to request clarification or add additional context in comments.

2 Comments

this works for the particular case asked. I was looking for a generalized solution, for example, what to do if I have to look for the combination 'Name' and 'subject' in df1 appearing more than twice/thrice instead of 'once'.
@sak - Then change df1.duplicated(subset=['Name','subject'], keep=False) to df1.groupby(['Name','subject'])['Name'].transform('size').gt(2) for greater like 2

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.