0

I have a Dataframe df1 (original table) with multiple columns.

I have a filtered DataFrame df2 with columns date, agent_id, gps1, gps2 only 4 columns.

In df1, I have date, agent_id and final_gps along with other columns.

I want to filter all the data from df1 which exists in df2, I want to compare based on.

Df1.date == df2.date & df1.agent_id == df2.agent_id & (df1.final_gps == df2.gps1 or df1.final_gps == df2.gps2)

df2 sample

date       agent_id  gps1    gps2
14-02-2020  12abc   (1,2)   (7,6)
14-02-2020  12abc   (3,4)   (7,6)
14-02-2020  33bcd   (6,7)   (8,9)
20-02-2020  44hgf   (1,6)   (3,7)
20-02-2020  12abc   (3,5)   (3,1)
20-02-2020  33bcd   (3,4)   (3,6)
21-02-2020  12abc   (4,5)   (5,4)

df1 sample

date       agent_id final_gps   ….
10-02-2020  12abc   (1,2)       …
10-02-2020  33bcd   (8,9)       …
14-02-2020  12abc   (1,2)       …
14-02-2020  12abc   (7,6)       …
14-02-2020  12abc   (3,4)       …
14-02-2020  33bcd   (6,7)       …
14-02-2020  33bcd   (8,9)       …
14-02-2020  33bcd   (1,1)       …
14-02-2020  33bcd   (2,2)       …
18-02-2020  12abc   (1,2)       …
19-02-2020  44hgf   (6,7)       …
20-02-2020  12abc   (3,5)       …
20-02-2020  12abc   (3,1)       …
20-02-2020  44hgf   (1,6)       …
20-02-2020  44hgf   (3,7)       …

required output:-

date       agent_id final_gps   ….
14-02-2020  12abc   (1,2)       …
14-02-2020  12abc   (7,6)       …
14-02-2020  12abc   (3,4)       …
14-02-2020  33bcd   (6,7)       …
14-02-2020  33bcd   (8,9)       …
20-02-2020  12abc   (3,5)       …
20-02-2020  12abc   (3,1)       …
20-02-2020  44hgf   (1,6)       …
20-02-2020  44hgf   (3,7)       …

I tried this but it is giving me all the matching records which exists in df2, but I want strictly data only for those agent_id on that particular date and particular gps matching condition from df1.

df = df1[df1['date'].isin(df2['date']) & 
         df1['agent_id'].isin(df2['agent_id']) & 
        (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
1
  • Is same output of both solutions? Commented Mar 11, 2022 at 9:15

2 Answers 2

2

Use DataFrame.melt for reshape gps1 and gps2 to final_gps first, so possible merge by all 3 columns (not necessary define on), remove duplicates by all columns and last sorting:

df = (df2.melt(id_vars=['date','agent_id'],
               value_vars=['gps1','gps2'],
               value_name='final_gps')
        .drop('variable', axis=1)
        .merge(df1)
        .drop_duplicates()
        .sort_values(by=['date','agent_id'], ignore_index=True))
print (df)
         date agent_id final_gps
0  14-02-2020    12abc     (1,2)
1  14-02-2020    12abc     (3,4)
2  14-02-2020    12abc     (7,6)
3  14-02-2020    33bcd     (6,7)
4  14-02-2020    33bcd     (8,9)
5  20-02-2020    12abc     (3,5)
6  20-02-2020    12abc     (3,1)
7  20-02-2020    44hgf     (1,6)
8  20-02-2020    44hgf     (3,7)

Details:

print (df2.melt(id_vars=['date','agent_id'],
               value_vars=['gps1','gps2'],
               value_name='final_gps'))

          date agent_id variable final_gps
0   14-02-2020    12abc     gps1     (1,2)
1   14-02-2020    12abc     gps1     (3,4)
2   14-02-2020    33bcd     gps1     (6,7)
3   20-02-2020    44hgf     gps1     (1,6)
4   20-02-2020    12abc     gps1     (3,5)
5   20-02-2020    33bcd     gps1     (3,4)
6   21-02-2020    12abc     gps1     (4,5)
7   14-02-2020    12abc     gps2     (7,6)
8   14-02-2020    12abc     gps2     (7,6)
9   14-02-2020    33bcd     gps2     (8,9)
10  20-02-2020    44hgf     gps2     (3,7)
11  20-02-2020    12abc     gps2     (3,1)
12  20-02-2020    33bcd     gps2     (3,6)
13  21-02-2020    12abc     gps2     (5,4)
Sign up to request clarification or add additional context in comments.

5 Comments

Could you please confirm why you are not using column gps1 and gps2 in df2 while filtering data from df2
@pquest - because are melted to final_gps
@pquest - added ouput of melt to answer for see whats happens
Thanks..I will study about melt function..
the moment i saw “melt” i knew its jezreal
1

You could use multiple isin and chain them using & operator. Since final_gps can be either gps1 or gps2, we use | operator in brackets:

out = (df1[df1['date'].isin(df2['date']) & 
           df1['agent_id'].isin(df2['agent_id']) & 
           (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
       .reset_index(drop=True))

Output:

         date agent_id final_gps ….
0  14-02-2020    12abc    (1, 2)  …
1  14-02-2020    12abc    (7, 6)  …
2  14-02-2020    12abc    (3, 4)  …
3  14-02-2020    33bcd    (6, 7)  …
4  14-02-2020    33bcd    (8, 9)  …
5  20-02-2020    12abc    (3, 5)  …
6  20-02-2020    12abc    (3, 1)  …
7  20-02-2020    44hgf    (1, 6)  …
8  20-02-2020    44hgf    (3, 7)  …

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.