I have two or three DataFrames that contain some of the same rows.

In [31]: df1
Out[31]: 
    member           time
0       0 2009-09-30 12:00:00
1       0 2009-09-30 18:00:00
2       0 2009-10-01 00:00:00
3       1 2009-09-30 12:00:00
4       1 2009-09-30 18:00:00
5       2 2009-09-30 12:00:00
6       3 2009-09-30 12:00:00
...

In [32]: df2
Out[32]: 
    member           time
0       0 2009-09-30 12:00:00
1       0 2009-09-30 18:00:00
3       1 2009-09-30 12:00:00
4       2 2009-09-30 12:00:00
5       2 2009-09-30 18:00:00
6       2 2009-10-01 00:00:00
...

I'd like to filter out the rows whose combination of 'member' and 'time' appears in only one of df1 and df2, and get a DataFrame containing only the rows whose 'member' and 'time' values appear in both df1 and df2, that is

In [33]: df_duplicated_1_and_2
Out[33]: 
    member           time
0       0 2009-09-30 12:00:00
1       0 2009-09-30 18:00:00
3       1 2009-09-30 12:00:00
4       2 2009-09-30 12:00:00
...

Is there an efficient and elegant way to do this?

Update: If possible, I'd like to get a filtered DataFrame rather than a new merged one, e.g.,

In [34]: df1
Out[34]: 
    member           time           value
0       0 2009-09-30 12:00:00  a
1       0 2009-09-30 18:00:00  b
2       0 2009-10-01 00:00:00  c
3       1 2009-09-30 12:00:00  d
4       1 2009-09-30 18:00:00  e
5       2 2009-09-30 12:00:00  f
6       3 2009-09-30 12:00:00  g
...

In [35]: df1_filtered_out
Out[35]: 
    member           time           value
0       0 2009-09-30 12:00:00  a
1       0 2009-09-30 18:00:00  b
3       1 2009-09-30 12:00:00  d
5       2 2009-09-30 12:00:00  f
...

and also get filtered df2.

1 Answer

Do an inner join on the member and time columns:

>>> df1.merge(df2, on=['member', 'time'], how='inner')
   member                time
0       0 2009-09-30 12:00:00
1       0 2009-09-30 18:00:00
2       1 2009-09-30 12:00:00
3       2 2009-09-30 12:00:00

This will produce a result that has only the rows that have the same member and time values in both DataFrames.

Update:

>>> df1.merge(df2[['member', 'time']])
   member                time value
0       0 2009-09-30 12:00:00     a
1       0 2009-09-30 18:00:00     b
2       1 2009-09-30 12:00:00     d
3       2 2009-09-30 12:00:00     f
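
A possible variation, in case the original row labels of df1 and df2 should survive (merge returns a new 0..n-1 index): build a boolean mask from the (member, time) pairs and index with it. This is only a minimal sketch, not something the question requires. The frames below are rebuilt from the rows shown in the question (the elided "..." rows are left out), and the names keys, pairs1, pairs2, df1_filtered and df2_filtered are made up for illustration.

```python
import pandas as pd

# Sample frames rebuilt from the visible rows of the question only.
t12, t18, t00 = pd.to_datetime(
    ["2009-09-30 12:00:00", "2009-09-30 18:00:00", "2009-10-01 00:00:00"]
)

df1 = pd.DataFrame({
    "member": [0, 0, 0, 1, 1, 2, 3],
    "time": [t12, t18, t00, t12, t18, t12, t12],
    "value": list("abcdefg"),
})
df2 = pd.DataFrame(
    {
        "member": [0, 0, 1, 2, 2, 2],
        "time": [t12, t18, t12, t12, t18, t00],
    },
    index=[0, 1, 3, 4, 5, 6],  # row labels as displayed in the question
)

keys = ["member", "time"]  # hypothetical name for the key columns

# Turn the key columns of each frame into a MultiIndex of (member, time) pairs.
pairs1 = pd.MultiIndex.from_frame(df1[keys])
pairs2 = pd.MultiIndex.from_frame(df2[keys])

# Keep only the rows whose (member, time) pair also occurs in the other frame.
# Boolean indexing returns a copy and preserves the original row labels.
df1_filtered = df1[pairs1.isin(pairs2)]
df2_filtered = df2[pairs2.isin(pairs1)]

print(df1_filtered)
print(df2_filtered)
```

With this sample data, df1_filtered keeps the rows labelled 0, 1, 3 and 5 of df1, and df2_filtered keeps the rows labelled 0, 1, 3 and 4 of df2. Unlike the merge, the mask does not repeat a row of df1 when df2 contains the same (member, time) pair more than once.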

7 Comments

Merges are 'inner' by default so the how parameter is not necessary.
@EdChum I know, but I specified the how parameter explicitly to show the OP how to change this behavior to right, left or outer if a different result is needed. But yes, this is a useful comment. +1.
Thanks for your answer and comments. Your answer is almost the same as what I'd like to do, but I'd like to get a 'filtered' DataFrame, not a 'merged' one. Could you tell me how to filter out the rows that are not duplicated in both? (Updated my question)
@Tetsuro the answer is the same. Just select out the columns from the df2 frame: df1.merge(df2[['member', 'time']])
@Tetsuro Also, since this is boolean indexing, you cannot get a view; no matter what you do, you will get a copy.