Identifying duplicate records on Python in Dataframes [duplicate]

Question

I am new to Python and need a bit of help with comparing data from two different dataframes.

What I'm trying to do is compare the "New" column of second_dataset (a dataframe) with the "New" column of first_dataset (also a dataframe). If the value in a row of second_dataset exists in first_dataset, I would like to add a "Status" column containing the string "Yes"; otherwise I want it to say "No". I've copied my code below.

So far I've tried a few things but keep getting an error. Any suggestions would be helpful, please.

for row in second_dataset["New"]:
    if row in first_dataset["New"] == second_dataset["New"]:
        second_dataset["Status"] = "Yes"
    elif row != first_dataset["New"]:
        second_dataset["Status"] = "No"
    else:
        second_dataset["Status"] = "Error"

Check the timings in my answer; the list comprehension in the other answer is not recommended because it is slow.
– jezrael, commented Jun 12, 2018 at 11:30

jezrael · Accepted Answer · 2018-06-12 11:29:54Z

I believe you need to compare the columns with isin and set the new values with numpy.where:

import numpy as np
import pandas as pd

first_dataset = pd.DataFrame({'New': [5,6,7,8,10]})
second_dataset = pd.DataFrame({'New': [1,4,5]})
print (first_dataset)
   New
0    5
1    6
2    7
3    8
4   10

print (second_dataset)
   New
0    1
1    4
2    5

mask = second_dataset["New"].isin(first_dataset["New"])
second_dataset['Status'] = np.where(mask, 'Yes', 'No')
print (second_dataset)
   New Status
0    1     No
1    4     No
2    5    Yes

Detail:

print (mask)
0    False
1    False
2     True
Name: New, dtype: bool
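
As a side note on why the loop in the question misbehaves: the `in` operator applied to a pandas Series tests the index labels, not the values, and assigning a string to `second_dataset["Status"]` fills the whole column at once. A minimal sketch with the same sample data (the names `label_hit`, `value_miss` and `value_hit` are only for illustration):

```python
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})

# `in` on a Series looks at the index labels (0..4 here), not the values
label_hit = 4 in first_dataset['New']     # 4 is an index label
value_miss = 10 in first_dataset['New']   # 10 is a value, not a label

# membership among the values needs .values (or isin for a whole column)
value_hit = bool(10 in first_dataset['New'].values)

print(label_hit, value_miss, value_hit)
```

This is why a vectorized check with isin is usually the simpler route for a whole column.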

Timings:

np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})
print (first_dataset)

second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]

second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')

In [146]: %timeit second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]
20.9 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [147]: %timeit second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')
455 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
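
The `%timeit` lines above need IPython. Outside of it, a rough equivalent can be sketched with the standard `timeit` module. The function names and `number=20` are my own choices for illustration, and absolute times will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})

def with_list_comprehension():
    # rebuilds the lookup list for every element, as in the timed line above
    return ['Yes' if x in first_dataset['New'].tolist() else 'No'
            for x in second_dataset['New'].tolist()]

def with_isin():
    # vectorized membership test followed by a vectorized label choice
    return np.where(second_dataset['New'].isin(first_dataset['New']), 'Yes', 'No')

# both approaches should produce the same labels
same_result = with_isin().tolist() == with_list_comprehension()

# total seconds for 20 calls of each function
t_list = timeit.timeit(with_list_comprehension, number=20)
t_isin = timeit.timeit(with_isin, number=20)
print(f"list comprehension: {t_list:.4f} s, isin + np.where: {t_isin:.4f} s")
```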

mothership · Accepted Answer · 2018-06-12 11:15:58Z

import pandas as pd
dd1 = {'New': [1,2,3], 'b':[4,5,6]}
dd2 = {'New': [1,2,3], 'b':[4,5,6]}

df1 = pd.DataFrame(dd1)
df2 = pd.DataFrame(dd2)

df1_new = df1['New'].tolist()
df2_new = df2['New'].tolist()
print(df1_new)

df2_status = ['Yes' if x in df1_new else 'No' for x in df2_new]
dd2['status_column'] = df2_status

df2 = pd.DataFrame(dd2)
print(df2)
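
If the plain-Python route is preferred, one possible refinement (an assumption on my part, not something timed in this thread) is to build a set once, so that each membership test is a hash lookup instead of a list scan, and to assign the result straight to the DataFrame. The value 7 in `df2` is made up so that a "No" actually appears:

```python
import pandas as pd

df1 = pd.DataFrame({'New': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'New': [1, 2, 7], 'b': [4, 5, 6]})  # 7 is illustrative, not in df1

# build the lookup set once; membership tests on a set are hash lookups
df1_values = set(df1['New'].tolist())

# assign directly to the DataFrame instead of rebuilding it from the dict
df2['status_column'] = ['Yes' if x in df1_values else 'No' for x in df2['New'].tolist()]
print(df2)
```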

5 Comments

jezrael Over a year ago

No, in pandas a list comprehension is not recommended because it is slow.

mothership Over a year ago

It depends on what is in the df column. If the column contains strings, then list operations may be faster than an apply function on the df column. E.g. you can convert a df column of strings to a list, then run pool.imap over that list, and it will be faster than the df.apply function on that column.

jezrael Over a year ago

Yes, but I am not talking about apply, which contains a loop under the hood. I am talking about vectorized functions like np.where.

mothership Over a year ago

With all due respect, you did not say that. You just said "no, ...", which is why I used the example as an argument against your layman's terms. Peace.

jezrael Over a year ago

Hmm, a list comprehension is possible to use if there are no NaNs, and it is faster with strings. But not here, simply because better solutions exist.
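
One way missing values can matter for the two approaches discussed in this thread, sketched under the assumption of a float column (the frames `first` and `second` are made up for illustration): a membership test on a plain list never matches NaN because NaN is not equal to itself, while isin treats NaN as matching NaN.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({'New': [1.0, np.nan]})
second = pd.DataFrame({'New': [np.nan, 2.0]})

# list comprehension: NaN != NaN, so the missing value is reported as absent
first_values = first['New'].tolist()
by_list = ['Yes' if x in first_values else 'No' for x in second['New'].tolist()]

# isin treats NaN as matching NaN
by_isin = np.where(second['New'].isin(first['New']), 'Yes', 'No').tolist()

print(by_list)
print(by_isin)
```

Which behavior is wanted depends on whether a missing value should count as a match.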
