I am new to Python and need a bit of help with comparing data from two different dataframes.

What I'm trying to do is compare the "New" column of second_dataset (a dataframe) with the "New" column of first_dataset (a dataframe). If the value in a row of second_dataset exists in first_dataset, I would like to add a "Status" column containing the string "Yes"; otherwise it should say "No". I've copied my code below.

So far I've tried a few things but keep getting an error. Any suggestions would be helpful, please.

for row in second_dataset["New"]:
    if row in first_dataset["New"] == second_dataset["New"]:
        second_dataset["Status"] = "Yes"
    elif row != first_dataset["New"]:
        second_dataset["Status"] = "No"
    else:
        second_dataset["Status"] = "Error"
  • Check the timings in my answer; the list comprehension in the other answer is not recommended because it is slow. Commented Jun 12, 2018 at 11:30

2 Answers

I believe you need to compare the columns with isin and set the new values with numpy.where:

import numpy as np
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})
second_dataset = pd.DataFrame({'New': [1, 4, 5]})
print(first_dataset)
   New
0    5
1    6
2    7
3    8
4   10

print (second_dataset)
   New
0    1
1    4
2    5

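# True where the value in second_dataset["New"] also occurs in first_dataset["New"]
mask = second_dataset["New"].isin(first_dataset["New"])
# Map the boolean mask to the two labels
second_dataset['Status'] = np.where(mask, 'Yes', 'No')
print(second_dataset)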
   New Status
0    1     No
1    4     No
2    5    Yes

Detail:

print (mask)
0    False
1    False
2     True
Name: New, dtype: bool
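
As a side note on why a plain `in` test inside a loop does not do what the question intends: Python's `in` operator applied to a pandas Series checks the index labels, not the values, while isin compares values row by row. A minimal sketch using the same sample data as above (the names `in_index` and `in_values` are mine, purely for illustration):

```python
import pandas as pd

first_dataset = pd.DataFrame({'New': [5, 6, 7, 8, 10]})
second_dataset = pd.DataFrame({'New': [1, 4, 5]})

# `in` on a Series looks at the index labels (0..4 here), not the values
in_index = 1 in first_dataset['New']            # True: 1 is an index label
in_values = 1 in first_dataset['New'].values    # False: 1 is not a value

# isin compares values and returns one boolean per row of second_dataset
mask = second_dataset['New'].isin(first_dataset['New'])
print(in_index, in_values, mask.tolist())
```

So a loop written with `row in first_dataset["New"]` can appear to work on small integer data while actually testing something else.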

Timings:

np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})
print (first_dataset)

second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]

second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')

In [146]: %timeit second_dataset['status_column'] = ['Yes' if x in first_dataset['New'].tolist() else 'No' for x in second_dataset['New'].tolist()]
20.9 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [147]: %timeit second_dataset['Status'] = np.where(second_dataset["New"].isin(first_dataset["New"]), 'Yes', 'No')
455 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
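
The `%timeit` lines above only run inside IPython. A rough way to repeat the comparison in a plain script is the standard-library `timeit` module, as sketched below. The function names `with_list_comprehension` and `with_isin` are hypothetical wrappers I introduced, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
first_dataset = pd.DataFrame({'New': np.random.randint(100, size=500)})
second_dataset = pd.DataFrame({'New': np.random.randint(100, size=1000)})


def with_list_comprehension():
    # Same expression as the timed line above: the list built from
    # first_dataset is recreated for every element of second_dataset
    return ['Yes' if x in first_dataset['New'].tolist() else 'No'
            for x in second_dataset['New'].tolist()]


def with_isin():
    # Vectorized membership test followed by a vectorized label choice
    return np.where(second_dataset['New'].isin(first_dataset['New']), 'Yes', 'No')


# Both approaches should label every row identically
assert with_isin().tolist() == with_list_comprehension()

for func in (with_list_comprehension, with_isin):
    seconds = timeit.timeit(func, number=10) / 10
    print(f'{func.__name__}: {seconds * 1e3:.3f} ms per call')
```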

import pandas as pd

dd1 = {'New': [1, 2, 3], 'b': [4, 5, 6]}
dd2 = {'New': [1, 2, 3], 'b': [4, 5, 6]}

df1 = pd.DataFrame(dd1)
df2 = pd.DataFrame(dd2)

# Pull both "New" columns out as plain Python lists
df1_new = df1['New'].tolist()
df2_new = df2['New'].tolist()
print(df1_new)

# Label each value of df2 by whether it occurs in df1
df2_status = ['Yes' if x in df1_new else 'No' for x in df2_new]
# Assign the labels directly to df2 instead of rebuilding it from the dict
df2['status_column'] = df2_status

print(df2)
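
If a list comprehension is preferred anyway, one possible refinement (my assumption, not part of this answer) is to build a set from the first column once, so each membership test is a hash lookup rather than a scan through a list. The name `known_values` is hypothetical, the sample data is changed slightly so that one row does not match, and the sketch assumes the column has no missing values.

```python
import pandas as pd

df1 = pd.DataFrame({'New': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'New': [1, 2, 7], 'b': [4, 5, 6]})

# Build the lookup set once, outside the comprehension
known_values = set(df1['New'].tolist())

# Each `x in known_values` is a hash lookup instead of a list scan
df2['status_column'] = ['Yes' if x in known_values else 'No'
                        for x in df2['New'].tolist()]
print(df2)
```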

5 Comments

No, list comprehensions are not recommended in pandas because they are slow.
It depends on what is in the df column. If the column contains strings, list operations may be faster than applying a function to the df column. E.g. you can convert a df column of strings to a list, run pool.imap over that list, and it will be faster than df.apply on that column.
Yes, but I am not talking about apply, which loops under the hood. I am talking about vectorized functions like np.where.
With all due respect, you did not say that. You just said "no, ...", which is why I used the example as an argument against your layman's terms. Peace.
Hmm, a list comprehension can be used if there are no NaNs, and it is faster with strings. But not here, simply because better solutions exist.
