for i in range(1, len(df)):
    # Compare each row to the previous one on both key columns
    if (df.loc[i, "identification"] == df.loc[i - 1, "identification"]
            and df.loc[i, "date"] == df.loc[i - 1, "date"]):
        df.loc[i, "duplicate"] = 1
    else:
        df.loc[i, "duplicate"] = 0

This simple for loop runs very slowly on a large DataFrame.

Any suggestions?

    Please provide more specifics: what is "slow" and what is a "big size". Commented Nov 15, 2016 at 20:59

2 Answers


Try to use a vectorized approach instead of looping:

import numpy as np

# 1 where both key columns equal the previous row's values, else 0
df['duplicate'] = np.where((df.identification == df.identification.shift())
                           & (df.date == df.date.shift()),
                           1, 0)
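To illustrate on a small frame (the sample data here is made up; only the column names `identification` and `date` come from the question), the first row always gets 0 because `shift()` introduces a NaN that compares unequal:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data matching the question's column names
df = pd.DataFrame({
    "identification": ["A", "A", "B", "B"],
    "date": ["2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02"],
})

# Mark rows whose identification AND date match the immediately preceding row
df["duplicate"] = np.where(
    (df.identification == df.identification.shift())
    & (df.date == df.date.shift()),
    1, 0,
)
print(df["duplicate"].tolist())  # [0, 1, 0, 0]
```

Note this only flags consecutive repeats, so it matches the original loop's behavior on data already sorted by these columns.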

1 Comment

Great, this is really what I wanted, huge improvement in running time, thanks.

It looks like you are just checking whether values are duplicated. In that case, you can use

df.sort_values(by=['identification', 'date'], inplace=True)
df['duplicate'] = df.duplicated(subset=['identification', 'date']).astype(int)
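A quick sketch on made-up sample data (only the column names are taken from the question): `duplicated` marks every repeat after the first occurrence within each (identification, date) group, regardless of input order once sorted:

```python
import pandas as pd

# Hypothetical sample data, deliberately out of order
df = pd.DataFrame({
    "identification": ["B", "A", "A", "B"],
    "date": ["2016-01-02", "2016-01-01", "2016-01-01", "2016-01-02"],
})

df.sort_values(by=["identification", "date"], inplace=True)
# duplicated() returns False for the first occurrence, True for repeats
df["duplicate"] = df.duplicated(subset=["identification", "date"]).astype(int)
print(df["duplicate"].tolist())  # [0, 1, 0, 1]
```

Unlike the `shift()` comparison, this approach does not depend on duplicates being adjacent before the sort, which is why sorting first (or relying on already-sorted data) matters.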

1 Comment

The sorting was already done, but your suggestion does work well too, thank you.
