for i in range(1, len(df)):
    # Compare each row to the previous one on both key columns
    if (df.loc[i, "identification"] == df.loc[i - 1, "identification"]
            and df.loc[i, "date"] == df.loc[i - 1, "date"]):
        df.loc[i, "duplicate"] = 1
    else:
        df.loc[i, "duplicate"] = 0

This simple for loop runs very slowly on a large DataFrame.

Any suggestions?

    Please provide more specifics: what is "slow" and what is a "big size". Commented Nov 15, 2016 at 20:59

2 Answers


Try to use a vectorized approach instead of looping:

import numpy as np

# 1 where both key columns equal the previous row's values, else 0
df['duplicate'] = np.where((df.identification == df.identification.shift())
                           & (df.date == df.date.shift()),
                           1, 0)
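To illustrate on a small frame (the sample data here is made up; only the column names `identification` and `date` come from the question), the first row always gets 0 because `shift()` introduces a NaN that compares unequal:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data matching the question's column names
df = pd.DataFrame({
    "identification": ["A", "A", "B", "B"],
    "date": ["2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02"],
})

# Mark rows whose identification AND date match the immediately preceding row
df["duplicate"] = np.where(
    (df.identification == df.identification.shift())
    & (df.date == df.date.shift()),
    1, 0,
)
print(df["duplicate"].tolist())  # [0, 1, 0, 0]
```

Note this only flags consecutive repeats, so it matches the original loop's behavior on data already sorted by these columns.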

1 Comment

Great, this is really what I wanted, huge improvement in running time, thanks.

It looks like you are just checking whether values are duplicated. In that case, you can use

df.sort_values(by=['identification', 'date'], inplace=True)
df['duplicate'] = df.duplicated(subset=['identification', 'date']).astype(int)
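A quick sketch on made-up sample data (only the column names are taken from the question): `duplicated` marks every repeat after the first occurrence within each (identification, date) group, regardless of input order once sorted:

```python
import pandas as pd

# Hypothetical sample data, deliberately out of order
df = pd.DataFrame({
    "identification": ["B", "A", "A", "B"],
    "date": ["2016-01-02", "2016-01-01", "2016-01-01", "2016-01-02"],
})

df.sort_values(by=["identification", "date"], inplace=True)
# duplicated() returns False for the first occurrence, True for repeats
df["duplicate"] = df.duplicated(subset=["identification", "date"]).astype(int)
print(df["duplicate"].tolist())  # [0, 1, 0, 1]
```

Unlike the `shift()` comparison, this approach does not depend on duplicates being adjacent before the sort, which is why sorting first (or relying on already-sorted data) matters.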

1 Comment

The sorting was already done, but your suggestion does work well too, thank you.
