
I have a DataFrame (df) with distance traveled, and I have assigned a label based on certain conditions:

import pandas as pd

distance = [0, 0.0001, 0.20, 1.23, 4.0]
df = pd.DataFrame(distance, columns=["distance"])
df['label'] = 0
for i in range(0, len(df['distance'])):
    if df['distance'].values[i] <= 0.10:
        df['label'][i] = 1
    elif df['distance'].values[i] <= 0.50:
        df['label'][i] = 2
    elif df['distance'].values[i] > 0.50:
        df['label'][i] = 3

This is working fine. However, I have more than 1 million records with distance, and this for loop is taking longer than expected. Can we optimize this code to reduce the execution time?

  • I think you could shave off microseconds if the last elif... became just an else: Commented Sep 9, 2016 at 16:36
  • Presumably your second elif should be 0.10 < df['distance'].values[i] <= 0.50? I'd probably create a new DataFrame column for each condition and then merge them; slicing then broadcasting should be quicker than looping. Commented Sep 9, 2016 at 16:38
  • Two things: how does df['label'][i] = 1 not create an error, if you set df['label'] to 0? And: I don't know if you use Python 2 or Python 3, but for Python 2, replace range with xrange. Commented Sep 9, 2016 at 16:39

3 Answers

In general, you shouldn't loop over DataFrames unless it's absolutely necessary. You'll usually get much better performance using a built-in Pandas function that's already been optimized, or by using a vectorized approach.

In this case, you can use loc and Boolean indexing to do the assignments:

# Initialize as 1 (eliminate need to check the first condition).
df['label'] = 1

# Case 1: Between 0.1 and 0.5
df.loc[(df['distance'] > 0.1) & (df['distance'] <= 0.5), 'label'] = 2

# Case 2: Greater than 0.5
df.loc[df['distance'] > 0.5, 'label'] = 3
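
If you want to gauge the speedup on data of roughly your size, here is a rough benchmark sketch on synthetic data (the frame name, row count, and value range are placeholders, not taken from the question):

import time
import numpy as np
import pandas as pd

# One million random distances between 0 and 5.
big = pd.DataFrame({"distance": np.random.rand(10**6) * 5})

start = time.time()
big['label'] = 1
big.loc[(big['distance'] > 0.1) & (big['distance'] <= 0.5), 'label'] = 2
big.loc[big['distance'] > 0.5, 'label'] = 3
print("vectorized labelling took %.3f s" % (time.time() - start))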

Another option is to use pd.cut. This method is a little more specialized to the example problem in the question, while Boolean indexing is a more general approach.

# Get the low and high bins.
low, high = df['distance'].min()-1, df['distance'].max()+1

# Perform the cut.  Add one since the labels start at zero by default.
df['label'] = pd.cut(df['distance'], bins=[low, 0.1, 0.5, high], labels=False) + 1

You could also use labels=[1,2,3] in the code above and skip adding 1 to the result. This would give df['label'] a categorical dtype instead of an integer dtype, though. Depending on your use case this may or may not be important.
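
For example, a small sketch of that variant (reusing the low and high bin edges computed above; the astype(int) cast is only needed if you want plain integers back):

# Same bins, but with explicit labels; the result has categorical dtype.
df['label'] = pd.cut(df['distance'], bins=[low, 0.1, 0.5, high], labels=[1, 2, 3])

# Cast afterwards if you need an integer column instead.
df['label'] = df['label'].astype(int)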

The resulting output for either method:

   distance  label
0    0.0000      1
1    0.0001      1
2    0.2000      2
3    1.2300      3
4    4.0000      3

Use cut by assigning labels to the bins:

pd.cut(df.distance, [-np.inf, 0.1, 0.5, np.inf], labels=[1,2,3])

0    1
1    1
2    2
3    3
4    3
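
If you want to plug this into the original example end to end, a minimal sketch would be (the astype(int) cast is optional; it just converts the categorical result to plain integers):

import numpy as np
import pandas as pd

distance = [0, 0.0001, 0.20, 1.23, 4.0]
df = pd.DataFrame(distance, columns=["distance"])

# Bin the distances into (-inf, 0.1], (0.1, 0.5] and (0.5, inf).
df['label'] = pd.cut(df.distance, [-np.inf, 0.1, 0.5, np.inf], labels=[1, 2, 3]).astype(int)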


Another option: select the rows matching each distance condition with Boolean indexing and write the label you want to them.

My original version came with a warning about setting a value on a copy of a slice, but maybe someone can suggest a cleaner alternative?

df.loc[:, "label"][df.loc[:, "distance"] <= 0.1] = 1
df.loc[:, "label"][(0.1 < df.loc[:, "distance"]) & (df.loc[:, "distance"] <= 0.5)] = 2
df.loc[:, "label"][df.loc[:, "distance"] > 0.5] = 3

EDIT: New and improved, without chained indexing.
