
I have a DataFrame (df) with distance traveled, and I have assigned a label based on certain conditions:

import pandas as pd

distance = [0, 0.0001, 0.20, 1.23, 4.0]
df = pd.DataFrame(distance, columns=["distance"])
df['label'] = 0
for i in range(0, len(df['distance'])):
    if df['distance'].values[i] <= 0.10:
        df['label'][i] = 1
    elif df['distance'].values[i] <= 0.50:
        df['label'][i] = 2
    elif df['distance'].values[i] > 0.50:
        df['label'][i] = 3

This is working fine. However, I have more than 1 million records with distance, and this for loop is taking longer than expected. Can we optimize this code to reduce the execution time?

  • I think you could shave off microseconds if the last elif... became just an else: Commented Sep 9, 2016 at 16:36
  • Presumably your second elif should be 0.10 < df['distance'].values[i] <= 0.50? I'd probably create a new DataFrame column for each condition and then merge them; slicing then broadcasting should be quicker than looping. Commented Sep 9, 2016 at 16:38
  • Two things: how does df['label'][i] = 1 not create an error, if you set df['label'] to 0? And: I don't know if you use Python 2 or Python 3, but for Python 2, replace range with xrange. Commented Sep 9, 2016 at 16:39

3 Answers

In general, you shouldn't loop over DataFrames unless it's absolutely necessary. You'll usually get much better performance using a built-in Pandas function that's already been optimized, or by using a vectorized approach.

In this case, you can use loc and Boolean indexing to do the assignments:

# Initialize as 1 (eliminate need to check the first condition).
df['label'] = 1

# Case 1: Between 0.1 and 0.5
df.loc[(df['distance'] > 0.1) & (df['distance'] <= 0.5), 'label'] = 2

# Case 2: Greater than 0.5
df.loc[df['distance'] > 0.5, 'label'] = 3
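
If you want to gauge the speedup on data of roughly your size, here is a rough benchmark sketch on synthetic data (the frame name, row count, and value range are placeholders, not taken from the question):

import time
import numpy as np
import pandas as pd

# One million random distances between 0 and 5.
big = pd.DataFrame({"distance": np.random.rand(10**6) * 5})

start = time.time()
big['label'] = 1
big.loc[(big['distance'] > 0.1) & (big['distance'] <= 0.5), 'label'] = 2
big.loc[big['distance'] > 0.5, 'label'] = 3
print("vectorized labelling took %.3f s" % (time.time() - start))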

Another option is to use pd.cut. This method is a little more specialized to the example problem in the question, while Boolean indexing is a more general approach.

# Get the low and high bins.
low, high = df['distance'].min()-1, df['distance'].max()+1

# Perform the cut.  Add one since the labels start at zero by default.
df['label'] = pd.cut(df['distance'], bins=[low, 0.1, 0.5, high], labels=False) + 1

You could also use labels=[1,2,3] in the code above and skip adding 1 to the result. This would give df['label'] a categorical dtype instead of an integer dtype, though. Depending on your use case this may or may not be important.
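
For example, a small sketch of that variant (reusing the low and high bin edges computed above; the astype(int) cast is only needed if you want plain integers back):

# Same bins, but with explicit labels; the result has categorical dtype.
df['label'] = pd.cut(df['distance'], bins=[low, 0.1, 0.5, high], labels=[1, 2, 3])

# Cast afterwards if you need an integer column instead.
df['label'] = df['label'].astype(int)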

The resulting output for either method:

   distance  label
0    0.0000      1
1    0.0001      1
2    0.2000      2
3    1.2300      3
4    4.0000      3

Use cut by assigning labels to the bins:

pd.cut(df.distance, [-np.inf, 0.1, 0.5, np.inf], labels=[1,2,3])

0    1
1    1
2    2
3    3
4    3
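
If you want to plug this into the original example end to end, a minimal sketch would be (the astype(int) cast is optional; it just converts the categorical result to plain integers):

import numpy as np
import pandas as pd

distance = [0, 0.0001, 0.20, 1.23, 4.0]
df = pd.DataFrame(distance, columns=["distance"])

# Bin the distances into (-inf, 0.1], (0.1, 0.5] and (0.5, inf).
df['label'] = pd.cut(df.distance, [-np.inf, 0.1, 0.5, np.inf], labels=[1, 2, 3]).astype(int)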


Another option: select the rows matching each distance condition with Boolean indexing and write the label you want to them.

My original version came with a warning about setting a value on a copy of a slice, but maybe someone can suggest a cleaner alternative?

df.loc[:, "label"][df.loc[:, "distance"] <= 0.1] = 1
df.loc[:, "label"][(0.1 < df.loc[:, "distance"]) & (df.loc[:, "distance"] <= 0.5)] = 2
df.loc[:, "label"][df.loc[:, "distance"] > 0.5] = 3

EDIT: New and improved, without chained indexing.
