
I'm trying to build a machine learning algorithm for my job. The data I'm using for training and testing has 17k rows and 20 columns. I tried adding a new column computed from two other columns, but the for loop I wrote is too slow (it takes about 3 seconds to run):

for i in range(0, len(model_olculeri)):
    if (model_olculeri["Bel"][i] != 0) and (model_olculeri["Basen"][i] != 0):
        sum_column = (model_olculeri["Bel"][i]) / (model_olculeri["Basen"][i])
        model_olculeri["Waist to Hip Ratio"][i] = sum_column

I've read articles about using pandas and numpy vectorization instead of for loops on pandas DataFrames, and it seems to be much faster and more efficient. How can I vectorize my for loop? Thanks a lot.

  • Yes, looping over each row is generally slow, especially if you have an operation (in this case, division) that you want to apply on an entire column. Commented Oct 25, 2021 at 13:48
  • pandas.pydata.org/pandas-docs/stable/user_guide/… Commented Oct 25, 2021 at 13:52

2 Answers


Create a boolean mask and use it for filtering:

m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m,"Basen"]

Alternative (pandas aligns the right-hand side on the index, so only the rows selected by m are assigned):

model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri["Bel"] / model_olculeri["Basen"]

Or set the new values with numpy.where (requires import numpy as np):

model_olculeri["Waist to Hip Ratio"] = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)
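To see how the mask-based assignment and numpy.where compare, here is a minimal sketch on a made-up four-row DataFrame. The data and the column names ratio_loc and ratio_where are assumptions for illustration only; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for model_olculeri (values are made up).
df = pd.DataFrame({"Bel": [70, 0, 80, 90], "Basen": [100, 95, 0, 120]})

# True only for rows where both measurements are non-zero.
m = (df["Bel"] != 0) & (df["Basen"] != 0)

# Mask-based assignment: the new column is created and rows outside the mask stay NaN.
df.loc[m, "ratio_loc"] = df.loc[m, "Bel"] / df.loc[m, "Basen"]

# np.where: the division runs on every row, then rows outside the mask are replaced by NaN.
df["ratio_where"] = np.where(m, df["Bel"] / df["Basen"], np.nan)

print(df)
```

Both columns should hold the same values here. One difference worth keeping in mind: if the target column already exists, the .loc version leaves the unmasked rows untouched, while the np.where version overwrites them with NaN.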

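Chained solution using query and pipe. The column name contains spaces, so it has to be passed to assign through dict unpacking. Note that this returns a new DataFrame containing only the rows where both columns are non-zero:

model_olculeri.query("Bel != 0 & Basen != 0").pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x.Bel / x.Basen}))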
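As a rough sketch of how the chained approach behaves, the example below runs it on a made-up DataFrame and then joins the result back so the filtered-out rows reappear with NaN. The toy data and the names result and full are assumptions for illustration; assign is called directly with a callable here, which should be equivalent to wrapping it in pipe.

```python
import pandas as pd

# Hypothetical toy data standing in for model_olculeri (values are made up).
df = pd.DataFrame({"Bel": [70, 0, 80, 90], "Basen": [100, 95, 0, 120]})

# query keeps only rows where both columns are non-zero; assign adds the ratio.
# Dict unpacking is used because the new column name contains spaces.
result = df.query("Bel != 0 and Basen != 0").assign(
    **{"Waist to Hip Ratio": lambda x: x["Bel"] / x["Basen"]}
)

# Left join on the index brings back the dropped rows, with NaN in the new column.
full = df.join(result[["Waist to Hip Ratio"]])
print(full)
```

If the rows with zeros are not needed downstream, result can be used as is; the join is only there to match the shape produced by the mask-based answers.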
