1

I have over 500,000 rows in my dataframe and a number of similar 'for' loops, which are causing my code to take over an hour to complete. Is there a more efficient way of writing the following 'for' loop so that it runs a lot faster?

col_26 = []
col_27 = []
col_28 = []


for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_26.append('Yes')
        col_27.append('No')
        col_28.append(df['A_value'][ind])
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26.append('No')
        col_27.append('Yes')
        col_28.append(df['B_value'][ind])
    else:
        col_26.append('')
        col_27.append('')
        col_28.append(float('nan'))
4
  • A for loop of 500,000 items runs in less than a second. So it is not the for loop that causes the trouble. Commented Jul 27, 2020 at 21:09
  • Likely things will be monumentally faster if done in Pandas or NumPy... Commented Jul 27, 2020 at 21:11
  • Use column operations. Commented Jul 27, 2020 at 21:14
  • Can you provide more information? More code? You might be using a ton of memory if you are creating many lists of length 500,000, in which case memory causes the slowdown and it's not a CPU problem. Commented Jul 27, 2020 at 21:15

4 Answers

1

You might want to look into the pandas iterrows() function or using apply. You can also look at this article: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
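As a rough sketch of the apply route, assuming a small made-up frame that reuses the question's column names (the classify helper is hypothetical, not part of pandas):

```python
import numpy as np
import pandas as pd

# Made-up sample data with the question's column names
df = pd.DataFrame({
    "A_factor": [1, 2, 3, 4],
    "A_value": [10, 20, 30, 40],
    "B_factor": [2, 1, 3, 5],
    "B_value": [11, 22, 33, 44],
})

def classify(row):
    """Hypothetical helper: return (col_26, col_27, col_28) for one row."""
    if row["A_factor"] > row["B_factor"]:
        return "Yes", "No", row["A_value"]
    if row["A_factor"] < row["B_factor"]:
        return "No", "Yes", row["B_value"]
    return "", "", np.nan

# result_type="expand" turns each returned tuple into three columns
result = df.apply(classify, axis=1, result_type="expand")
result.columns = ["col_26", "col_27", "col_28"]
df = df.join(result)
print(df)
```

Note that apply with axis=1 still calls a Python function once per row, so on 500,000 rows it will usually be much slower than whole-column operations.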


1

Try column operations:

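import numpy as np
import pandas as pd

data = {'A_factor': [1, 2, 3, 4, 5],
        'A_value': [10, 20, 30, 40, 50],
        'B_factor': [2, 3, 1, 2, 6],
        'B_value': [11, 22, 33, 44, 55]}
df = pd.DataFrame(data)

# Defaults cover the rows where A_factor == B_factor
df['col_26'] = ''
df['col_27'] = ''
df['col_28'] = np.nan

# Two separate masks, so that ties are not treated as "B is greater"
a_greater = df['A_factor'] > df['B_factor']
b_greater = df['A_factor'] < df['B_factor']

df.loc[a_greater, 'col_26'] = 'Yes'
df.loc[b_greater, 'col_26'] = 'No'
df.loc[a_greater, 'col_27'] = 'No'
df.loc[b_greater, 'col_27'] = 'Yes'
df.loc[a_greater, 'col_28'] = df.loc[a_greater, 'A_value']
df.loc[b_greater, 'col_28'] = df.loc[b_greater, 'B_value']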
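The same column-wise idea can also be written with np.select, which picks a value per row from a list of conditions. This is only a sketch on made-up data (one row is a tie, so the default branch is exercised); the column names follow the question:

```python
import numpy as np
import pandas as pd

# Made-up sample data; the fourth row has equal factors
df = pd.DataFrame({
    "A_factor": [1, 2, 3, 4, 5],
    "A_value": [10, 20, 30, 40, 50],
    "B_factor": [2, 3, 1, 4, 6],
    "B_value": [11, 22, 33, 44, 55],
})

a_greater = df["A_factor"] > df["B_factor"]
b_greater = df["A_factor"] < df["B_factor"]
conditions = [a_greater, b_greater]

# For each row, the first condition that is True decides the value;
# rows matching neither condition (ties) get the default.
df["col_26"] = np.select(conditions, ["Yes", "No"], default="")
df["col_27"] = np.select(conditions, ["No", "Yes"], default="")
df["col_28"] = np.select(conditions, [df["A_value"], df["B_value"]], default=np.nan)
print(df)
```

Listing the two comparisons explicitly keeps the equal-factor rows at the empty string and NaN, as in the original loop.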


0

Appending to a list inside a Python loop adds some overhead. Preallocating the list before the iteration can speed things up a little. For example,

import timeit

def f():
    # Grow the list one append at a time
    x = []
    for ii in range(500000):
        x.append(str(ii))

def f2():
    # Preallocate the list, then fill it by index
    x = [""] * 500000
    for ii in range(500000):
        x[ii] = str(ii)


timeit.timeit("f()", "from __main__ import f", number=10)
# Output: 1.6317970999989484
timeit.timeit("f2()", "from __main__ import f2", number=10)
# Output: 1.3037318000024243

Since you're already using pandas / numpy, there are ways to vectorize your operations so they don't need looping. For example:

import numpy as np

a_factor = df["A_factor"].to_numpy()
b_factor = df["B_factor"].to_numpy()
a_value = df["A_value"].to_numpy()
b_value = df["B_value"].to_numpy()

# Initialize explicitly: np.empty leaves the contents undefined
col_26 = np.full(a_factor.shape, '', dtype='U3')  # U3 => string of up to 3 characters
col_27 = np.full(a_factor.shape, '', dtype='U3')
col_28 = np.full(a_factor.shape, np.nan)  # NaN stays where the factors are equal

a_greater = a_factor > b_factor
b_greater = a_factor < b_factor

col_26[a_greater] = 'Yes'
col_26[b_greater] = 'No'

col_27[a_greater] = 'No'
col_27[b_greater] = 'Yes'

col_28[a_greater] = a_value[a_greater]
col_28[b_greater] = b_value[b_greater]
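To see how much vectorizing helps on your own machine, something like the following can be used. It is a sketch on randomly generated data (the frame size, the seed, and the two function names are assumptions for illustration); it times the question's loop against a nested np.where version and checks that both give the same result:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000  # kept small so the row-by-row loop finishes quickly
df = pd.DataFrame({
    "A_factor": rng.integers(0, 10, n),
    "A_value": rng.random(n),
    "B_factor": rng.integers(0, 10, n),
    "B_value": rng.random(n),
})

def loop_version(frame):
    """The loop from the question, wrapped in a function."""
    col_26, col_27, col_28 = [], [], []
    for ind in frame.index:
        if frame["A_factor"][ind] > frame["B_factor"][ind]:
            col_26.append("Yes")
            col_27.append("No")
            col_28.append(frame["A_value"][ind])
        elif frame["A_factor"][ind] < frame["B_factor"][ind]:
            col_26.append("No")
            col_27.append("Yes")
            col_28.append(frame["B_value"][ind])
        else:
            col_26.append("")
            col_27.append("")
            col_28.append(float("nan"))
    return col_26, col_27, col_28

def vectorized_version(frame):
    """Same logic on whole arrays, using nested np.where calls."""
    a = frame["A_factor"].to_numpy()
    b = frame["B_factor"].to_numpy()
    a_gt, b_gt = a > b, a < b
    col_26 = np.where(a_gt, "Yes", np.where(b_gt, "No", ""))
    col_27 = np.where(a_gt, "No", np.where(b_gt, "Yes", ""))
    col_28 = np.where(a_gt, frame["A_value"].to_numpy(),
                      np.where(b_gt, frame["B_value"].to_numpy(), np.nan))
    return col_26, col_27, col_28

loop_time = timeit.timeit(lambda: loop_version(df), number=1)
vec_time = timeit.timeit(lambda: vectorized_version(df), number=1)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")

# Sanity check: both versions should agree
loop_out = loop_version(df)
vec_out = vectorized_version(df)
same = (loop_out[0] == vec_out[0].tolist()
        and loop_out[1] == vec_out[1].tolist()
        and np.allclose(np.array(loop_out[2], dtype=float), vec_out[2], equal_nan=True))
print("results match:", same)
```

The exact timings depend on the machine and the pandas version, so no figures are quoted here.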

1 Comment

Thanks for taking the time to provide that example. I shall also look at vectorization. I am very new to this, but here to learn. Thanks again.
0

Calling append can make Python request more heap memory to grow the list, so appending inside a for loop may repeatedly allocate and free memory as the list gets larger. It is better to tell Python up front how many items you need.

n = len(df)
col_26 = ['Yes'] * n
col_27 = ['No'] * n
col_28 = [float('nan')] * n

# enumerate gives a list position even when df.index is not 0..n-1
for pos, ind in enumerate(df.index):
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_28[pos] = df['A_value'][ind]
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26[pos] = 'No'
        col_27[pos] = 'Yes'
        col_28[pos] = df['B_value'][ind]
    else:
        col_26[pos] = ''
        col_27[pos] = ''
