Python 'for' loop performance too slow

Question

I have over 500,000 rows in my dataframe and a number of similar 'for' loops, which are causing my code to take over an hour to complete. Is there a more efficient way of writing the following 'for' loop so that it runs a lot faster?

col_26 = []
col_27 = []
col_28 = []


for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_26.append('Yes')
        col_27.append('No')
        col_28.append(df['A_value'][ind])
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26.append('No')
        col_27.append('Yes')
        col_28.append(df['B_value'][ind])
    else:
        col_26.append('')
        col_27.append('')
        col_28.append(float('nan'))

A for loop of 500,000 items runs in less than a second, so it is not the for loop itself that causes the trouble.
– Klaus D., Commented Jul 27, 2020 at 21:09
Likely things will be monumentally faster if done in Pandas or NumPy...
– dawg, Commented Jul 27, 2020 at 21:11
Can you provide more information? More code? You might be using a ton of memory if you are creating many 500,000-length lists, and that, rather than a CPU problem, may be what creates the slowdown.
– Ian Wilson, Commented Jul 27, 2020 at 21:15
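
The comments above suggest that the loop itself is cheap and that the cost lies elsewhere. One likely candidate is the repeated df['A_factor'][ind] lookup on every iteration. A minimal sketch, assuming a small made-up frame that reuses the question's column names, which keeps the loop but pulls each column into a plain Python list once:

```python
import math

import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'A_factor': [1, 2, 3, 4, 5],
    'A_value': [10, 20, 30, 40, 50],
    'B_factor': [2, 3, 1, 4, 6],
    'B_value': [11, 22, 33, 44, 55],
})

col_26, col_27, col_28 = [], [], []

# Convert each column to a list once, then loop over plain Python values
# instead of indexing into the DataFrame on every iteration.
rows = zip(df['A_factor'].tolist(), df['B_factor'].tolist(),
           df['A_value'].tolist(), df['B_value'].tolist())
for a_f, b_f, a_v, b_v in rows:
    if a_f > b_f:
        col_26.append('Yes')
        col_27.append('No')
        col_28.append(a_v)
    elif a_f < b_f:
        col_26.append('No')
        col_27.append('Yes')
        col_28.append(b_v)
    else:
        col_26.append('')
        col_27.append('')
        col_28.append(float('nan'))
```

The branching logic is unchanged from the question; only the way each row's values are fetched differs.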

Friso Harlaar · Accepted Answer · 2020-07-27 21:12:47Z

1

You might want to look into the pandas iterrows() function or apply. This article may help as well: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
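
As a rough illustration of the apply route, here is a minimal sketch. The sample data and the helper name classify are assumptions made for the example; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'A_factor': [1, 2, 3, 4, 5],
    'A_value': [10, 20, 30, 40, 50],
    'B_factor': [2, 3, 1, 4, 6],
    'B_value': [11, 22, 33, 44, 55],
})


def classify(row):
    """Return (col_26, col_27, col_28) for a single row."""
    if row['A_factor'] > row['B_factor']:
        return 'Yes', 'No', row['A_value']
    if row['A_factor'] < row['B_factor']:
        return 'No', 'Yes', row['B_value']
    return '', '', np.nan


# axis=1 passes one row at a time; result_type='expand' turns each
# returned tuple into three separate columns.
result = df.apply(classify, axis=1, result_type='expand')
result.columns = ['col_26', 'col_27', 'col_28']
df = df.join(result)
```

Note that apply with axis=1 still calls a Python function once per row, so it is usually tidier than the explicit loop but generally much slower than whole-column operations.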

answered Jul 27, 2020 at 21:12


Pramote Kuacharoen · Accepted Answer · 2020-07-27 21:28:06Z

1

Try column operations:

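import numpy as np
import pandas as pd

data = {'A_factor': [1, 2, 3, 4, 5],
        'A_value': [10, 20, 30, 40, 50],
        'B_factor': [2, 3, 1, 2, 6],
        'B_value': [11, 22, 33, 44, 55]}
df = pd.DataFrame(data)

# Defaults cover the tie case (A_factor == B_factor), as in the question
df['col_26'] = ''
df['col_27'] = ''
df['col_28'] = np.nan

# Two separate masks, so that ties are not treated as "B is greater"
mask_a = df['A_factor'] > df['B_factor']
mask_b = df['A_factor'] < df['B_factor']

df.loc[mask_a, 'col_26'] = 'Yes'
df.loc[mask_b, 'col_26'] = 'No'
df.loc[mask_a, 'col_28'] = df.loc[mask_a, 'A_value']

df.loc[mask_b, 'col_27'] = 'Yes'
df.loc[mask_a, 'col_27'] = 'No'
df.loc[mask_b, 'col_28'] = df.loc[mask_b, 'B_value']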
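
The same column-wise idea can also be written with np.select, which picks a value per row from a list of conditions. This is a sketch under assumptions rather than part of the original answer: the sample data is made up, and the tie case falls through to the default value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'A_factor': [1, 2, 3, 4, 5],
    'A_value': [10, 20, 30, 40, 50],
    'B_factor': [2, 3, 1, 4, 6],
    'B_value': [11, 22, 33, 44, 55],
})

# Boolean arrays, one per branch of the original if/elif.
a_greater = (df['A_factor'] > df['B_factor']).to_numpy()
b_greater = (df['A_factor'] < df['B_factor']).to_numpy()
conditions = [a_greater, b_greater]

# For each row, np.select takes the choice matching the first true
# condition, or the default when neither condition holds (a tie).
df['col_26'] = np.select(conditions, ['Yes', 'No'], default='')
df['col_27'] = np.select(conditions, ['No', 'Yes'], default='')
df['col_28'] = np.select(
    conditions,
    [df['A_value'].to_numpy(), df['B_value'].to_numpy()],
    default=np.nan,
)
```

Each output column is built in a single call, which keeps the three branches of the question's loop visible side by side.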

answered Jul 27, 2020 at 21:28


pho · Accepted Answer · 2020-07-27 21:27:15Z

0

Appending to lists in Python carries some overhead. Preallocating the lists before the loop can speed things up a little. For example,

import timeit

def f():
    # Grow the list one element at a time
    x = []
    for ii in range(500000):
        x.append(str(ii))

def f2():
    # Preallocate the list, then fill it in place
    x = [""] * 500000
    for ii in range(500000):
        x[ii] = str(ii)


timeit.timeit("f()", "from __main__ import f", number=10)
# Output: 1.6317970999989484
timeit.timeit("f2()", "from __main__ import f2", number=10)
# Output: 1.3037318000024243

Since you're already using pandas / numpy, there are ways to vectorize your operations so they don't need looping. For example:

import numpy as np

a_factor = df["A_factor"].to_numpy()
b_factor = df["B_factor"].to_numpy()
a_value = df["A_value"].to_numpy()
b_value = df["B_value"].to_numpy()

# np.zeros gives empty strings for 'U3', which is the value wanted for ties
col_26 = np.zeros(a_factor.shape, dtype='U3')  # U3 => string of up to 3 characters
col_27 = np.zeros(a_factor.shape, dtype='U3')
col_28 = np.full(a_factor.shape, np.nan)  # NaN stays in place for ties

a_greater = a_factor > b_factor
b_greater = a_factor < b_factor

col_26[a_greater] = 'Yes'
col_26[b_greater] = 'No'

col_27[a_greater] = 'No'
col_27[b_greater] = 'Yes'

col_28[a_greater] = a_value[a_greater]
col_28[b_greater] = b_value[b_greater]
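
To check that a vectorized version matches the loop and to get a feel for the difference, a small harness along these lines can help. It is a sketch under assumptions: the data is random, the row count is far smaller than the question's 500,000, only col_28 is computed to keep it short, and the function names with_loop and vectorized are made up for the example. Timings will vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000  # hypothetical size, much smaller than the question's data
df = pd.DataFrame({
    'A_factor': rng.integers(0, 10, n),
    'B_factor': rng.integers(0, 10, n),
    'A_value': rng.random(n),
    'B_value': rng.random(n),
})


def with_loop(frame):
    """Row-by-row version, following the loop in the question."""
    out = []
    for ind in frame.index:
        if frame['A_factor'][ind] > frame['B_factor'][ind]:
            out.append(frame['A_value'][ind])
        elif frame['A_factor'][ind] < frame['B_factor'][ind]:
            out.append(frame['B_value'][ind])
        else:
            out.append(float('nan'))
    return np.array(out)


def vectorized(frame):
    """Whole-column version using boolean masks."""
    a = frame['A_factor'].to_numpy()
    b = frame['B_factor'].to_numpy()
    out = np.full(len(frame), np.nan)  # ties keep NaN
    a_greater = a > b
    b_greater = a < b
    out[a_greater] = frame['A_value'].to_numpy()[a_greater]
    out[b_greater] = frame['B_value'].to_numpy()[b_greater]
    return out


start = time.perf_counter()
loop_result = with_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = vectorized(df)
vec_seconds = time.perf_counter() - start

# equal_nan=True treats the NaN entries for ties as matching.
same = np.allclose(loop_result, vec_result, equal_nan=True)
print(f"loop: {loop_seconds:.3f}s, vectorized: {vec_seconds:.4f}s, same result: {same}")
```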

edited Jul 27, 2020 at 21:27

answered Jul 27, 2020 at 21:16


1 Comment

windwalker Over a year ago

thanks for taking the time to provide that example. I shall also look at vectorization, I am very new to this, but here to learn. Thanks again

Ali Fallah · Accepted Answer · 2020-07-27 21:42:37Z

0

Each append can make Python request more heap memory. Appending inside a for loop means the list is repeatedly reallocated as it grows, so it is better to tell Python up front how many items you need.

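n = len(df)
# Start from the "A is greater" values so that branch only has to set col_28
col_26 = ['Yes'] * n
col_27 = ['No'] * n
col_28 = [float('nan')] * n

# enumerate gives list positions, so this also works when df.index is not 0..n-1
for pos, ind in enumerate(df.index):
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_28[pos] = df['A_value'][ind]
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26[pos] = 'No'
        col_27[pos] = 'Yes'
        col_28[pos] = df['B_value'][ind]
    else:
        col_26[pos] = ''
        col_27[pos] = ''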

edited Jul 27, 2020 at 21:42

answered Jul 27, 2020 at 21:35
