
I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:

Suppose I have the following two DataFrames

import pandas as pd
import numpy as np

dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)

Now I wish to loop through each row of each DataFrame and multiply the vals if a particular condition is met. This code works for what I want:

ans = []

for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])

np.sum(ans)

However, clearly this is very inefficient, and in reality my DataFrames can have millions of entries, making this unusable. I am also not making use of pandas' or numpy's efficient vectorized implementations. Does anyone have any ideas how to efficiently vectorize this nested loop?

I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.

1 Comment
This can be implemented in matrix form, but if, as you said, the DataFrames have millions of rows, then the cross matrices will be millions by millions. In such cases it is always a trade-off between memory and CPU. You can vectorize the code at the cost of memory, or use itertuples for slightly more efficient code. Vectorizing is most efficient when you have lots of smaller vectors, not two huge vectors.
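
A minimal sketch of the itertuples variant mentioned in the comment above (same quadratic logic as the original loop, just avoiding repeated label-based indexing; plain tuples are used because 'in' is a reserved word and would otherwise be renamed by itertuples):

total = 0
# iterate plain tuples in an explicit column order
for in1, out1, v1 in df1[['in', 'out', 'vals']].itertuples(index=False, name=None):
    for in2, out2, v2 in df2[['in', 'out', 'vals']].itertuples(index=False, name=None):
        if in1 <= out2 and out1 >= in2:
            total += v1 * v2

print(total)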

4 Answers


You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.

Example

import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    # accumulate the vals products for every overlapping (i, j) pair;
    # the outer loop is parallelized via nb.prange
    total = 0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]:
                total += df1_vals[i] * df2_vals[j]
    return total

Testing

dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5*np.random.random(1500),
         'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

# First call has some compilation overhead
res=your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                  df2['in'].values, df2['out'].values, df2['vals'].values)

t1 = time.time()
for i in range(1000):
  res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                      df2['in'].values, df2['out'].values, df2['vals'].values)

print(time.time() - t1)

Timings

Vectorized solution by @AGN Gazer: 9.15 ms
Parallelized Numba version: 0.7 ms

1 Comment

Thanks for introducing me to numba. This solution works great and is fast enough for usage.
m1 = np.less_equal.outer(df1['in'], df2['out']) 
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())

Or, replace first three lines with this long line:

m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()

For very large problems, dask is recommended.
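
As a rough illustration of that suggestion (chunk size is a placeholder, not tuned), the same outer comparison can be expressed with dask.array so the mask is evaluated block by block instead of being materialized all at once:

import dask.array as da

chunk = 10_000  # illustrative chunk size; adjust to available memory
d1_in = da.from_array(df1['in'].values, chunks=chunk)
d1_out = da.from_array(df1['out'].values, chunks=chunk)
d1_val = da.from_array(df1['vals'].values, chunks=chunk)
d2_in = da.from_array(df2['in'].values, chunks=chunk)
d2_out = da.from_array(df2['out'].values, chunks=chunk)
d2_val = da.from_array(df2['vals'].values, chunks=chunk)

# same condition as above, but the len(df1) x len(df2) mask is
# never held in memory in one piece
m = (d1_in[:, None] <= d2_out[None, :]) & (d1_out[:, None] >= d2_in[None, :])
s = (d1_val[:, None] * d2_val[None, :] * m).sum().compute()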

Timing Tests:

Here is a timing comparison using 1000- and 1500-element arrays:

In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
     ...: df1 = pd.DataFrame(data=dict1)
     ...: 
     ...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
     ...: df2 = pd.DataFrame(data=dict2)

Author's original method (Python loops):

In [167]: def f(df1, df2):
     ...:     ans = []
     ...:     for i in range(len(df1)):
     ...:         for j in range(len(df2)):
     ...:             if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
     ...:                 ans.append(df1['vals'][i]*df2['vals'][j])
     ...:     return np.sum(ans)
     ...: 
     ...: 

In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ben.T method:

In [170]: %timeit df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'], 1); df2['ans'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Vectorized solution proposed here:

In [171]: def g(df1, df2):
     ...:     m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
     ...:     return np.outer(df1['vals'], df2['vals'])[m].sum()
     ...: 
     ...: 

In [172]: %timeit g(df1, df2)

7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2 Comments

If df1/df2 are really millions of rows long, m1 will be far too large to materialize.
@DSM Possibly, but this is a solution that vectorizes the problem (at least for smaller data). What are the alternative solutions, really? Maybe Numba to compile the OP's loop? Or try dask if it can work with very large data.
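
One possible middle ground, sketched here as an illustration rather than a full solution: keep the vectorized comparison but apply it to df1 in blocks, so peak memory is block × len(df2) instead of len(df1) × len(df2). The block size below is arbitrary:

def blockwise_sum(df1, df2, block=10_000):
    # the boolean mask is only (block x len(df2)) at any one time
    d2_in, d2_out, d2_vals = df2['in'].values, df2['out'].values, df2['vals'].values
    total = 0.0
    for start in range(0, len(df1), block):
        d1_in = df1['in'].values[start:start + block]
        d1_out = df1['out'].values[start:start + block]
        d1_vals = df1['vals'].values[start:start + block]
        m = (d1_in[:, None] <= d2_out) & (d1_out[:, None] >= d2_in)
        total += (d1_vals[:, None] * d2_vals * m).sum()
    return total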

Your original code:

471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 1 (3+ times slower):

df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()

1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 2 (2+ times slower):

ans = []
for name, row in df1.iterrows():
    _in = row['in']
    _out = row['out']
    _vals = row['vals']
    ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)

1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 3 (3+ times faster):

df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
# note: the integer column indices below assume the column order
# 'in', 'out', 'vals' (i.e. alphabetical, as older pandas produced)
for i in range(df1_vals.shape[0]):
    df2_vals = df2.values
    df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
    ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]

144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In Method 3 you can view the solution by performing:

ans[ans.nonzero()]

Out[]: array([ 50000.,  80000., 160000.,  60000.])

I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)

6 Comments

When performing timing comparisons, one needs to use large data: otherwise you'll be measuring overhead of array creation, etc.
@AGNGazer Thanks for the advice, that makes sense! One quick question whilst I have your attention, in my method 3 I assigned an array of zeros with the maximum possible shape, to then shrink it down at the end if any zeros still exist. The intention is for the numpy array to occupy that space in memory only once and use ans[i, :] = ... to fill the results. Is this 'numpythonic' at all?
I am not a specialist in "numpythonics" or "pythonics". I think code (at least the code I write at my job) must be efficient and easily readable/maintainable. Everything else does not matter to me (too much; I do enjoy elegant solutions, of course). I do not see anything wrong with pre-allocating more memory when you cannot guess exactly how much you need.
@AGNGazer That's a great attitude to have! Are there any sources you would point to in order for me to learn the efficient solutions of pandas/numpy or do you think learning from answers like these is a strong start?
To be honest, I did not use any other resources, and would not recommend any. I think trying to understand other people's solutions AND trying to provide answers to people's questions on SO (even if at the beginning it takes you longer; don't pay attention to time: the end result and "getting there" are the most important things) are the best ways to learn programming in general and packages such as pandas, numpy, etc.

One way to do it is by using apply. Create a column in df2 containing the sum of the vals in df1 that meet your criteria on in and out, multiplied by the vals of the row of df2:

df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & 
                                              (df1['out'] >= row['in'])].sum()*row['vals'],1)

then just sum this column

df2['ans'].sum()

4 Comments

I'm not sure what this changes. apply is just a loop under the hood, and the approach is still O(N^2).
@DSM Sure, but I think it's still faster than two for loops?
I think you're right that the partially vectorized comparison will be faster, but I don't think it'll help enough at the ~1M rows scale, unfort.
So I have tried this and it's still very slow even when the DataFrames have only 10000 entries. I think @DSM is correct that apply just hides a for loop under the hood.
