
I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:

Suppose I have the following two DataFrames

import pandas as pd
import numpy as np

dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)

Now I wish to loop through each row of each DataFrame and multiply the vals if a particular condition is met. This code works for what I want:

ans = []

for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])

np.sum(ans)

However, clearly this is very inefficient, and in reality my DataFrames can have millions of entries, making this unusable. I am also not making use of pandas' or numpy's efficient vectorized implementations. Does anyone have any ideas how to efficiently vectorize this nested loop?

I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.

1 Comment
This can be implemented in matrix form, but if, as you said, the DataFrames have millions of rows, then the cross matrices will be millions by millions. In such cases it is always a trade-off between memory and CPU. You can vectorize the code at the cost of memory, or use itertuples for slightly more efficient code. Vectorizing is most efficient when you have lots of smaller vectors, not two huge vectors.
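
A minimal sketch of the itertuples variant mentioned in the comment above (same quadratic logic as the original loop, just avoiding repeated label-based indexing; plain tuples are used because 'in' is a reserved word and would otherwise be renamed by itertuples):

total = 0
# iterate plain tuples in an explicit column order
for in1, out1, v1 in df1[['in', 'out', 'vals']].itertuples(index=False, name=None):
    for in2, out2, v2 in df2[['in', 'out', 'vals']].itertuples(index=False, name=None):
        if in1 <= out2 and out1 >= in2:
            total += v1 * v2

print(total)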

4 Answers


You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.

Example

import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    # accumulate the vals products for every overlapping (i, j) pair;
    # the outer loop is parallelized via nb.prange
    total = 0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]:
                total += df1_vals[i] * df2_vals[j]
    return total

Testing

dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5*np.random.random(1500),
         'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

# First call has some compilation overhead
res=your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                  df2['in'].values, df2['out'].values, df2['vals'].values)

t1 = time.time()
for i in range(1000):
  res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                      df2['in'].values, df2['out'].values, df2['vals'].values)

print(time.time() - t1)

Timings

Vectorized solution by @AGN Gazer: 9.15 ms
Parallelized Numba version: 0.7 ms

1 Comment

Thanks for introducing me to numba. This solution works great and is fast enough for usage.
m1 = np.less_equal.outer(df1['in'], df2['out']) 
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())

Or, replace first three lines with this long line:

m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()

For very large problems, dask is recommended.
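
As a rough illustration of that suggestion (chunk size is a placeholder, not tuned), the same outer comparison can be expressed with dask.array so the mask is evaluated block by block instead of being materialized all at once:

import dask.array as da

chunk = 10_000  # illustrative chunk size; adjust to available memory
d1_in = da.from_array(df1['in'].values, chunks=chunk)
d1_out = da.from_array(df1['out'].values, chunks=chunk)
d1_val = da.from_array(df1['vals'].values, chunks=chunk)
d2_in = da.from_array(df2['in'].values, chunks=chunk)
d2_out = da.from_array(df2['out'].values, chunks=chunk)
d2_val = da.from_array(df2['vals'].values, chunks=chunk)

# same condition as above, but the len(df1) x len(df2) mask is
# never held in memory in one piece
m = (d1_in[:, None] <= d2_out[None, :]) & (d1_out[:, None] >= d2_in[None, :])
s = (d1_val[:, None] * d2_val[None, :] * m).sum().compute()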

Timing Tests:

Here is a timing comparison using 1000- and 1500-element arrays:

In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
     ...: df1 = pd.DataFrame(data=dict1)
     ...: 
     ...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
     ...: df2 = pd.DataFrame(data=dict2)

Author's original method (Python loops):

In [167]: def f(df1, df2):
     ...:     ans = []
     ...:     for i in range(len(df1)):
     ...:         for j in range(len(df2)):
     ...:             if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
     ...:                 ans.append(df1['vals'][i]*df2['vals'][j])
     ...:     return np.sum(ans)
     ...: 
     ...: 

In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ben.T method:

In [170]: %timeit df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'], 1); df2['ans'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Vectorized solution proposed here:

In [171]: def g(df1, df2):
     ...:     m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
     ...:     return np.outer(df1['vals'], df2['vals'])[m].sum()
     ...: 
     ...: 

In [172]: %timeit g(df1, df2)

7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2 Comments

If df1/df2 are really millions of rows long, m1 will be far too large to materialize.
@DSM Possibly, but this is a solution that vectorizes the problem (at least for smaller data). What are the alternative solutions, really? Maybe Numba to compile the OP's loop? Or try dask if it can work with very large data.
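
One possible middle ground, sketched here as an illustration rather than a full solution: keep the vectorized comparison but apply it to df1 in blocks, so peak memory is block × len(df2) instead of len(df1) × len(df2). The block size below is arbitrary:

def blockwise_sum(df1, df2, block=10_000):
    # the boolean mask is only (block x len(df2)) at any one time
    d2_in, d2_out, d2_vals = df2['in'].values, df2['out'].values, df2['vals'].values
    total = 0.0
    for start in range(0, len(df1), block):
        d1_in = df1['in'].values[start:start + block]
        d1_out = df1['out'].values[start:start + block]
        d1_vals = df1['vals'].values[start:start + block]
        m = (d1_in[:, None] <= d2_out) & (d1_out[:, None] >= d2_in)
        total += (d1_vals[:, None] * d2_vals * m).sum()
    return total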

Your original code:

471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 1 (3+ times slower):

df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()

1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 2 (2+ times slower):

ans = []
for name, row in df1.iterrows():
    _in = row['in']
    _out = row['out']
    _vals = row['vals']
    ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)

1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 3 (3+ times faster):

df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
# note: the integer column indices below assume the column order
# 'in', 'out', 'vals' (i.e. alphabetical, as older pandas produced)
for i in range(df1_vals.shape[0]):
    df2_vals = df2.values
    df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
    ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]

144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In Method 3 you can view the solution by performing:

ans[ans.nonzero()]

Out[]: array([ 50000.,  80000., 160000.,  60000.])

I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)

6 Comments

When performing timing comparisons, one needs to use large data: otherwise you'll be measuring overhead of array creation, etc.
@AGNGazer Thanks for the advice, that makes sense! One quick question whilst I have your attention, in my method 3 I assigned an array of zeros with the maximum possible shape, to then shrink it down at the end if any zeros still exist. The intention is for the numpy array to occupy that space in memory only once and use ans[i, :] = ... to fill the results. Is this 'numpythonic' at all?
I am not a specialist in "numpythonics" or "pythonics". I think code (at least the code I write at my job) must be efficient and easily readable/maintainable. Everything else does not matter to me (too much; I do enjoy elegant solutions, of course). I do not see anything wrong with pre-allocating more memory when you cannot guess exactly how much you need.
@AGNGazer That's a great attitude to have! Are there any sources you would point to in order for me to learn the efficient solutions of pandas/numpy or do you think learning from answers like these is a strong start?
To be honest, I did not use any other resources, and would not recommend any. I think trying to understand other people's solutions AND trying to provide answers to people's questions on SO (even if at the beginning it takes you longer; don't pay attention to time: the end result and "getting there" are the most important things) are the best ways to learn programming in general and packages such as pandas, numpy, etc.

One way to do it is by using apply. Create a column in df2 containing the sum of the vals in df1 that meet your criteria on in and out, multiplied by the vals of the row of df2:

df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & 
                                              (df1['out'] >= row['in'])].sum()*row['vals'],1)

then just sum this column

df2['ans'].sum()

4 Comments

I'm not sure what this changes. apply is just a loop under the hood, and the approach is still O(N^2).
@DSM Sure, but I think it's still faster than two for loops?
I think you're right that the partially vectorized comparison will be faster, but I don't think it'll help enough at the ~1M rows scale, unfort.
So I have tried this and it's still very slow even when the DataFrames have only 10000 entries. I think @DSM is correct that apply just hides a for loop under the hood.
