I have the following operation:

import pandas as pd
import numpy as np

def some_calc(x,y):
    x = x.set_index('Cat')
    y = y.set_index('Cat')
    y = np.sqrt(y['data_point2'])
    vec = pd.DataFrame(x['data_point1'] * y)
    grid = np.random.rand(len(x),len(x))
    result = vec.dot(vec.T).mul(grid).sum().sum()
    return result

sample_size = 100
cats = ['a','b','c','d']

df1 = pd.DataFrame({'Cat':[cats[np.random.randint(4)] for _ in range(sample_size)],
                    'data_point1':np.random.rand(sample_size),
                    'data_point2':np.random.rand(sample_size)})

df2 = df1.groupby('Cat').sum().reset_index()

I would like to run some_calc for each row of df2, using the corresponding data points from df1.

The code below works well:

df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']], 
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)

(I reset the index in df2 because I don't know how to apply across indices. I also pass Cat to some_calc alongside each data_point column so that it can serve as the index: without an index, vec.dot(vec.T) collapses the dot product into a single number, and .mul() then fails because I need the full MxM matrix rather than a float. I might be doing something wrong here though...)

I'm currently exploring how to vectorize the above so that the calculation does not slow down as sample_size grows.

I saw in previous threads that you can set raw=True so that the function receives np.ndarray inputs instead of pd.Series.

df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                                y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1, raw=True)

However, it throws an error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I tried omitting Cat from the arguments but still hit the same issue.
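For reference, the error can be reproduced in isolation. With raw=True each row reaches the function as a NumPy array rather than a Series, so a string label such as x['Cat'] is not a valid index. A minimal sketch, assuming a small hypothetical frame named demo (not part of the data above):

```python
import pandas as pd

# Hypothetical two-row frame, used only to isolate the behaviour of raw=True.
demo = pd.DataFrame({'Cat': ['a', 'b'], 'value': [1.0, 2.0]})

# Default (raw=False): each row is a pd.Series, so label lookup works.
labels = demo.apply(lambda row: row['Cat'], axis=1)

# raw=True: each row is an np.ndarray, so only positional indexing works.
values_raw = demo.apply(lambda row: row[1], axis=1, raw=True)

# Label lookup on the ndarray raises the IndexError quoted above.
try:
    demo.apply(lambda row: row['Cat'], axis=1, raw=True)
    raw_label_error = None
except IndexError as exc:
    raw_label_error = exc

print(type(raw_label_error).__name__)
```

This suggests the failure comes from the x['Cat'] lookup inside the lambda, before some_calc is ever called, which would explain why dropping Cat from the arguments makes no difference.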

Are there any code improvements or tricks I can employ that allow me to vectorize the above? Or do I have to amend some_calc?

1 Answer

I'm not sure if it's possible to vectorize your function since it's a bit complex. However, some_calc itself and how it is called can be optimized.

What

df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']], 
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)

does is essentially a manual groupby. So instead of building each group by hand with a boolean mask and applying the function to it, use groupby + apply. Simplifying the some_calc function as well, we get:

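def some_calc(df):
    # Work on plain NumPy arrays to avoid pandas index alignment
    x = df['data_point1'].values
    y = np.sqrt(df['data_point2'].values)
    vec = (x * y).reshape(-1, 1)  # column vector, shape (M, 1)
    grid = np.random.rand(len(x), len(x))
    # Outer product (M x M), weighted elementwise by grid, summed to a scalar
    result = (vec @ vec.T * grid).sum()
    return result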

apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
df2 = df2.merge(apply, left_on='Cat', right_index=True)

The final merge is just to add the results to the df2 dataframe.
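One possible further simplification, offered as a sketch rather than something benchmarked here: the grid-weighted double sum over the outer product equals the quadratic form v @ grid @ v, so the M x M outer product never has to be materialized. The names v, grid, n and rng below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
n = 5                            # hypothetical group size
v = rng.random(n)                # stands in for data_point1 * sqrt(data_point2)
grid = rng.random((n, n))        # stands in for the random weight matrix

# Form used in some_calc: build the outer product, weight it, sum everything.
outer_sum = (np.outer(v, v) * grid).sum()

# Same quantity as a quadratic form: sum_ij v_i * grid_ij * v_j.
quad_form = v @ grid @ v

print(np.isclose(outer_sum, quad_form))
```

Whether this is noticeably faster depends on the group sizes, since the grid itself is still M x M.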

Timings:

# original
20.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# above code
3.62 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
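
These figures will vary with machine and pandas version. A minimal sketch for repeating the comparison with timeit, assuming seeded sample data and a single helper calc shared by both call patterns (so it isolates the row-wise masking versus groupby difference rather than reproducing the exact numbers above):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sample_size = 100
df1 = pd.DataFrame({'Cat': rng.choice(np.array(['a', 'b', 'c', 'd']), size=sample_size),
                    'data_point1': rng.random(sample_size),
                    'data_point2': rng.random(sample_size)})

def calc(df):
    # Same arithmetic as the simplified some_calc, on NumPy arrays.
    v = df['data_point1'].to_numpy() * np.sqrt(df['data_point2'].to_numpy())
    grid = rng.random((len(v), len(v)))
    return (np.outer(v, v) * grid).sum()

def row_wise():
    # Original call pattern: one boolean mask over df1 per row of df2.
    df2 = df1.groupby('Cat').sum().reset_index()
    return df2.apply(lambda r: calc(df1[df1['Cat'] == r['Cat']]), axis=1)

def grouped():
    # groupby + apply; selecting the data columns keeps 'Cat' out of each group.
    return df1.groupby('Cat')[['data_point1', 'data_point2']].apply(calc)

runs = 20
t_row = timeit.timeit(row_wise, number=runs) / runs
t_grp = timeit.timeit(grouped, number=runs) / runs
print(f"row-wise: {t_row * 1e3:.2f} ms per call")
print(f"grouped:  {t_grp * 1e3:.2f} ms per call")
```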

3 Comments

That's a very good suggestion. However, I think I oversimplified the problem: the array grid in some_calc is also an input in some edge cases. I will amend the question to reflect this. Apologies.
@RealRageDontQuit: I think you missed adding the edit to the question? If it's regarding extra arguments to the apply, you can see the following: stackoverflow.com/questions/43483365/…
A detailed blog post about Pandas performance: tomaugspurger.github.io/modern-4-performance
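
Regarding the grid input raised in the comments, here is a minimal sketch of one way to supply it per category. It assumes a hypothetical dict named grids keyed by Cat, with a random grid as the fallback; the grid and rng parameters are additions for illustration, not part of the answer's some_calc.

```python
import numpy as np
import pandas as pd

def some_calc(df, grid=None, rng=None):
    x = df['data_point1'].to_numpy()
    y = np.sqrt(df['data_point2'].to_numpy())
    vec = (x * y).reshape(-1, 1)
    if grid is None:
        # No grid supplied for this category: fall back to a random one.
        rng = np.random.default_rng() if rng is None else rng
        grid = rng.random((len(x), len(x)))
    return (vec @ vec.T * grid).sum()

# Tiny hypothetical data set: category 'a' has two rows, 'b' has one.
df1 = pd.DataFrame({'Cat': ['a', 'a', 'b'],
                    'data_point1': [1.0, 2.0, 3.0],
                    'data_point2': [4.0, 9.0, 16.0]})

# Hypothetical edge case: only category 'a' comes with a fixed grid.
grids = {'a': np.ones((2, 2))}

# Iterating over the groups makes it easy to look up a grid by category.
results = pd.Series({cat: some_calc(group, grid=grids.get(cat))
                     for cat, group in df1.groupby('Cat')},
                    name='Apply')
print(results)
```

Each supplied grid has to match its group's size (M x M), which is why the lookup is keyed by category rather than passed as one shared argument.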
