I've checked out map, apply, applymap, and combine, but can't seem to find a simple way of doing the following:

I have a dataframe with 10 columns. I need to pass three of them into a function that takes scalars and returns a scalar ...

some_func(int a, int b, int c) returns int d

I want to apply this and create a new column in the dataframe with the result.

df['d'] = some_func(a = df['a'], b = df['b'], c = df['c'])

All the solutions that I've found seem to suggest to rewrite some_func to work with Series instead of scalars, but this is not possible as it is part of another package. How do I elegantly do the above?

  • It depends on what your functions are doing, but typically you would do something like `def func(row): return row['a'] * row['b'] * row['c']` and then `df.apply(lambda row: func(row), axis=1)`. Ideally you want to write your function so that it can operate on an entire Series and is vectorised. Can you show what you are really trying to do? Commented Feb 11, 2015 at 14:48
  • If for instance your function took Series as params then you could rewrite it as `def some_func(a, b, c): return a*b*c` and call it with `df['d'] = some_func(df['a'], df['b'], df['c'])` Commented Feb 11, 2015 at 14:50
  • "some_func" is a complex function that makes a SQL call to fill the data, so I have simplified it here. I'm using df.apply as suggested. Commented Feb 11, 2015 at 16:50
  • Hello @ashishsingal, if you agree that my answer is correct, please could you select it as the answer for this question? Cheers, Tomas Commented Nov 13, 2017 at 11:01

7 Answers

Use pd.DataFrame.apply(), as below:

df['d'] = df.apply(lambda x: some_func(a = x['a'], b = x['b'], c = x['c']), axis=1)

NOTE: Because the function needs values from several columns of each row, the axis argument must be set to 1; the default is 0, which would pass each column instead (see the documentation, copied below).

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

  • 0 or ‘index’: apply function to each column
  • 1 or ‘columns’: apply function to each row
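To make the axis behaviour concrete, here is a minimal self-contained sketch (the data and some_func are illustrative): with axis=1, each row is passed to the lambda as a Series, so x['a'], x['b'], x['c'] resolve to that row's values.

```python
import pandas as pd

def some_func(a, b, c):
    return a + b + c

df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

# axis=1: each row arrives as a Series, so label-based access works
df["d"] = df.apply(lambda x: some_func(a=x["a"], b=x["b"], c=x["c"]), axis=1)
print(df["d"].tolist())  # [111, 222]
```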

For what it's worth on such an old question, I find that zipping the function arguments into tuples and then applying the function in a list comprehension is much faster than using df.apply. For example:

import numpy as np
import pandas as pd

# Setup:
df = pd.DataFrame(np.random.rand(10000, 3), columns=list("abc"))
def some_func(a, b, c):
    return a*b*c

# Using apply:
%timeit df['d'] = df.apply(lambda x: some_func(a = x['a'], b = x['b'], c = x['c']), axis=1)

222 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Using tuples + list comprehension:
%timeit df["d"] = [some_func(*a) for a in tuple(zip(df["a"], df["b"], df["c"]))]

8.07 ms ± 640 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
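A closely related fast pattern, if you prefer not to build the zipped tuples explicitly, is df.itertuples (a sketch under the same setup as above; timings will vary by machine):

```python
import numpy as np
import pandas as pd

def some_func(a, b, c):
    return a * b * c

df = pd.DataFrame(np.random.rand(10000, 3), columns=list("abc"))

# itertuples yields lightweight namedtuples; index=False skips the index field
df["d"] = [some_func(row.a, row.b, row.c) for row in df.itertuples(index=False)]
```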

3 Comments

  • Hi @toby-petty, can this method be used when the function returns 2 values, which can be assigned to new columns of the dataframe? `df[['c','d']] = [some_func(*a) for a in tuple(zip(df["a"], df["b"], df["c"]))]`
  • Hi @ML_Passion, yes it will work exactly as you put it, as long as you change some_func to return 2 values instead of 1. Actually the apply method wouldn't work for the use case of adding multiple columns at once; it would need to be applied multiple times, making it even slower, so that's another win for this method.
  • Thanks a ton @toby-petty, I have a use case for this right now in my work. I want to wrap it in a parallelize function using multiprocessing. Having some challenges but should be able to solve it.
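To make the multi-column case from the comments concrete, here is a minimal sketch with a hypothetical two-value function (some_func2, the data, and the column names are all illustrative, not from the original answer):

```python
import pandas as pd

# Hypothetical function returning two values per row
def some_func2(a, b, c):
    return a + b, b + c

df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

# The list of 2-tuples maps row-wise onto two new columns
df[["d", "e"]] = [some_func2(*args) for args in zip(df["a"], df["b"], df["c"])]
print(df[["d", "e"]].values.tolist())  # [[11, 110], [22, 220]]
```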
I use map, which is as fast as a list comprehension (and much faster than apply):

df['d'] = list(map(some_func, df['a'], df['b'], df['c']))

Example on my machine:

import numpy as np
import pandas as pd

# Setup:
df = pd.DataFrame(np.random.rand(10000, 3), columns=list("abc"))
def some_func(a, b, c):
    return a*b*c

# Using apply:
%timeit df['d'] = df.apply(lambda x: some_func(a = x['a'], b = x['b'], c = x['c']), axis=1)

130 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['d'] = list(map(some_func, df['a'], df['b'], df['c']))

3.91 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1 Comment

Surely this is the difference between element and vector processing? Perfectly accurate if 'some_func' supports vector processing. However, the OP said 'some_func' was complex (including SQL calls) and did NOT support vector processing. Am I missing something here?
I'm using the following:

df['d'] = df.apply(lambda x: some_func(a = x['a'], b = x['b'], c = x['c']), axis=1)

Seems to be working well, but if anyone else has a better solution, please let me know.


Very nice tip to use a list comprehension, as Toby Petty recommended:

df["d"] = [some_func(*a) for a in tuple(zip(df["a"], df["b"], df["c"]))]

This can be further optimized by removing the tuple instantiation:

df["d"] = [some_func(*a) for a in zip(df["a"], df["b"], df["c"])]

An even faster way to map multiple columns is to use frompyfunc from NumPy to create a vectorized version of the Python function:

import numpy as np
    
some_func_vec = np.frompyfunc(some_func, 3, 1)
df["d"] = some_func_vec(df["a"], df["b"], df["c"])
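One caveat worth noting (an addition, not from the original answer): np.frompyfunc always produces object-dtype results, so a cast back to a numeric dtype is usually wanted downstream:

```python
import numpy as np
import pandas as pd

def some_func(a, b, c):
    return a * b * c

df = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))

some_func_vec = np.frompyfunc(some_func, 3, 1)
# frompyfunc yields object dtype; cast back to float for numeric use
df["d"] = some_func_vec(df["a"], df["b"], df["c"]).astype(float)
print(df["d"].dtype)  # float64
```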


If it is a really simple function, such as one based on simple arithmetic, chances are it can be vectorized. For instance, a linear combination can be made directly from the columns:

df["d"] = w1*df["a"] + w2*df["b"] + w3*df["c"]

where w1,w2,w3 are scalar weights.
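A runnable sketch of that vectorized form, with illustrative weights and data (both are assumptions, not from the answer):

```python
import pandas as pd

# Illustrative scalar weights
w1, w2, w3 = 2, 3, 4

df = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0], "c": [100.0, 200.0]})

# The whole computation happens column-wise, with no Python-level loop
df["d"] = w1*df["a"] + w2*df["b"] + w3*df["c"]
print(df["d"].tolist())  # [432.0, 864.0]
```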


You can also use DataFrame.agg:

df['d'] = df.agg(lambda row : some_function(row.a, row.b, row.c), axis=1)

I think it is faster than df.apply, though it is worth benchmarking on your own data.

