Syntactic sugar for derived variables from Pandas DataFrame columns

Question

Update: Okay, after trying to use this for a while, I think it's probably a bad idea. Please use `(lambda x: x["a"] + x["b"])(df)` if really necessary or use `df.assign(c=lambda x: x["a"] + x["b"])` (with CoW enabled for performance reasons) which supports chaining!

I've a syntactic sugar hack to make it easier to create and temporarily use derived columns from DataFrames by applying a function on the columns, and I welcome any comments! Here is the code:

import pandas as pd
from typing import Callable

def pda(df: pd.DataFrame, f: Callable, numpy: bool = True):
    return (f(*(df[col].values
                for col in f.__code__.co_varnames)) if numpy else f(
                    *(df[col] for col in f.__code__.co_varnames)))

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = pda(df, lambda a, b: a + b)
print(df)

This results in:

   a  b  c
0  1  2  3
1  2  3  5
2  3  4  7
3  4  5  9

Advantages:

Python prettifying and syntax highlighting on function code (as compared to df["c"] = df.eval("a + b"))
No need to repeat DataFrame variable name (as compared to df["c"] = df["a"] + df["b"])
Possible to create temporary numpy arrays, and probably better performance (as compared to df = df.assign(c=lambda x: x["a"] + x["b"]))
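For reference, the three alternatives named in the list above can be put side by side. This is a minimal sketch on the question's toy frame; the names `via_eval`, `via_index` and `via_assign` are made up for illustration and do not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

# String expression: parsed by pandas, so editors and linters see only a string.
via_eval = df.eval("a + b")

# Plain indexing: explicit, but repeats the DataFrame name for every column.
via_index = df["a"] + df["b"]

# assign with a callable: returns a new DataFrame and can be chained.
via_assign = df.assign(c=lambda x: x["a"] + x["b"])["c"]

print(via_index.tolist())
```

All three produce the same values; they differ in readability, tooling support, and whether a new DataFrame is created.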

So, if we had, let's say, 8 columns, we could use df['i'] = pda(df, lambda _, __, c, ___, ____, f, _____, ______: c + f) instead of df['i'] = df.c + df.f. Is that right?
– 301_Moved_Permanently, Commented Oct 9, 2023 at 11:07
@301_Moved_Permanently nope, you're supposed to just use df['i'] = pda(df, lambda c, f: c + f)
– user1537366, Commented Oct 10, 2023 at 3:23
Welcome to Code Review! Incorporating advice from an answer into the question violates the question-and-answer nature of this site. You could post improved code as a new question, as an answer, or as a link to an external site - as described in "I improved my code based on the reviews. What next?". I have rolled back the edit, so the answers make sense again.
– Toby Speight, Commented Oct 10, 2023 at 6:55
@TobySpeight got it, I just didn't want people to take my slightly incorrect version of the code to use.
– user1537366, Commented Oct 10, 2023 at 8:18

Reinderien · Accepted Answer · 2023-10-09 21:55:20Z

Starting broadly: this relies on reflection, which is not unheard of in the data analytics ecosystem (see e.g.: curve_fit's use of argspec). So it wouldn't be entirely without precedent, but it's still in a broad sense not very Pythonic (PEP20's "explicit is better than implicit"). This very much relies on magical, implicit behaviour, and for that reason alone it isn't a wonderful idea.

Python prettifying and syntax highlighting is less important than the related, but fairly different, static analysis. Your approach is only better in terms of static analysis if you jettison the lambda and write an actual function with good typehints; otherwise, it's only marginally better than eval.

"Possible to create temporary numpy arrays, and probably better performance" is dubious, and I will not place any belief in this unless I see a benchmark.

Crucially, __code__.co_varnames is wrong; read the docs:

tuple of names of arguments and local variables

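Because the tuple also contains local variables, any local defined inside the function is treated as a column name. If a column with that name exists in the dataframe, it is passed in as an extra positional argument and the call fails with a TypeError; if not, the column lookup raises a KeyError. Use inspect.signature instead.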
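The difference is easy to see on a small function. This is a minimal sketch; the function `diff` and its local variable `scaled` are invented for illustration.

```python
import inspect


def diff(c, a):
    # `scaled` is a local variable, not a parameter.
    scaled = 2 * a
    return c - scaled


# co_varnames lists the parameters first, then every local variable.
print(diff.__code__.co_varnames)  # ('c', 'a', 'scaled')

# inspect.signature reports only the declared parameters.
param_names = list(inspect.signature(diff).parameters)
print(param_names)  # ['c', 'a']
```

With the question's `pda`, passing `diff` would therefore try to look up a column named `scaled` as well.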

A much simpler technique that I think does cross the line into "worth doing, sometimes" relies on the fact that a DataFrame is already a map-like:

import pandas as pd


def add(a: pd.Series, b: pd.Series) -> pd.Series:
    c = a + b
    return c


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = add(**df)
print(df)

Thanks for pointing out that problem with __code__.co_varnames. I think it is still guaranteed to start with the argument names in the order it appeared, so slicing it works.
– user1537366, Commented Oct 10, 2023 at 3:54
Your idea about using the map-like property of the DataFrame does not work as soon as it has more columns than you need in the function, and then you will need to slice the DataFrame and this requires repeating the argument names again. ((lambda a, b: a + b)(**{"a": 10, "b": 20, "c": 30}) throws an error)
– user1537366, Commented Oct 10, 2023 at 4:01
@user1537366 that's deliberate, but if you don't like it, just add a **kwargs.
– Reinderien, Commented Oct 10, 2023 at 11:49
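The `**kwargs` suggestion from the last comment might look like the sketch below. The parameter name `_unused` and the `extra` column are assumptions made for illustration; the approach also assumes that every column label is a string that is valid as a keyword argument name.

```python
import pandas as pd


def add(a: pd.Series, b: pd.Series, **_unused: pd.Series) -> pd.Series:
    # Columns other than "a" and "b" are collected in **_unused and ignored.
    return a + b


df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "extra": [0, 0]})
df["c"] = add(**df)
print(df["c"].tolist())
```

Without the catch-all parameter, the same call raises a TypeError about the unexpected keyword argument `extra`, which is the strict behaviour described above as deliberate.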

J_H · Accepted Answer · 2023-10-09 23:37:51Z

I agree with @Reinderien.

docstring

pda lacks a docstring, and it absolutely needs one.

Consider using doctest notation at the end of it.

one function or two

def pda( ... , numpy: bool = True):

Thank you for the type hinting.

It's not clear that a "numpy" parameter is a win, here. Consider offering a pair of functions instead, perhaps pda and pda_numpy.

conditional

                ... if numpy else ...

Sandwiching an if between large expressions is not helping readability.

Prefer

    if numpy:
        return ...
    else:
        return ...

Readability might be improved if we DRY this up a bit. Consider assigning df[col].values or df[col] to a temp var, and then work with that.

(Since you're keen on automagic, perhaps use getattr to probe for a "values" attribute, and then we don't need a numpy flag? But it's possible we get a spurious "values" hit. Maybe consult isinstance?)
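Taken together, the points above could lead to something like the following. It is a sketch, not the original poster's code: the helper `_selected_columns` is a hypothetical name, `inspect.signature` is used as suggested in the other answer, and `to_numpy()` is used where the question uses `.values`.

```python
import inspect
from typing import Callable

import pandas as pd


def _selected_columns(df: pd.DataFrame, f: Callable) -> list:
    """Hypothetical helper: columns of `df` named by the parameters of `f`."""
    return [df[name] for name in inspect.signature(f).parameters]


def pda(df: pd.DataFrame, f: Callable):
    """Call `f` with the matching columns of `df` as Series.

    >>> frame = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
    >>> pda(frame, lambda a, b: a + b).tolist()
    [11, 22]
    """
    return f(*_selected_columns(df, f))


def pda_numpy(df: pd.DataFrame, f: Callable):
    """Same as `pda`, but pass NumPy arrays instead of Series.

    >>> frame = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
    >>> pda_numpy(frame, lambda a, b: a + b).tolist()
    [11, 22]
    """
    return f(*(col.to_numpy() for col in _selected_columns(df, f)))


if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

Each function now has a single return statement, and the column lookup is written only once.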

Thanks! Edited and incorporated many of your suggestions.
– user1537366, Commented Oct 10, 2023 at 3:53

user1537366 · Accepted Answer · 2023-10-12 08:19:27Z

As the original poster, I have revised the code based on the many answers as follows:

Add a docstring
Use the magic doctest for unit testing
Kept the numpy parameter for now (a separate function would probably be better)
Separate the if-else expression into an if-else block
Renamed the numpy parameter to use_numpy for clarity
Use slicing to extract the part of co_varnames that corresponds to the argument names. The docs seem to imply that this works:

co_varnames

Returns a tuple containing the names of the local variables (starting with the argument names).

Using inspect.signature instead of co_varnames causes a performance hit, so I reverted to using co_varnames.
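A quick way to check what the slice yields is sketched below; the function `f` with a keyword-only `scale` parameter is made up for illustration. For plain positional parameters the slice and `inspect.signature` agree, but `co_argcount` does not count keyword-only parameters, so the two differ once such parameters appear.

```python
import inspect


def f(c, a, *, scale=1):
    total = (c - a) * scale
    return total


code = f.__code__

# co_argcount counts positional parameters only, so the slice stops
# before the keyword-only `scale` and the local `total`.
positional = code.co_varnames[:code.co_argcount]
print(positional)  # ('c', 'a')

# inspect.signature also lists the keyword-only parameter.
all_params = tuple(inspect.signature(f).parameters)
print(all_params)  # ('c', 'a', 'scale')
```

For lambdas that take only ordinary positional arguments, as in the examples here, the two approaches return the same names.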

import pandas as pd
from typing import Callable

def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.
    
    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.

    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series

    Returns:
        resulting numpy array if `use_numpy` else resulting Series

    Example:

    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))

if __name__ == "__main__":
    import doctest
    doctest.testmod()

I also did some timing comparisons between the methods.

#!/usr/bin/env python3
import inspect
import random
from collections import defaultdict
from typing import Callable

import numpy as np
import pandas as pd


def main():
    import doctest
    doctest.testmod()
    import timeit
    df = pd.DataFrame({
        "d": np.random.random(100000),
        "a": np.random.random(100000),
        "c": np.random.random(100000),
        "b": np.random.random(100000)
    })
    tests = [
        test_pda, test_pda_series, test_pda2, test_lambda, test_eval,
        test_index, test_dot, test_assign
    ]
    timings = defaultdict(float)
    for i in range(1000):
        random.shuffle(tests)
        for test in tests:
            timings[test.__name__] += timeit.timeit("test(df)",
                                                    number=1,
                                                    globals={
                                                        "test": test,
                                                        "df": df
                                                    })
    for test_name, timing in timings.items():
        print(test_name, timing)


def test_pda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """

    df["e"] = pda(df, lambda c, a: c - a)
    return df


def test_pda_series(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda_series(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """

    df["e"] = pda(df, lambda c, a: c - a, False)
    return df


def test_pda2(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda2(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """

    df["e"] = pda2(df, lambda c, a: c - a)
    return df


def test_lambda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_lambda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """
    df["e"] = (lambda x: x["c"].values - x["a"].values)(df)
    return df


def test_eval(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_eval(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """
    df["e"] = df.eval("c - a")
    return df


def test_index(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_index(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """
    df["e"] = df["c"].values - df["a"].values
    return df


def test_dot(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_dot(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """
    df["e"] = df.c.values - df.a.values
    return df


def test_assign(df: pd.DataFrame):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_assign(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """
    return df.assign(e=lambda x: x["c"].values - x["a"].values)


def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.
    
    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.

    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series

    Returns:
        resulting numpy array if `use_numpy` else resulting Series

    Example:

    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1
    
    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))


def pda2(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    if use_numpy:
        return f(*(df[param.name].values
                   for param in inspect.signature(f).parameters.values()))
    else:
        return f(*(df[param.name]
                   for param in inspect.signature(f).parameters.values()))


if __name__ == "__main__":
    main()

The results for Python 3.11.2, Pandas 2.1.1 and NumPy 1.26.0 show that pda is, surprisingly, on par with the best of the other methods (indexing and member access). As expected, .assign performs terribly because it copies the entire DataFrame.

Timings (total seconds over 1000 runs, lower is better):

test_index 0.16944104398862692
test_assign 2.891109986925585
test_pda 0.1570397199393483
test_eval 0.8307543109549442
test_pda2 0.18781333995138993
test_lambda 0.1599503229081165
test_dot 0.16240537503472297
test_pda_series 0.2198283309226099

Maybe consider adding an arg_names = f.__code__.co_varnames[:f.__code__.co_argcount] before the if to reduce line length and ease overall comprehension.
– 301_Moved_Permanently, Commented Oct 10, 2023 at 9:30
Instead of the argcount band-aid on what is still the incorrect var name metavariable, you really should just call the better API (inspect.signature) - or, really, not do any of this.
– Reinderien, Commented Oct 10, 2023 at 14:10
@301_Moved_Permanently I did what you suggested, but I'm slightly concerned adding a new variable might spoil bytecode optimisation
– user1537366, Commented Oct 11, 2023 at 6:02
@Reinderien it is not the "incorrect var name metavariable". The docs guarantee this.
– user1537366, Commented Oct 11, 2023 at 6:04
