Pandas dataframe string formatting

Question

I have a pandas dataframe with multiple columns. My goal is to apply a complicated function to 3 columns and get a new column of values. However, I want to apply the same function to several different triplets of columns. Is there a way to use string formatting so that I don't have to hardcode the column names 5 (or more) times?

Rough sketch: Columns('A1','A2','A3','B1','B2','B3',...)

def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4 ### String format here?

do same for B1,2,3; C1,2,3 etc.

Thank you!

I'm unclear as to what you want. Please post an example of what the output should be. Do you want to calculate or present a formula? What are you talking about with string formatting?
– piRSquared, Commented Jul 10, 2017 at 16:18

piRSquared · Accepted Answer · 2017-07-10 16:20:02Z

Using @Milo's setup dataframe df

import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df.round(2))  # rounded only for display

     A1    A2    A3    B1    B2    B3    C1    C2    C3
0  0.37  0.95  0.73  0.60  0.16  0.16  0.06  0.87  0.60
1  0.71  0.02  0.97  0.83  0.21  0.18  0.18  0.30  0.52
2  0.43  0.29  0.61  0.14  0.29  0.37  0.46  0.79  0.20
3  0.51  0.59  0.05  0.61  0.17  0.07  0.95  0.97  0.81
4  0.30  0.10  0.68  0.44  0.12  0.50  0.03  0.91  0.26

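Then group the columns with groupby, using the first letter of each column header as the grouping key. Grouping along axis=1 is deprecated in recent pandas versions, so the version below transposes, groups the rows, and transposes back.

df.pow(2).T.groupby(df.columns.str[0]).sum().T.pow(.5)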

          A         B         C
0  1.256962  0.638019  1.055923
1  1.201048  0.878128  0.633695
2  0.803589  0.488905  0.929715
3  0.785843  0.634367  1.576812
4  0.755317  0.673667  0.946051
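
The grouping key does not have to be the first character. As a rough sketch, assuming column names consist of a prefix followed by digits (the question only shows single-letter prefixes, so longer ones such as 'AB12' are hypothetical), the key can be built by stripping the trailing digits. The sketch transposes before grouping instead of grouping along axis=1.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Grouping key: the column name with its trailing digits removed,
# so 'A1' -> 'A' (and a hypothetical 'AB12' would become 'AB').
keys = df.columns.str.replace(r'\d+$', '', regex=True)

# Transpose so the column labels become the row index,
# group those rows by prefix, sum, and transpose back.
norms = df.pow(2).T.groupby(keys).sum().T.pow(0.5)
print(norms)
```

This should give one column per prefix, each holding the square root of the sum of squares over that prefix's columns.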

Milo · Accepted Answer · 2017-11-13 23:29:14Z

If I understand your question correctly, you want to name your columns according to a specific scheme like "Anumber" and then apply the same operation to them.

One way you can do that is to filter for the naming scheme of the columns you want to address by using regular expressions and then use the apply method to apply your function.

Let's look at an example. I will first construct a DataFrame like so:

import pandas as pd
import numpy as np

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)

         A1        A2        A3        B1        B2        B3        C1  \
0  0.374540  0.950714  0.731994  0.598658  0.156019  0.155995  0.058084
1  0.708073  0.020584  0.969910  0.832443  0.212339  0.181825  0.183405
2  0.431945  0.291229  0.611853  0.139494  0.292145  0.366362  0.456070
3  0.514234  0.592415  0.046450  0.607545  0.170524  0.065052  0.948886
4  0.304614  0.097672  0.684233  0.440152  0.122038  0.495177  0.034389

         C2        C3
0  0.866176  0.601115
1  0.304242  0.524756
2  0.785176  0.199674
3  0.965632  0.808397
4  0.909320  0.258780

Then use the filter method in combination with regular expressions. As an example, I will square every value using a lambda, but you can use whatever function or operation you like:

print(df.filter(regex=r'A\d+').apply(lambda x: x*x))

         A1        A2        A3
0  0.140280  0.903858  0.535815
1  0.501367  0.000424  0.940725
2  0.186576  0.084814  0.374364
3  0.264437  0.350955  0.002158
4  0.092790  0.009540  0.468175

Edit (2017-07-10)

Building on the examples above, you can proceed to what you ultimately want to calculate. For example, we can calculate the Euclidean distance (from the origin) across all A columns as follows:

df.filter(regex=r'A\d+').apply(lambda x: x*x).sum(axis=1).apply(np.sqrt)

Which results in:

0    1.256962
1    1.201048
2    0.803589
3    0.785843
4    0.755317

So what we essentially computed is sqrt(A1^2 + A2^2 + A3^2 + ... + An^2) for every row.
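
To repeat the same calculation for several prefixes without copying the chain, the regular expression can be built with string formatting. This is only a sketch: `row_norm` is a hypothetical helper name, and it assumes every column of interest is named as a prefix followed by digits.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

def row_norm(dataframe, prefix):
    """Hypothetical helper: sqrt of the sum of squares over columns named <prefix><digits>."""
    # The prefix is inserted into the pattern, e.g. 'A' -> r'^A\d+$'.
    cols = dataframe.filter(regex=r'^{}\d+$'.format(prefix))
    return np.sqrt((cols ** 2).sum(axis=1))

# One result column per prefix, built in a single loop.
norms = pd.DataFrame({prefix: row_norm(df, prefix) for prefix in ['A', 'B', 'C']})
print(norms)
```

The column for 'A' should match the Series printed above.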

But since you want to apply separate transformations to separate column naming schemes, you would have to hardcode the above method chain for each scheme.

A much more elegant solution is to use pipelines. Pipelines let you define operations on your DataFrame and then combine them the way you need. Again using the example of computing the Euclidean distance, we could construct a pipeline as follows:

def filter_columns(dataframe, regex):
    """Select the columns of `dataframe` whose names match `regex`."""
    return dataframe.filter(regex=regex)

def op_on_vals(dataframe, op_vals):
    """Apply `op_vals` to every value in the columns of `dataframe`"""
    return dataframe.apply(op_vals)

def op_across_columns(dataframe, op_cols):
    """Apply `op_cols` across the columns of `dataframe`"""

    # Catch exception that would be raised if function
    # would be applied to a pandas.Series.
    try:
        return dataframe.apply(op_cols, axis=1)
    except TypeError:
        return dataframe.apply(op_cols)

For every column naming scheme you can then define the transformations to apply and the order in which they have to be applied. This can for example be done by creating a dictionary that holds the column naming schemes as keys and the arguments for the pipes as values:

pipe_dict = {r'A\d+': [(op_on_vals, np.square), (op_across_columns, np.sum), (op_across_columns, np.sqrt)],
             r'B\d+': [(op_on_vals, np.square), (op_across_columns, np.mean)],
             r'C\d+': [(op_on_vals, lambda x: x**3), (op_across_columns, np.max)]}
# First pipe: Euclidean distance
# Second pipe: Mean of squares
# Third pipe: Maximum cube

df_list = []

for scheme in pipe_dict.keys():
    df_list.append(df.pipe(filter_columns, scheme))
    for (operation, func) in pipe_dict[scheme]:
        df_list[-1] = df_list[-1].pipe(operation, func)

print(df_list[0])

0    1.256962
1    1.201048
2    0.803589
3    0.785843
4    0.755317

Getting the same result as above.

Now, this is just an example use and neither very elegant, nor computationally very efficient. It is just to demonstrate the concept of DataFrame pipelines. Taking these concepts, you can go really fancy with this - for example defining pipelines of pipelines etc.

However, taking this example you can achieve your goal of defining an arbitrary order of functions to be executed on your columns. You can now go one step further and apply one function at a time to specific columns, instead of applying functions across all columns.

For example, you can take my op_on_vals function and modify it so that it achieves what you outlined with row['A1']**2, row['A2']**3 and then use .pipe(op_across_columns, np.sum) to implement what you sketched with

def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4

This shouldn't be too difficult, so I will leave the details of this implementation to you.
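
For reference, here is one possible way to do it, using NumPy broadcasting rather than the pipeline functions above. It is a sketch only: `powered_sum` and the `_new` column suffix are hypothetical names, and it assumes each prefix has exactly the columns 1, 2 and 3.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Exponents for the columns <prefix>1, <prefix>2, <prefix>3, as in the question's sketch.
exponents = np.array([2, 3, 4])

def powered_sum(dataframe, prefix):
    """Hypothetical helper: <prefix>1**2 + <prefix>2**3 + <prefix>3**4 for every row."""
    # Column names are built with string formatting, e.g. 'A' -> ['A1', 'A2', 'A3'].
    cols = ['{}{}'.format(prefix, i) for i in (1, 2, 3)]
    # Broadcasting raises each of the three columns to its own exponent.
    return (dataframe[cols].to_numpy() ** exponents).sum(axis=1)

for prefix in ['A', 'B', 'C']:
    df[prefix + '_new'] = powered_sum(df, prefix)

print(df[['A_new', 'B_new', 'C_new']])
```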

Edit (2017-07-11)

Here is another piece of code that uses functools.partial in order to create 'function prototypes' of a power function. These can be used to variably set an exponent for the power according to the number in the column names of the DataFrame.

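This way we can use the numbers in A1, A2 etc. to calculate value**1, value**2 and so on for each value in the corresponding column. Finally, we can sum them to get something analogous to what you sketched (note that the code below uses the column number itself as the exponent, so A1 is raised to the power 1, whereas your sketch uses exponents 2, 3 and 4):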

row['A1']**2 + row['A2']**3 + row['A3']**4

You can find an excellent explanation of what functools.partial does on PyDanny's Blog. Let's look at the code:

import pandas as pd
import numpy as np
import re

from functools import partial

def power(base, exponent):
    return base ** exponent

# Create example DataFrame.
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Separate 'letter''number' strings of columns into tuples of (letter, number).
match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))

# Dictionary with 'prototype' functions for each column naming scheme.
func_dict = {'A': power, 'B': power, 'C': power}

# Initialize result columns with zeros.
for letter, _ in match:
    df[letter+'_result'] = np.zeros_like(df[letter+'1'])

# Apply functions to columns
for letter, number in match:
    col_name = ''.join([letter, number])
    the_function = partial(func_dict[letter], exponent=int(number))
    df[letter+'_result'] += df[col_name].apply(the_function)

print(df)

Output:

         A1        A2        A3        B1        B2        B3        C1  \
0  0.374540  0.950714  0.731994  0.598658  0.156019  0.155995  0.058084
1  0.708073  0.020584  0.969910  0.832443  0.212339  0.181825  0.183405
2  0.431945  0.291229  0.611853  0.139494  0.292145  0.366362  0.456070
3  0.514234  0.592415  0.046450  0.607545  0.170524  0.065052  0.948886
4  0.304614  0.097672  0.684233  0.440152  0.122038  0.495177  0.034389

         C2        C3  A_result  B_result  C_result
0  0.866176  0.601115  1.670611  0.626796  1.025551
1  0.304242  0.524756  1.620915  0.883542  0.420470
2  0.785176  0.199674  0.745815  0.274016  1.080532
3  0.965632  0.808397  0.865290  0.636899  2.409623
4  0.909320  0.258780  0.634494  0.576463  0.878582

You can replace the power functions in the func_dict with your own functions, for example one that sums the values with another value or performs some sort of fancy statistical calculations with them.

Using this in combination with the pipeline approach from my earlier edit should give you the tools to get the results that you need.
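
For readers unfamiliar with functools.partial, here is a minimal standalone illustration of what it does with the power function from the code above. The names `square` and `cube` are only illustrative.

```python
from functools import partial

def power(base, exponent):
    return base ** exponent

# partial fixes `exponent`, leaving a function of `base` alone.
square = partial(power, exponent=2)
cube = partial(power, exponent=3)

print(square(3))  # 9
print(cube(2))    # 8
```

In the loop above, the same mechanism fixes the exponent taken from the column name before the function is applied to each value.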

Thanks, this is useful. But my function would be similar to something like: sqrt(A1 **2 + A2 **2 + A3 **2). Cartesian distance of sorts. Let's say I want to take only A1/2/3 cols and produce a single value. In your example for a different triplet of cols (B1/2/3) I would have to rewrite filter condition and function as I understand. I am trying to avoid copy-pasting same code with only the column labels changed.
A nasty brute-force approach is to define a function that acts on columns rather than row by row: def func(a,b,c): return sqrt ( ( df[ str(a)+'1' ] + df[ str(b) + '1' ...] **2 + df[ str(a) + '1' ...] df['new'] = func(A,B,C) It works and saves me the time of hardcoding different permutations for ABC, ABD, BDE..., but there must be a more elegant solution. Also, I just realized that I am actually computing a dot product with different choices of vectors (x,y,z,w), so it would be nice to do this with numpy...
I have enhanced my answer according to your comment. Is this more like what you're looking for?
I have made another edit, which in combination with my first one should give you the results that you need. Since your description of what you're trying to do is very vague, I cannot give you a more precise answer to your question. In future questions please be more specific about what your inputs and desired results are, provide a minimal, complete, and verifiable example and choose a more fitting and meaningful title, that describes your problem more precisely. That way we could help each other out even better. Cheers
Thank you very much, it is indeed my second question ever ;) I will try to be more specific next time
