If I understand your question correctly, you want to name your columns according to a specific scheme like "A&lt;number&gt;" and then apply the same operation to all of them.
One way to do that is to select the columns whose names match the scheme with a regular expression, using the filter method, and then use the apply method to apply your function.
Let's look at an example. I will first construct a DataFrame like so:
import pandas as pd
import numpy as np
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3
0 0.866176 0.601115
1 0.304242 0.524756
2 0.785176 0.199674
3 0.965632 0.808397
4 0.909320 0.258780
Then use the filter method in combination with a regular expression. As an example, I will square every value using a lambda, but you can use whatever function or operation you like:
print(df.filter(regex=r'A\d+').apply(lambda x: x*x))
A1 A2 A3
0 0.140280 0.903858 0.535815
1 0.501367 0.000424 0.940725
2 0.186576 0.084814 0.374364
3 0.264437 0.350955 0.002158
4 0.092790 0.009540 0.468175
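As an aside, and assuming plain numeric columns as in this example, the same squaring can be written without a lambda, since arithmetic operators broadcast element-wise over a DataFrame:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# `** 2` broadcasts over every element, equivalent to
# .apply(lambda x: x*x) but without per-column function calls.
squared = df.filter(regex=r'A\d+') ** 2
print(squared)
```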
Edit (2017-07-10)
Taking the above example, you can proceed to whatever you ultimately want to calculate. For instance, we can compute the Euclidean distance across all A-columns as follows:
df.filter(regex=r'A\d+').apply(lambda x: x*x).sum(axis=1).apply(np.sqrt)
Which results in:
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
So what we essentially computed is sqrt(A1^2 + A2^2 + A3^2 + ... + An^2) for every row.
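The same row-wise quantity can also be computed in one call with np.linalg.norm; here is a minimal sketch on the example DataFrame, wrapping the result back into a Series:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

a_cols = df.filter(regex=r'A\d+')

# Row-wise Euclidean norm: sqrt(A1^2 + A2^2 + A3^2) for every row.
dist = pd.Series(np.linalg.norm(a_cols.values, axis=1), index=df.index)
print(dist)
```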
But since you want to apply separate transformations to separate column naming schemes, you would have to hardcode a method chain like the one above for each scheme.
A much more elegant solution is to use pipelines. Pipelines basically allow you to define operations on your DataFrame and then combine them the way you need. Again using the example of computing the Euclidean distance, we can define the pipeline building blocks as follows:
def filter_columns(dataframe, regex):
    """Keep only the columns of `dataframe` matched by `regex`."""
    return dataframe.filter(regex=regex)

def op_on_vals(dataframe, op_vals):
    """Apply `op_vals` to every value in the columns of `dataframe`."""
    return dataframe.apply(op_vals)

def op_across_columns(dataframe, op_cols):
    """Apply `op_cols` across the columns of `dataframe`."""
    # Catch the TypeError that is raised when `axis=1` is passed
    # but the object has already been reduced to a pandas.Series.
    try:
        return dataframe.apply(op_cols, axis=1)
    except TypeError:
        return dataframe.apply(op_cols)
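With these helpers in place, a single transformation reads as one chain of .pipe calls. As a self-contained sanity check (repeating the helper definitions so the snippet runs on its own), the Euclidean distance from above becomes:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

def filter_columns(dataframe, regex):
    """Keep only the columns of `dataframe` matched by `regex`."""
    return dataframe.filter(regex=regex)

def op_on_vals(dataframe, op_vals):
    """Apply `op_vals` to every value in the columns of `dataframe`."""
    return dataframe.apply(op_vals)

def op_across_columns(dataframe, op_cols):
    """Apply `op_cols` across the columns of `dataframe`."""
    try:
        return dataframe.apply(op_cols, axis=1)
    except TypeError:
        return dataframe.apply(op_cols)

# One readable chain: filter -> square -> sum across columns -> sqrt.
euclidean = (df.pipe(filter_columns, r'A\d+')
               .pipe(op_on_vals, np.square)
               .pipe(op_across_columns, np.sum)
               .pipe(op_across_columns, np.sqrt))
print(euclidean)
```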
For every column naming scheme you can then define the transformations to apply and the order in which to apply them, for example in a dictionary that maps each column naming scheme to a list of (pipe, function) arguments:
pipe_dict = {r'A\d+': [(op_on_vals, np.square), (op_across_columns, np.sum), (op_across_columns, np.sqrt)],
             r'B\d+': [(op_on_vals, np.square), (op_across_columns, np.mean)],
             r'C\d+': [(op_on_vals, lambda x: x**3), (op_across_columns, np.max)]}
# First pipe: Euclidean distance
# Second pipe: Mean of squares
# Third pipe: Maximum cube
df_list = []
for scheme in pipe_dict.keys():
    df_list.append(df.pipe(filter_columns, scheme))
    for (operation, func) in pipe_dict[scheme]:
        df_list[-1] = df_list[-1].pipe(operation, func)

# In Python 3.7+ dicts keep insertion order, so df_list[0]
# holds the result of the first pipe (the Euclidean distance).
print(df_list[0])
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
Getting the same result as above.
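Since all three pipelines in this example reduce to one value per row, the resulting Series can also be collected into a single summary DataFrame, e.g. with pd.concat. The sketch below restates the three computations directly; the column labels are my own choice:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Recreate the three per-row pipeline results from above.
results = {
    'A_euclidean': np.sqrt((df.filter(regex=r'A\d+') ** 2).sum(axis=1)),
    'B_mean_sq': (df.filter(regex=r'B\d+') ** 2).mean(axis=1),
    'C_max_cube': (df.filter(regex=r'C\d+') ** 3).max(axis=1),
}
summary = pd.concat(results, axis=1)
print(summary)
```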
Now, this is just an example use and neither very elegant nor computationally efficient; it is only meant to demonstrate the concept of DataFrame pipelines. Taking these concepts, you can get really fancy, for example by defining pipelines of pipelines.
However, taking this example you can achieve your goal of defining an arbitrary order of functions to be executed on your columns. You can now go one step further and apply one function at a time to specific columns, instead of applying functions across all columns.
For example, you can take my op_on_vals function and modify it so that it achieves what you outlined with row['A1']**2, row['A2']**3 and then use .pipe(op_across_columns, np.sum) to implement what you sketched with
def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4
This shouldn't be too difficult, so I will leave the details of this implementation to you.
Edit (2017-07-11)
Here is another piece of code that uses functools.partial to create 'function prototypes' of a power function. These can be used to set the exponent of the power function according to the number in each column name of the DataFrame.
This way, we can use the numbers in A1, A2, etc. to calculate value**1, value**2, value**3 for each value in the corresponding column. Finally, we sum these columns per row to get something analogous to what you sketched with
row['A1']**2 + row['A2']**3 + row['A3']**4
You can find an excellent explanation of what functools.partial does on PyDanny's Blog. Let's look at the code:
import pandas as pd
import numpy as np
import re
from functools import partial

def power(base, exponent):
    return base ** exponent

# Create example DataFrame.
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Separate 'letter''number' column names into tuples of (letter, number).
match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))

# Dictionary with 'prototype' functions for each column naming scheme.
func_dict = {'A': power, 'B': power, 'C': power}

# Initialize result columns with zeros.
for letter, _ in match:
    df[letter + '_result'] = np.zeros_like(df[letter + '1'])

# Apply functions to columns.
for letter, number in match:
    col_name = ''.join([letter, number])
    the_function = partial(func_dict[letter], exponent=int(number))
    df[letter + '_result'] += df[col_name].apply(the_function)

print(df)
Output:
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3 A_result B_result C_result
0 0.866176 0.601115 1.670611 0.626796 1.025551
1 0.304242 0.524756 1.620915 0.883542 0.420470
2 0.785176 0.199674 0.745815 0.274016 1.080532
3 0.965632 0.808397 0.865290 0.636899 2.409623
4 0.909320 0.258780 0.634494 0.576463 0.878582
You can replace the power function in func_dict with your own functions, for example one that adds another value to each entry or performs some sort of fancy statistical calculation on them.
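For instance, keeping the same loop structure, the entry for the 'B' scheme could be swapped for a hypothetical scaled_power function (a name I made up for this sketch) that scales each base before exponentiation; only func_dict changes:

```python
import pandas as pd
import numpy as np
import re
from functools import partial

def power(base, exponent):
    return base ** exponent

def scaled_power(base, exponent, scale=2.0):
    # Hypothetical replacement: scale the base before raising it.
    return (scale * base) ** exponent

np.random.seed(42)
df = pd.DataFrame(np.random.rand(5, 9),
                  columns='A1 A2 A3 B1 B2 B3 C1 C2 C3'.split())

# Only this dictionary changes compared to the version above.
func_dict = {'A': power, 'B': scaled_power, 'C': power}

match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))
for letter, _ in match:
    df[letter + '_result'] = np.zeros_like(df[letter + '1'])
for letter, number in match:
    col_name = letter + number
    the_function = partial(func_dict[letter], exponent=int(number))
    df[letter + '_result'] += df[col_name].apply(the_function)
print(df[['A_result', 'B_result', 'C_result']])
```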
Using this in combination with the pipeline approach from my earlier edit should give you the tools to get the results that you need.