Dynamically accessing a pandas dataframe column

Question

Consider this simple example:

import pandas as pd

df = pd.DataFrame({'one' : [1,2,3],
                   'two' : [1,0,0]})

df 
Out[9]: 
   one  two
0    1    1
1    2    0
2    3    0

I want to write a function that takes as inputs a dataframe df and a column mycol.

Now this works:

df.groupby('one').two.sum()
Out[10]: 
one
1    1
2    0
3    0
Name: two, dtype: int64

this works too:

def okidoki(df, mycol):
    return df.groupby('one')[mycol].sum()

okidoki(df, 'two')
Out[11]: 
one
1    1
2    0
3    0
Name: two, dtype: int64

but this FAILS:

def megabug(df,mycol):
    return df.groupby('one').mycol.sum()

megabug(df, 'two')
AttributeError: 'DataFrameGroupBy' object has no attribute 'mycol'

What is wrong here?

I am worried that okidoki uses some chaining that might create some subtle bugs (https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).

How can I still keep the syntax groupby('one').mycol? Can the mycol string be converted to something that might work that way? Thanks!

cs95 · Accepted Answer · 2017-08-28 14:45:21Z

4

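You pass a string as the second argument, but the dot accessor never evaluates mycol as a variable: it looks for an attribute literally named mycol, which is why the error mentions 'mycol' and not 'two'. Even if the string were substituted in, you would in effect be trying to do something like:

df.'two'

which is invalid syntax. If you're trying to access a column dynamically, you'll need to use the index notation, [...], because the dot/attribute accessor notation doesn't work for dynamic access.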
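To illustrate the difference between the two notations, here is a minimal sketch that reuses the question's df. The module-level variable mycol and the try/except wrapper are only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [1, 0, 0]})

mycol = 'two'  # illustrative variable holding the column name

# Bracket notation evaluates the variable, so this selects column 'two'.
by_bracket = df.groupby('one')[mycol].sum()

# Dot notation uses the literal identifier after the dot, so this looks
# for an attribute actually named "mycol" and raises AttributeError.
try:
    df.groupby('one').mycol
    dot_error = None
except AttributeError as exc:
    dot_error = exc

print(by_bracket.tolist())       # [1, 0, 0]
print(type(dot_error).__name__)  # AttributeError
```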

Dynamic access on its own is possible. For example, you can use getattr (but I don't recommend this; it's an anti-pattern):

In [674]: df
Out[674]: 
   one  two
0    1    1
1    2    0
2    3    0

In [675]: getattr(df, 'one')
Out[675]: 
0    1
1    2
2    3
Name: one, dtype: int64

Dynamically selecting by attribute from a groupby call can also be done, with something like this (where mycol = 'two'):

In [677]: getattr(df.groupby('one'), mycol).sum()
Out[677]: 
one
1    1
2    0
3    0
Name: two, dtype: int64

But don't do it. It is a horrid anti-pattern, and much less readable than df.groupby('one')[mycol].sum().

edited Aug 28, 2017 at 14:45

answered Aug 28, 2017 at 14:41

cs95


2 Comments

ℕʘʘḆḽḘ Over a year ago

thanks coldspeed. I have edited my question. My point is, given a string as an input, is it possible to kind of convert it to something that will work with this syntax? say notastring = magicfunction(mycol) and then df.notastring

cs95 Over a year ago

@ℕℴℴḆḽḘ Edited my answer again. It is possible, but it is a horrid anti-pattern. Don't do it.

jezrael · Accepted Answer · 2017-08-28 14:45:43Z

3

I think you need [] to select a column by its name, which is the general solution for selecting columns, because selecting by attribute has many exceptions:

You can use this access only if the index element is a valid python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.

The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.

Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.

In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
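A small sketch of two of the exceptions quoted above, using a hypothetical frame whose column names are chosen purely to trigger them ('min' collides with a method, '1' is not a valid identifier):

```python
import pandas as pd

# Hypothetical frame; the column names are picked only to show the pitfalls.
df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'min': [5, 6, 7],
                   '1': [10, 20, 30]})

# Attribute access finds the DataFrame.min method, not the column.
attr_is_method = callable(df.min)

# Bracket access returns the column in both cases.
min_col = df['min']
one_col = df['1']   # df.1 would not even parse

# The same bracket selection works on a groupby object.
grouped = df.groupby('key')['min'].sum()

print(attr_is_method)     # True
print(grouped.to_dict())  # {'a': 11, 'b': 7}
```

This is why a function that receives the column name as a string is safer with [mycol]: it does not depend on what the column happens to be called.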

def megabug(df,mycol):
    return df.groupby('one')[mycol].sum()

print(megabug(df, 'two'))

one
1    1
2    0
3    0
Name: two, dtype: int64
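On the chained-indexing worry raised in the question: as far as I can tell, that warning is about assignment through a chain such as df[a][b] = value, whereas groupby('one')[mycol] only selects and reads. A minimal sketch, assuming a hypothetical helper sum_by_group with an explicit column check (the helper and its check are illustrative, not part of pandas):

```python
import pandas as pd

def sum_by_group(df, key, mycol):
    """Sum column `mycol` within groups of `key` (hypothetical helper)."""
    if mycol not in df.columns:
        raise KeyError(f"column {mycol!r} not found")
    # Selection on the groupby object; nothing is assigned back to df.
    return df.groupby(key)[mycol].sum()

df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [1, 0, 0]})
before = df.copy()

result = sum_by_group(df, 'one', 'two')

# The call only reads from df, so the original frame is unchanged.
unchanged = df.equals(before)

print(result.tolist())  # [1, 0, 0]
print(unchanged)        # True
```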

edited Aug 28, 2017 at 14:45

answered Aug 28, 2017 at 14:39

jezrael


1 Comment

ℕʘʘḆḽḘ Over a year ago

yes jezrael, this is the okidoki function above actually :D. My question is why it is the case?
