
I need to do a large number of row-level operations (a few pages of code) on a table of data.

E.g. if row.Col_A == 'X': row.Col_B = 'Y'

I believe iterrows isn't appropriate for altering table values, so I've converted the table to a list of DotMap dictionaries. With this I can loop over the list and, for each dictionary (row), write the code as above, and the alterations are saved.

Is it possible to do this with the data as a DataFrame?

There is a lot of logic, and I think it's clearest written this way, so I'd prefer not to use map or apply functions.
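
For reference, a rough sketch of the round trip being described (assuming the third-party dotmap package, whose toDict method converts back to plain dicts; the DataFrame contents are just examples):

from dotmap import DotMap
import pandas as pd

df = pd.DataFrame({'Col_A': ['X', 'Q'], 'Col_B': ['?', '?']})

# table -> list of attribute-style dicts
rows = [DotMap(r) for r in df.to_dict('records')]

# pages of row-level logic, written plainly
for row in rows:
    if row.Col_A == 'X':
        row.Col_B = 'Y'

# list of dicts -> table again
df = pd.DataFrame([r.toDict() for r in rows])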

  • I like to use for i, j in zip(df['A'], df['B']): if i == 1: j = 2, etc. You can loop through multiple columns in parallel with zip (see the sketch after these comments for a version that writes the results back). Commented Jun 24, 2020 at 6:11
  • The underlying storage of a DataFrame is a collection of numpy arrays, so you can iterate over that. It is recommended not to do so for performance reasons, because Python loops are much slower than Pandas or numpy methods, but it is no worse than iterating over any other Python container (provided you access the dataframe directly and not a view over it; so you are right, iterrows should be avoided). Commented Jun 24, 2020 at 6:17
  • @SergeBallesta I agree, but there are situations where looping is unavoidable. That said, I think creating functions, apply(lambda x), pandas/numpy vectorized methods and list comprehensions can do 95% of the job. Commented Jun 24, 2020 at 6:31
  • I have 20 columns and 20 operations to apply. Can this be extended to all columns? Something like for *df.columns in df.itertuples(): Commented Jun 24, 2020 at 6:32
  • @LincolnHannah as an alternative to looping, the vectorized np.where() is one of the most powerful methods; I use it constantly. I would look into that. Commented Jun 24, 2020 at 6:33
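
As the first comment notes, zip iterates columns in parallel, but rebinding the loop variable (j = 2) never touches the DataFrame. A minimal sketch of collecting the results and assigning them back (column names and values are just examples):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1], 'B': [0, 0, 0]})

new_b = []
for a, b in zip(df['A'], df['B']):
    # compute the new value instead of rebinding the loop variable
    new_b.append(2 if a == 1 else b)

# assign the whole column back in one go
df['B'] = new_b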

3 Answers


Let's have the following example dataframe:

import pandas as pd
import numpy as np

some_data = pd.DataFrame({
    'col_a': [1, 2, 1, 2, 3, 4, 3, 4],
    'col_b': ['a', 'b', 'c', 'c', 'a', 'b', 'z', 'z']
})

We want to create a new column based on one (or more) of the existing columns' values.

In case you have only two options, I would suggest using numpy.where like this:

some_data['np_where_example'] = np.where(some_data.col_a < 3, 'less_than_3', '3_or_more')
print(some_data[['col_a', 'col_b', 'np_where_example']])
>>>
   col_a col_b np_where_example
0      1     a      less_than_3
1      2     b      less_than_3
2      1     c      less_than_3
3      2     c      less_than_3
4      3     a        3_or_more
5      4     b        3_or_more
6      3     z        3_or_more
7      4     z        3_or_more

# multiple conditions
some_data['np_where_multiple_conditions'] = np.where((some_data.col_a >= 3) & (some_data.col_b == 'z'),
                                                     'is_true',
                                                     'is_false')
print(some_data[['col_a', 'col_b', 'np_where_multiple_conditions']])
>>>
   col_a col_b np_where_multiple_conditions
0      1     a                     is_false
1      2     b                     is_false
2      1     c                     is_false
3      2     c                     is_false
4      3     a                     is_false
5      4     b                     is_false
6      3     z                      is_true
7      4     z                      is_true

In case you have many options, pandas.Series.map would be better:

some_data['map_example'] = some_data.col_b.map({
    'b': 'BBB',
    'z': 'ZZZ'
})
print(some_data[['col_a', 'col_b', 'map_example']])
>>>
   col_a col_b map_example
0      1     a         NaN
1      2     b         BBB
2      1     c         NaN
3      2     c         NaN
4      3     a         NaN
5      4     b         BBB
6      3     z         ZZZ
7      4     z         ZZZ

As you can see, with map the values for which no mapping is specified evaluate to NaN (np.where, by contrast, always fills in one of its two branches).
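
If you want a default instead of NaN for the unmapped values, one option (a small sketch) is to chain fillna onto the map:

# fall back to a default (or e.g. to some_data.col_b itself) where the mapping has no entry
some_data['map_example'] = some_data.col_b.map({'b': 'BBB', 'z': 'ZZZ'}).fillna('other')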


1 Comment

I like np.where(). I feel like it is often overlooked.

You can use the apply function with a lambda in the following way:

df['Col_B'] = df['Col_A'].apply(lambda a: 'Y' if a == 'X' else 'N')

This creates the column Col_B on the dataframe df by looking at Col_A and assigning 'Y' where Col_A is 'X' and 'N' otherwise.

If your function is a bit more complex, you can define it beforehand and pass it to apply as follows:

def yes_or_no(x):
    if x == 'X':
        return 'Y'
    else:
        return 'N'
df['Col_B'] = df['Col_A'].apply(yes_or_no)
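
If the logic needs several columns at once, a row-wise apply is also possible, though slower (a sketch; Col_C is a made-up second input column for illustration):

import pandas as pd

df = pd.DataFrame({'Col_A': ['X', 'X', 'Q'], 'Col_C': [1, -1, 1]})

def derive_col_b(row):
    # row is a Series holding one row, so any column can feed the decision
    if row['Col_A'] == 'X' and row['Col_C'] > 0:
        return 'Y'
    return 'N'

df['Col_B'] = df.apply(derive_col_b, axis=1)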



A possible way to iterate over a dataframe by rows and change column values is:

  1. make sure that there are no duplicated values in the index (if there are, just use reset_index to get an acceptable index)

  2. iterate over the index and access the individual values with at

     for ix in df.index:
         if df.at[ix, 'A'] == ...:
             df.at[ix, 'B'] = z
    

Alternatively, if you access the columns by their positions instead of their names, you can use the even more efficient iat (you can look the positions up once with get_loc):

index_col_A = df.columns.get_loc('A')
index_col_B = df.columns.get_loc('B')

for i in range(len(df)):
    if df.iat[i, index_col_A] == ...:
        df.iat[i, index_col_B] = z

As you directly access the individual elements, you avoid the overhead of iterrows creating a Series per row, and your changes are actually stored. AFAIK, it is the least bad way when you cannot use the vectorized Pandas or numpy methods.
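
Applied to the example from the question, the at version would look like this (a sketch using the question's column names and made-up data):

import pandas as pd

df = pd.DataFrame({'Col_A': ['X', 'Q'], 'Col_B': ['?', '?']})

for ix in df.index:
    if df.at[ix, 'Col_A'] == 'X':
        df.at[ix, 'Col_B'] = 'Y'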

