
I need to do a large number of row-level operations (a few pages of code) on a table of data.

E.g. if row.Col_A == 'X': row.Col_B = 'Y'

I believe iterrows isn't appropriate for altering table values, so I've converted the table to a list of DotMap dictionaries. With this I can loop over the list and, for each dictionary (row), write the code as above, and the alterations are saved.

Is it possible to do this with the data as a DataFrame?

There is a lot of logic, and I think it's clearest written this way, so I'd prefer not to use map or apply functions.
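
For reference, a rough sketch of the round trip being described (assuming the third-party dotmap package, whose toDict method converts back to plain dicts; the DataFrame contents are just examples):

from dotmap import DotMap
import pandas as pd

df = pd.DataFrame({'Col_A': ['X', 'Q'], 'Col_B': ['?', '?']})

# table -> list of attribute-style dicts
rows = [DotMap(r) for r in df.to_dict('records')]

# pages of row-level logic, written plainly
for row in rows:
    if row.Col_A == 'X':
        row.Col_B = 'Y'

# list of dicts -> table again
df = pd.DataFrame([r.toDict() for r in rows])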

  • I like to use for i, j in zip(df['A'], df['B']): if i == 1: j = 2, etc. You can loop through multiple columns in parallel with zip (see the sketch after these comments for a version that writes the results back). Commented Jun 24, 2020 at 6:11
  • The underlying storage of a DataFrame is a collection of numpy arrays, so you can iterate over that. It is recommended not to do so for performance reasons, because Python loops are much slower than Pandas or numpy methods, but it is no worse than iterating over any other Python container (provided you access the dataframe directly and not a view over it; so you are right, iterrows should be avoided). Commented Jun 24, 2020 at 6:17
  • @SergeBallesta I agree, but there are situations where looping is unavoidable. That said, I think creating functions, apply(lambda x), pandas/numpy vectorized methods and list comprehensions can do 95% of the job. Commented Jun 24, 2020 at 6:31
  • I have 20 columns and 20 operations to apply. Can this be extended to all columns? Something like for *df.columns in df.itertuples(): Commented Jun 24, 2020 at 6:32
  • @LincolnHannah as an alternative to looping, the vectorized np.where() is one of the most powerful methods; I use it constantly. I would look into that. Commented Jun 24, 2020 at 6:33
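
As the first comment notes, zip iterates columns in parallel, but rebinding the loop variable (j = 2) never touches the DataFrame. A minimal sketch of collecting the results and assigning them back (column names and values are just examples):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1], 'B': [0, 0, 0]})

new_b = []
for a, b in zip(df['A'], df['B']):
    # compute the new value instead of rebinding the loop variable
    new_b.append(2 if a == 1 else b)

# assign the whole column back in one go
df['B'] = new_b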

3 Answers


Let's have the following example dataframe:

import pandas as pd
import numpy as np

some_data = pd.DataFrame({
    'col_a': [1, 2, 1, 2, 3, 4, 3, 4],
    'col_b': ['a', 'b', 'c', 'c', 'a', 'b', 'z', 'z']
})

We want to create a new column based on one (or more) of the existing columns' values.

In case you have only two options, I would suggest using numpy.where like this:

some_data['np_where_example'] = np.where(some_data.col_a < 3, 'less_than_3', '3_or_more')
print(some_data[['col_a', 'col_b', 'np_where_example']])
>>>
   col_a col_b np_where_example
0      1     a      less_than_3
1      2     b      less_than_3
2      1     c      less_than_3
3      2     c      less_than_3
4      3     a        3_or_more
5      4     b        3_or_more
6      3     z        3_or_more
7      4     z        3_or_more

# multiple conditions
some_data['np_where_multiple_conditions'] = np.where((some_data.col_a >= 3) & (some_data.col_b == 'z'),
                                                     'is_true',
                                                     'is_false')
print(some_data[['col_a', 'col_b', 'np_where_multiple_conditions']])
>>>
   col_a col_b np_where_multiple_conditions
0      1     a                     is_false
1      2     b                     is_false
2      1     c                     is_false
3      2     c                     is_false
4      3     a                     is_false
5      4     b                     is_false
6      3     z                      is_true
7      4     z                      is_true

In case you have many options, pandas.Series.map would be better:

some_data['map_example'] = some_data.col_b.map({
    'b': 'BBB',
    'z': 'ZZZ'
})
print(some_data[['col_a', 'col_b', 'map_example']])
>>>
   col_a col_b map_example
0      1     a         NaN
1      2     b         BBB
2      1     c         NaN
3      2     c         NaN
4      3     a         NaN
5      4     b         BBB
6      3     z         ZZZ
7      4     z         ZZZ

As you can see, with map the values for which no mapping is specified evaluate to NaN (np.where, by contrast, always fills in one of its two branches).
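
If you want a default instead of NaN for the unmapped values, one option (a small sketch) is to chain fillna onto the map:

# fall back to a default (or e.g. to some_data.col_b itself) where the mapping has no entry
some_data['map_example'] = some_data.col_b.map({'b': 'BBB', 'z': 'ZZZ'}).fillna('other')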


1 Comment

I like np.where(). I feel like it is often overlooked.

You can use the apply function with a lambda in the following way:

df['Col_B'] = df['Col_A'].apply(lambda a: 'Y' if a == 'X' else 'N')

This creates the column Col_B on the dataframe df by looking at Col_A and assigning 'Y' where Col_A is 'X' and 'N' otherwise.

If your function is a bit more complex, you can define it beforehand and pass it to apply as follows:

def yes_or_no(x):
    if x == 'X':
        return 'Y'
    else:
        return 'N'
df['Col_B'] = df['Col_A'].apply(yes_or_no)
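
If the logic needs several columns at once, a row-wise apply is also possible, though slower (a sketch; Col_C is a made-up second input column for illustration):

import pandas as pd

df = pd.DataFrame({'Col_A': ['X', 'X', 'Q'], 'Col_C': [1, -1, 1]})

def derive_col_b(row):
    # row is a Series holding one row, so any column can feed the decision
    if row['Col_A'] == 'X' and row['Col_C'] > 0:
        return 'Y'
    return 'N'

df['Col_B'] = df.apply(derive_col_b, axis=1)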



A possible way to iterate over a dataframe by rows and change column values is:

  1. make sure that there are no duplicated values in the index (if there are, just use reset_index to get an acceptable index)

  2. iterate over the index and access the individual values with at

     for ix in df.index:
         if df.at[ix, 'A'] == ...:
             df.at[ix, 'B'] = z
    

Alternatively, if you access the columns by their positions instead of their names, you can use the even more efficient iat (you can look the positions up once with get_loc):

index_col_A = df.columns.get_loc('A')
index_col_B = df.columns.get_loc('B')

for i in range(len(df)):
    if df.iat[i, index_col_A] == ...:
        df.iat[i, index_col_B] = z

As you directly access the individual elements, you avoid the overhead of iterrows creating a Series per row, and your changes are actually stored. AFAIK, it is the least bad way when you cannot use the vectorized Pandas or numpy methods.
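
Applied to the example from the question, the at version would look like this (a sketch using the question's column names and made-up data):

import pandas as pd

df = pd.DataFrame({'Col_A': ['X', 'Q'], 'Col_B': ['?', '?']})

for ix in df.index:
    if df.at[ix, 'Col_A'] == 'X':
        df.at[ix, 'Col_B'] = 'Y'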

