I'm trying to iterate through two columns of a pandas dataframe to determine specific values and then add the results to a new column. The code below throws the following error:

raise ValueError('Length of values does not match length of ' 'index')

I'm not sure why.

Dataframe:

    TeamID    todayorno
1   sw        True
2   pr        False
3   sw        False
4   pr        True

Code:

team = []

for row in results['TeamID']:   
    if row == "sw":
        for r in results['todayorno']:
            if r == True:
                team.append('red')
            else:
                team.append('green')
    else:
        team.append('green')

results['newnew'] = team  
  • You gave an example of how the dataframe looks before your code runs. Can you give an example of how you want the dataframe to look after your code runs? Commented Jun 14, 2018 at 20:27

2 Answers


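You are iterating over your dataframe twice, as the two nested for loops show. For each "sw" row the inner loop appends one value per row of the dataframe (4 values), and every other row appends 1 value, so you end up with 10 items instead of the required 4.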
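
The count can be checked by rerunning the loop on a hand-built copy of the question's dataframe. This is only a sketch: the dataframe construction is an assumption based on the table shown in the question.

```python
import pandas as pd

# Assumed reconstruction of the question's dataframe (index 1..4).
results = pd.DataFrame(
    {
        "TeamID": ["sw", "pr", "sw", "pr"],
        "todayorno": [True, False, False, True],
    },
    index=[1, 2, 3, 4],
)

team = []
for row in results["TeamID"]:
    if row == "sw":
        # The inner loop walks the whole column again for every "sw" row.
        for r in results["todayorno"]:
            team.append("red" if r else "green")
    else:
        team.append("green")

print(len(team), len(results))  # 10 4
```

The list holds 10 values while the dataframe has 4 rows, which is why assigning it as a column fails.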

Explicit iteration is not required. You can use numpy.select to apply values for specified conditions.

import numpy as np

mask = results['TeamID'] == 'sw'
conditions = [~mask, mask & results['todayorno'], mask & ~results['todayorno']]
values = ['green', 'red', 'green']

results['newnew'] = np.select(conditions, values, 'green')

print(results)

  TeamID  todayorno newnew
1     sw       True    red
2     pr      False  green
3     sw      False  green
4     pr       True  green
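
To see how the conditions line up with the values, it can help to print each condition as a plain list. The sketch below rebuilds the example dataframe by hand (an assumption; the real results may hold more columns) and relies on np.select choosing, for each row, the value of the first condition that is True.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example dataframe.
results = pd.DataFrame(
    {
        "TeamID": ["sw", "pr", "sw", "pr"],
        "todayorno": [True, False, False, True],
    }
)

mask = results["TeamID"] == "sw"
conditions = [~mask, mask & results["todayorno"], mask & ~results["todayorno"]]
values = ["green", "red", "green"]

# One Boolean list per condition, one entry per row.
as_lists = [cond.tolist() for cond in conditions]
for cond_list, value in zip(as_lists, values):
    print(value, cond_list)
# green [False, True, False, True]
# red [True, False, False, False]
# green [False, False, True, False]

colours = np.select(conditions, values, default="green")
print(colours.tolist())  # ['red', 'green', 'green', 'green']
```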

3 Comments

Thank you jpp! This is working exactly as intended. I'm going to be reading up on numpy! :D
can you explain how this line works? conditions = [~mask, mask & results['todayorno'], mask & ~results['todayorno']] ---- let me take a stab: are you selecting all inverse of the mask, mask with todayorno, and mask with inverse selection of todayorno? then assigning the same order as green, red, green?
~ means negation (element-wise NOT), and & means "and" (element-wise intersection). Each item of conditions corresponds to the item of values at the same position. For example, ~mask matches with 'green' (the first element in each list).

Quick answer

Don't try to loop.

Instead, create the new column with a default value (i.e. the most common), and then address the values you want to change and set them:

>>> results
  TeamID  todayorno
0     sw       True
1     pr      False
2     sw      False
3     pr       True
>>> results['newnew'] = 'green'
>>> results
  TeamID  todayorno newnew
0     sw       True  green
1     pr      False  green
2     sw      False  green
3     pr       True  green
>>> results.loc[(results['TeamID'] == 'sw') & (results['todayorno']), 'newnew'] = 'red'
>>> results
  TeamID  todayorno newnew
0     sw       True    red
1     pr      False  green
2     sw      False  green
3     pr       True  green

Alternatively, you can use .apply(..., axis=1) to calculate a whole series with a function that looks at each row, and assign the whole series at once as a column:

>>> results
  TeamID  todayorno
0     sw       True
1     pr      False
2     sw      False
3     pr       True
>>> results['newnew'] = results.apply(
...     lambda s: 'red' if s['TeamID'] == 'sw' and s['todayorno'] else 'green',
...     axis=1,
... )
>>> results
  TeamID  todayorno newnew
0     sw       True    red
1     pr      False  green
2     sw      False  green
3     pr       True  green

Explanation

The problem

As far as I can tell from your code, you're trying to add a column called newnew to your dataframe.

In the rows of the dataframe where the TeamID column contains the value "sw" and the column todayorno contains the value True, you want the column newnew to contain the value "red".

In all other rows, you want the value of newnew to be "green".

A rule

To work efficiently with pandas, a very important rule is: don't try to loop. Especially through the rows.

Instead get pandas to do the work for you.

So, the first step is to create the new column. And since in most cases you want the value to be "green", you can simply do:

results['newnew'] = 'green'

Now your dataframe looks like:

  TeamID  todayorno newnew
0     sw       True  green
1     pr      False  green
2     sw      False  green
3     pr       True  green

You'll notice that pandas "expanded" the single value provided through all the rows.
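
The contrast with the original error can be sketched as follows: a single value is broadcast to every row, while a list must have exactly one item per row. The dataframe is rebuilt by hand here, and the column name too_long is hypothetical.

```python
import pandas as pd

# Assumed reconstruction of the example dataframe.
results = pd.DataFrame(
    {
        "TeamID": ["sw", "pr", "sw", "pr"],
        "todayorno": [True, False, False, True],
    }
)

# A scalar is repeated for every row.
results["newnew"] = "green"
print(results["newnew"].tolist())  # ['green', 'green', 'green', 'green']

# A list of the wrong length is rejected with a ValueError.
error = None
try:
    results["too_long"] = ["green"] * 10
except ValueError as exc:
    error = str(exc)
print(error)
```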

Now to get the sw/True rows to be "red", first you need to locate them all. For this we need to understand how pandas addressing works.

(A little bit of) How pandas addressing works

When you use square brackets after a pandas dataframe, you are, in general, addressing the columns of your dataframe. Ex:

>>> results['TeamID']
0    sw
1    pr
2    sw
3    pr
Name: TeamID, dtype: object

I.e. by requesting the TeamID key of the results dataframe, you got back a Series called TeamID containing only the values of that column.

On the other hand, if you want to address rows, you need to use the .loc property.

>>> results.loc[1]
TeamID          pr
todayorno    False
newnew       green
Name: 1, dtype: object

Here we got back a Series containing the values of the row.

If we want to see multiple rows, we can get a sub-dataframe by indexing a list of rows:

>>> results.loc[[1,2]]
  TeamID  todayorno newnew
1     pr      False  green
2     sw      False  green

Or by using a condition:

>>> results.loc[results['TeamID'] == 'pr']
  TeamID  todayorno newnew
1     pr      False  green
3     pr       True  green

The condition can contain boolean combinations, but the syntax for that has special requirements, like using & instead of and and carefully wrapping the parts of the condition with parentheses due to the precedence of the & operator:

>>> results.loc[(results['TeamID'] == 'sw') & (results['todayorno'])]
  TeamID  todayorno newnew
0     sw       True  green
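
The need for & rather than and can be checked directly. In this sketch (dataframe rebuilt by hand as an assumption), the parenthesised & version selects the matching row, while Python's and raises a ValueError because a Series has no single truth value.

```python
import pandas as pd

# Assumed reconstruction of the example dataframe.
results = pd.DataFrame(
    {
        "TeamID": ["sw", "pr", "sw", "pr"],
        "todayorno": [True, False, False, True],
    }
)
results["newnew"] = "green"

# Element-wise "and": each side is wrapped in parentheses.
selected = results.loc[(results["TeamID"] == "sw") & (results["todayorno"])]
print(selected.index.tolist())  # [0]

# Python's "and" asks for the truth value of a whole Series, which fails.
and_failed = False
try:
    results.loc[(results["TeamID"] == "sw") and (results["todayorno"])]
except ValueError:
    and_failed = True
print(and_failed)  # True
```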

The .loc property can also address by both rows and columns. A comma separates the addressing parts where the addressing of rows comes first and the columns last:

>>> results.loc[results['TeamID'] == 'pr', 'todayorno']
1    False
3     True
Name: todayorno, dtype: bool

The final touch

And the .loc property can be used for assignments as well, by assigning the value you want to the desired "coordinates".

So in your case:

>>> results.loc[
...     (results['TeamID'] == 'sw') & (results['todayorno']),
...     'newnew'
... ] = "red"
>>> results
  TeamID  todayorno newnew
0     sw       True    red
1     pr      False  green
2     sw      False  green
3     pr       True  green

The other solution

The .apply() method of dataframes allows applying a single function multiple times, either column-wise or row-wise. To apply row-wise, pass the axis=1 parameter.

If the function passed to .apply(..., axis=1) returns a single value, then the results of all the calls are combined into a Series with the same addressing (the same index, in pandas parlance) as the rows of the dataframe.

So:

>>> results.apply(
...     lambda s: 'red' if s['TeamID'] == 'sw' and s['todayorno'] else 'green',
...     axis=1,
... )
0      red
1    green
2    green
3    green
dtype: object

This can then be assigned as a column of the dataframe:

>>> results['newnew'] = results.apply(
...     lambda s: 'red' if s['TeamID'] == 'sw' and s['todayorno'] else 'green',
...     axis=1,
... )
>>> results
  TeamID  todayorno newnew
0     sw       True    red
1     pr      False  green
2     sw      False  green
3     pr       True  green

1 Comment

Note pd.Series.apply + lambda should be used as a very last resort where vectorised solutions are not possible. It is far more efficient to use Boolean series for indexing than row-wise ternary statements.
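
A vectorised version along the lines this comment suggests could use numpy.where on a Boolean series. This is a sketch, not part of either answer: the dataframe is rebuilt by hand, and the names is_red and row_wise are hypothetical. It also checks that the result matches the row-wise .apply version.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example dataframe.
results = pd.DataFrame(
    {
        "TeamID": ["sw", "pr", "sw", "pr"],
        "todayorno": [True, False, False, True],
    }
)

# Boolean series: True only where the team is "sw" and todayorno is True.
is_red = (results["TeamID"] == "sw") & results["todayorno"]
results["newnew"] = np.where(is_red, "red", "green")

# Row-wise equivalent, kept only for comparison.
row_wise = results.apply(
    lambda s: "red" if s["TeamID"] == "sw" and s["todayorno"] else "green",
    axis=1,
)

matches = bool((results["newnew"] == row_wise).all())
print(results["newnew"].tolist())  # ['red', 'green', 'green', 'green']
print(matches)  # True
```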
