Setting pandas dataframe value based on row and column conditions

Question

I have a fairly specific algorithm I want to follow.

Basically I have a dataframe as follows:

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       12
4       1       2       59
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       45
9       3       1       43
10      3       2       43
11      4       1       23
12      4       2       94

I want to set 'score' to 100 on the days where taken == 2 holds continuously until the end of that month. So not every occurrence of taken == 2 gets its score set to 100: a day is skipped if any later day in the same month has taken == 1.

So the result I'd want is:

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       100
4       1       2       100
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       100
9       3       1       43
10      3       2       43
11      3       1       23
12      3       2       100
13      4       1       32
14      4       2       100

I've written this code which I feel should do it:

#iterate through months
for month in range(12):
    #iterate through scores
    for score in range(len(df_report.loc[df_report['month'] == month+1])):
        #starting from the bottom, of that month, if 'taken' == 2...
        if df_report.loc[df_report.month==month+1, 'taken'].iloc[-score-1] == 2:
            #then set the score to 100
            df_report.loc[df_report.month==month+1, 'score'].iloc[-score-2] = 100
        #if you run into a 'taken' == 1, move on to next month
        else: break

However, this doesn't appear to change any values, despite not throwing an error... it also doesn't give me an error about setting values to a copied dataframe.

Could anyone explain what I'm doing wrong?

If I had to guess, it would be that you're setting a copy of a copy with a new value. Chaining *loc calls isn't the best idea.
– cs95, Commented Oct 20, 2017 at 10:06
I think you're right, but how can I solve this? Also, if .loc isn't a copy, and .iloc isn't a copy, then why is an .iloc of a .loc a copy?!
– James, Commented Oct 20, 2017 at 10:13

cs95 · Accepted Answer · 2017-10-20 10:24:54Z

2

The reason for your values not being updated is that assignment to iloc updates the copy returned by the preceding loc call, so the original is not touched.
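
As a rough illustration of that point (a minimal sketch on a four-row frame made up for this example, reusing the question's column names), the chained form writes to a temporary object, while a single .loc call on a row label writes to the original. The warning filter is only there because some pandas versions warn about the chained line.

```python
import warnings

import pandas as pd

# Small illustrative frame; the values mirror month 1 of the question.
df_report = pd.DataFrame(
    {"month": [1, 1, 1, 1], "taken": [2, 1, 2, 2], "score": [23, 34, 12, 59]},
    index=[1, 2, 3, 4],
)
in_month = df_report["month"] == 1

# Chained form: .loc with a boolean mask returns a new object, so the
# .iloc assignment lands on that temporary and is then discarded.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df_report.loc[in_month, "score"].iloc[-1] = 100
unchanged = df_report["score"].tolist()

# Single-step form: look up the row label first, then assign via one .loc call.
last_label = df_report.index[in_month.to_numpy()][-1]
df_report.loc[last_label, "score"] = 100
changed = df_report["score"].tolist()

print(unchanged)
print(changed)
```

The second form works because the row and column are both resolved inside one indexing call on df_report itself.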

Here's how I'd tackle this. First, define a function foo.

def foo(df):
    # walk this month's rows bottom-up and stop at the first row where taken != 2
    for i in reversed(df.index):
        if df.loc[i, 'taken'] != 2:
            break
        df.loc[i, 'score'] = 100
    return df

Now, groupby month and call foo:

# group_keys=False keeps the original row index in recent pandas versions
df = df.groupby('month', group_keys=False).apply(foo)
print(df)
    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

Obviously, apply has its shortcomings, but I cannot think of a vectorised approach to this problem.
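
For what it's worth, one possible vectorised variant (not part of the approach above, and only a sketch that assumes a unique row index) is to reverse the frame and take a per-month running minimum of the flag taken == 2. The flag stays at 1 only for the unbroken run of 2s at the end of each month. The frame below re-creates the question's input, and the names rev, flag and trailing are just illustrative.

```python
import pandas as pd

# The question's input data, with the same 1..12 row labels.
df = pd.DataFrame(
    {
        "month": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
        "taken": [2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        "score": [23, 34, 12, 59, 12, 23, 43, 45, 43, 43, 23, 94],
    },
    index=range(1, 13),
)

# Walk the rows bottom-up; within each month the running minimum of the
# 0/1 flag "taken == 2" stays 1 only until the first row that is not 2.
rev = df.iloc[::-1]
flag = rev["taken"].eq(2).astype(int)
trailing = flag.groupby(rev["month"]).cummin().astype(bool)

# trailing is indexed by row label, so the True labels can be fed to .loc.
df.loc[trailing[trailing].index, "score"] = 100
print(df)
```

On the sample data this marks rows 3, 4, 8, 10 and 12, the same rows as the groupby/apply version.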

answered Oct 20, 2017 at 10:24


1 Comment

Bharath M Shetty Over a year ago

Me neither. I can get rid of the for loop, but not the apply in the groupby.

Bharath M Shetty · Accepted Answer · 2017-10-20 10:43:56Z

2

You can do

import numpy as np
import pandas as pd

def get_value(x):
    # Flag the trailing run of taken == 2: reverse the group, take the
    # running minimum of the boolean flags, then reverse back
    is_two = x['taken'].eq(2).to_numpy()
    mask = np.minimum.accumulate(is_two[::-1])[::-1]
    # Set flagged rows to 100 and keep the other scores
    return pd.Series(np.where(mask, 100, x['score']), index=x.index)

df['score'] = df.groupby('month', group_keys=False)[['taken', 'score']].apply(get_value)

Output :

   month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

The speed is almost identical, but @coldspeed's approach is slightly faster:

ndf = pd.concat([df]*10000).reset_index(drop=True)

%%timeit
ndf['score'] = ndf.groupby('month').apply(foo)
10 loops, best of 3: 40.8 ms per loop


%%timeit  
ndf['score'] = ndf.groupby('month').apply(get_value).values
10 loops, best of 3: 42.6 ms per loop

edited Oct 20, 2017 at 10:43

answered Oct 20, 2017 at 10:30


5 Comments

cs95 Over a year ago

Sure this is better than a simple iteration?

Bharath M Shetty Over a year ago

I don't know; I need to check the speed.

Bharath M Shetty Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ it's very, very weird. The difference is 2 milliseconds no matter how big the dataframe gets on my PC.

cs95 Over a year ago

This isn't the kind of data for which it is very easy to time, because simply concatenating smaller dfs to get a large df won't change things - you still have the same number of 2s at the end of each group. Hope I'm making sense.

Bharath M Shetty Over a year ago

Yeah, I too am curious about the real data.
