I have a fairly specific algorithm I want to follow.

Basically I have a dataframe as follows:

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       12
4       1       2       59
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       45
9       3       1       43
10      3       2       43
11      4       1       23
12      4       2       94

I want to set 'score' to 100 for the rows at the end of each month where taken == 2 holds continuously through to that month's last row. In other words, a row with taken == 2 keeps its original score if any later row in the same month has taken == 1.

So the result I'd want is:

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       100
4       1       2       100
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       100
9       3       1       43
10      3       2       100
11      4       1       23
12      4       2       100

I've written this code which I feel should do it:

#iterate through months
for month in range(12):
    #iterate through scores
    for score in range(len(df_report.loc[df_report['month'] == month+1])):
        #starting from the bottom, of that month, if 'taken' == 2...
        if df_report.loc[df_report.month==month+1, 'taken'].iloc[-score-1] == 2:
            #then set the score to 100
            df_report.loc[df_report.month==month+1, 'score'].iloc[-score-2] = 100
        #if you run into a 'taken' == 1, move on to next month
        else: break

However, this doesn't appear to change any values, even though it doesn't throw an error. It also doesn't warn me about setting values on a copy of a dataframe.

Could anyone explain what I'm doing wrong?

2
  • If I had to guess, it would be that you're setting a copy of a copy with a new value. Chaining *loc calls isn't the best idea. Commented Oct 20, 2017 at 10:06
  • I think you're right, but how can I solve this? Also, if .loc isn't a copy, and .iloc isn't a copy, then why is an .iloc of a .loc a copy?! Commented Oct 20, 2017 at 10:13

2 Answers

2

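Your values are not updated because .loc with a boolean mask returns a new object (a copy), and the chained .iloc[...] = 100 writes into that temporary copy, so the original dataframe is never touched. Note also that your loop tests the row at .iloc[-score-1] but assigns to .iloc[-score-2], which is the row above it; both should use the same position.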
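
To make the copy behaviour concrete, here is a minimal sketch on a one-month slice of the question's data. The names mask, tmp, score_after_chained and last_label are introduced only for illustration; the idea is to resolve the row label first and then assign with a single .loc call on the original dataframe.

```python
import pandas as pd

# One month of the question's data, with the same 1-based index labels
df_report = pd.DataFrame(
    {"month": [1, 1, 1, 1], "taken": [2, 1, 2, 2], "score": [23, 34, 12, 59]},
    index=[1, 2, 3, 4],
)

mask = df_report["month"] == 1

# Boolean-mask .loc returns a new Series, so writing through .iloc
# only changes that temporary object, not df_report.
tmp = df_report.loc[mask, "score"]
tmp.iloc[-1] = 100
score_after_chained = int(df_report["score"].iloc[-1])  # still the old value

# Single-step alternative: look up the label, then assign with one .loc call.
last_label = df_report.index[mask.to_numpy()][-1]
df_report.loc[last_label, "score"] = 100
```

After the last line the change is visible in df_report itself, because the assignment goes through one indexing call on the original object.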


Here's how I'd tackle this. First, define a function foo.

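def foo(df):
    # Work on a copy so the group handed in by groupby is not mutated
    df = df.copy()
    # Walk the group's rows from last to first
    for i in reversed(df.index):
        # Stop at the first row (counting from the end) where taken is not 2
        if df.loc[i, 'taken'] != 2:
            break
        df.loc[i, 'score'] = 100
    return df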

Now, groupby month and call foo:

# group_keys=False keeps the original row index instead of prepending the month
df = df.groupby('month', group_keys=False).apply(foo)
print(df)
    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

Obviously, apply has its shortcomings, but I cannot think of a vectorised approach to this problem.
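
For what it's worth, one possible vectorised sketch is a reversed cumulative minimum within each month: scanning from the bottom, the flag stays 1 only while every row seen so far has taken == 2. This assumes the rows of each month are already in chronological order; is_two, rev_month and trailing are names made up for this example.

```python
import pandas as pd

# The question's input frame
df = pd.DataFrame(
    {
        "month": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
        "taken": [2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        "score": [23, 34, 12, 59, 12, 23, 43, 45, 43, 43, 23, 94],
    },
    index=range(1, 13),
)

# 1 where taken == 2, else 0, with the rows in reverse order
is_two = df["taken"].eq(2).astype(int).iloc[::-1]
rev_month = df["month"].iloc[::-1]

# Per-month cumulative minimum from the bottom up: it drops to 0 at the
# first non-2 row and stays there, so only the trailing run of 2s is kept.
trailing = is_two.groupby(rev_month).cummin().iloc[::-1].astype(bool)

df.loc[trailing, "score"] = 100
```

I have not timed this against the apply version, so treat it as an alternative to try rather than a guaranteed speed-up.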

1 Comment

Me neither. I can get rid of the for loop, but not the apply in the groupby.
2

You can do

import numpy as np
import pandas as pd

def get_value(x):
    s = x['taken']
    # Label each run of consecutive equal values with its own id
    runs = s.ne(s.shift()).cumsum()
    # Keep only the final run of the month, and only if that run is made of 2s
    mask = runs.eq(runs.iloc[-1]) & s.eq(2)
    # Set the score to 100 inside that trailing run, keep it elsewhere
    news = np.where(mask, 100, x['score'])
    return pd.Series(news, index=x.index)

df['score'] = df.groupby('month', group_keys=False).apply(get_value)

Output :

   month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

The two are almost identical in speed, but @coldspeed's version comes out slightly ahead:

ndf = pd.concat([df]*10000).reset_index(drop=True)

%%timeit
ndf['score'] = ndf.groupby('month').apply(foo)
10 loops, best of 3: 40.8 ms per loop


%%timeit  
ndf['score'] = ndf.groupby('month').apply(get_value).values
10 loops, best of 3: 42.6 ms per loop

5 Comments

Are you sure this is better than a simple iteration?
I don't know; I need to check the speed.
@cᴏʟᴅsᴘᴇᴇᴅ It's very, very weird. The difference is 2 milliseconds no matter how big the dataframe gets on my PC.
This isn't the kind of data that is easy to time, because simply concatenating smaller dfs to get a large df won't change things: you still have the same number of 2s at the end of each group. Hope I'm making sense.
Yeah, I too am curious about the real data.
