pandas use Boolean mask replace row iteration in dataframe

Question

I have the following df,

days    days_1    days_2    period    percent_1   percent_2    amount
3       5         4         1         0.2         0.1         100
2       1         3         4         0.3         0.1         500
9       8         10        6         0.4         0.2         600
10      7         8         11        0.5         0.3         700
10      5         6         7         0.7         0.4         800

I have the following logic that applies to each row of the df,

for each row in df:
    if days < days_1:
        amount_missed = 0
        days_missed = 0
    elif days_1 < days < days_2:
        missed_percent = percent_1 - percent_2
        amount_missed = amount * (missed_percent / 100)
        days_missed = days - days_1    
    elif days_2 < days < period or days > period:    
        missed_percent = percent_2
        amount_missed = amount * (missed_percent / 100)
        days_missed = days - days_2
    else:
        amount_missed = 0
        days_missed = 0
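
For reference, the pseudocode above can be written as a runnable baseline to check any vectorized version against. This is only a sketch: the helper name `missed` is made up for illustration, and the dataframe is rebuilt from the table shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    'days': [3, 2, 9, 10, 10],
    'days_1': [5, 1, 8, 7, 5],
    'days_2': [4, 3, 10, 8, 6],
    'period': [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount': [100, 500, 600, 700, 800],
})


def missed(row):
    """Hypothetical helper: return (amount_missed, days_missed) for one row."""
    if row.days < row.days_1:
        return 0.0, 0
    if row.days_1 < row.days < row.days_2:
        missed_percent = row.percent_1 - row.percent_2
        return row.amount * (missed_percent / 100), row.days - row.days_1
    if row.days_2 < row.days < row.period or row.days > row.period:
        return row.amount * (row.percent_2 / 100), row.days - row.days_2
    return 0.0, 0


# Slow row-by-row pass, kept only as a reference for the expected values.
results = [missed(row) for row in df.itertuples(index=False)]
amount_missed = [amount for amount, _ in results]
days_missed = [days for _, days in results]
```

The early `return` statements mirror the `if` / `elif` order, so a row is handled by the first branch it matches.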

I am trying to use boolean mask and np.where to translate the above logic as follows,

cond1 = df['days_2'] < df['days']
cond2 = df['days'] < df['period']
cond3 = df['days'] > df['period']
cond4 = df['days'] >= df['days_1']
cond5 = df['days'] < df['days_2']
cond6 = df['days'] > df['days_1']

mask = ((cond1 & cond2) | cond3) & cond4
mask2 = cond5 & cond6

df['amount_missed'] = np.where(mask, df['amount'] * df['percent_2'] / 100, 0.0)
df['amount_missed'] = np.where(mask2, df['amount'] * (df['percent_1'] - df['percent_2']) / 100, 0.0)

df['days_missed'] = np.where(mask, df['days'] - df['days_2'], 0)
df['days_missed'] = np.where(mask2, df['days'] -df['days_1'], 0)

But the result of the code above does not match the row-iteration result, which should be:

{
 'amount_missed': {0: 0.0, 1: 1.0, 2: 1.2, 3: 2.1, 4: 3.2},
 'days_missed': {0: 0, 1: 1, 2: 1, 3: 2, 4: 4}
 }

The boolean-mask version produces the following result instead:

{
 'amount_missed': {0: 0.0, 1: 0.9999999999999999, 2: 1.2, 3: 0.0, 4: 0.0},
 'days_missed': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0}
 }

I am wondering how to fix it, and whether there are other ways to replace the row iteration over the df here.

The code you provided with the explicit loop also does not provide the output which you say it should provide. I assume the 7th line should be changed to set the value of 'amount_missed' instead of 'amount', but even then results are still different.
– Dennis Soemers, Commented Jan 23, 2018 at 11:18
Simplify! Show us just ONE output array that differs, with the code for just that one, and let's debug that. No need to show us 9 different arrays, some of which have no errors.
– John Zwinck, Commented Jan 23, 2018 at 12:05

John Zwinck · Accepted Answer · 2018-01-23 12:34:14Z

2

The root cause of the bug is that each new np.where() call overwrites the target column and discards the previous result, rather than cascading the where() expressions. But better than cascading where() expressions is np.select():

c0 = df.days < df.days_1
c1 = (df.days_1 < df.days) & (df.days < df.days_2)
c2 = ((df.days_2 < df.days) & (df.days < df.period)) | (df.days > df.period)

df['days_missed'] = np.select([c0, c1, c2], [0, df.days - df.days_1, df.days - df.days_2])
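
The same three conditions can be reused for `amount_missed`. The sketch below rebuilds the dataframe from the question and also shows one way to cascade `np.where()` (passing the previous result as the fallback) for comparison. The names `amount_1`, `amount_2` and `cascaded` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'days': [3, 2, 9, 10, 10],
    'days_1': [5, 1, 8, 7, 5],
    'days_2': [4, 3, 10, 8, 6],
    'period': [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount': [100, 500, 600, 700, 800],
})

# Same conditions as c0, c1, c2 above, written with bracket access.
c0 = df['days'] < df['days_1']
c1 = (df['days_1'] < df['days']) & (df['days'] < df['days_2'])
c2 = ((df['days_2'] < df['days']) & (df['days'] < df['period'])) | (df['days'] > df['period'])

amount_1 = df['amount'] * (df['percent_1'] - df['percent_2']) / 100
amount_2 = df['amount'] * df['percent_2'] / 100

# np.select uses the first condition that is True, like if / elif.
# c0 has to come first: row 0 satisfies both c0 and c2 and must end up as 0.
df['amount_missed'] = np.select([c0, c1, c2], [0.0, amount_1, amount_2], default=0.0)
df['days_missed'] = np.select(
    [c0, c1, c2],
    [0, df['days'] - df['days_1'], df['days'] - df['days_2']],
    default=0,
)

# Cascaded np.where: apply the lowest-priority branch first, then let the
# higher-priority branch override it, keeping the earlier result as fallback.
cascaded = np.where(c2 & ~c0, amount_2, 0.0)
cascaded = np.where(c1, amount_1, cascaded)
```

Both approaches should agree here; `np.select` just states the priority order more directly.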

answered Jan 23, 2018 at 12:34

John Zwinck


Dennis Soemers (edited by jezrael) · Accepted Answer · 2018-01-23 12:40:14Z

2

Code used to generate the original dataframe (from the original, unedited question):

df = pd.DataFrame({
    'days': [3, 2, 9, 10, 10],
    'days_1': [5, 1, 8, 7, 5],
    'days_2': [4, 3, 10, 8, 6],
    'period': [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount': [100, 500, 600, 700, 800]
}, columns=['days', 'days_1', 'days_2', 'period', 'percent_1', 'percent_2', 'amount'])

The following code provides the results you wanted in your original question (not updated for the simplified case you created after being asked to do so in comments):

df['amount_missed'] = np.where((df['days_1'] < df['days']) & (df['days'] < df['days_2']),
                               df['amount'] * (df['percent_1'] - df['percent_2']) / 100,
                               np.where((df['days_2'] < df['days']) & (df['days'] < df['period']),
                                        df['amount'] * (df['percent_2']) / 100,
                                        0.0))

df['days_missed'] = np.where((df['days_1'] < df['days']) & (df['days'] < df['days_2']),
                             df['days'] - df['days_1'],
                             np.where((df['days_2'] < df['days']) & (df['days'] < df['period']),
                                      df['days'] - df['days_2'],
                                      0))

Output:

   days  days_1  days_2  period  percent_1  percent_2  amount  amount_missed  \
0     3       5       4       1        0.2        0.1     100            0.0   
1     2       1       3       4        0.3        0.1     500            1.0   
2     9       8      10       6        0.4        0.2     600            1.2   
3    10       7       8      11        0.5        0.3     700            2.1   
4    10       5       6       7        0.7        0.4     800            0.0   

   days_missed  
0            0  
1            1  
2            1  
3            2  
4            0

EDIT:

Same answer with numpy.select:

m1 = (df['days_1'] < df['days']) & (df['days'] < df['days_2'])
s1 = df['amount'] * (df['percent_1'] - df['percent_2']) / 100
s11 = df['days'] - df['days_1']

m2 = (df['days_2'] < df['days']) & (df['days'] < df['period'])
s2 = df['amount'] * (df['percent_2']) / 100
s22 = df['days'] - df['days_2']

df['amount_missed'] = np.select([m1, m2], [s1, s2], default=0)
df['days_missed'] =   np.select([m1, m2], [s11, s22], default=0)
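
Note that `m2` above has no `days > period` clause, which is why row 4 comes out as 0 in the output shown, whereas the question as it currently stands expects 3.2 and 4 for that row. A possible adaptation is sketched below, under the assumption that the current pseudocode is the intended logic. Adding that clause makes row 0 match `m2` as well, so a guard `m0` for `days < days_1` (a name introduced here) is placed first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'days': [3, 2, 9, 10, 10],
    'days_1': [5, 1, 8, 7, 5],
    'days_2': [4, 3, 10, 8, 6],
    'period': [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount': [100, 500, 600, 700, 800],
})

# Guard for the first branch of the pseudocode (assumed name m0).
m0 = df['days'] < df['days_1']

m1 = (df['days_1'] < df['days']) & (df['days'] < df['days_2'])
s1 = df['amount'] * (df['percent_1'] - df['percent_2']) / 100
s11 = df['days'] - df['days_1']

# m2 extended with the `days > period` clause from the question.
m2 = ((df['days_2'] < df['days']) & (df['days'] < df['period'])) | (df['days'] > df['period'])
s2 = df['amount'] * df['percent_2'] / 100
s22 = df['days'] - df['days_2']

amount_missed = np.select([m0, m1, m2], [0.0, s1, s2], default=0.0)
days_missed = np.select([m0, m1, m2], [0, s11, s22], default=0)

# Compare floats with a tolerance rather than ==, since the vectorized
# arithmetic can differ from the loop in the last digit.
expected_amount = [0.0, 1.0, 1.2, 2.1, 3.2]
expected_days = [0, 1, 1, 2, 4]
amount_ok = np.allclose(amount_missed, expected_amount)
days_ok = np.array_equal(days_missed, expected_days)
```

The tolerance check matters because the question's own output shows 0.9999999999999999 where the loop gives 1.0.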

edited Jan 23, 2018 at 12:40 by jezrael

answered Jan 23, 2018 at 12:26 by Dennis Soemers

4 Comments

jezrael Over a year ago

hmmm, np.select here should be nicer ;)

Dennis Soemers Over a year ago

@jezrael I agree this may not necessarily be the cleanest solution. I partially decided to figure out how to answer the question because I was interested in learning how to work with np.where myself, but would definitely also be interested in cleaner solutions! I edited the code used to generate the original dataframe into my answer, maybe that'll be useful if you decide to also write an answer with a potentially cleaner solution

jezrael Over a year ago

Can I rewrite your solution and add it to your answer? :)

Dennis Soemers Over a year ago

@jezrael Sure. I see John also just posted a solution with np.select already though
