
I want to populate the empty columns 'web', 'mob' and 'app' in df1 by summing df2's values over each row's id and date range.

df1:

id      start       end         web mob app
12345   2018-01-17  2018-01-20
12346   2018-01-19  2018-01-22
12347   2018-01-20  2018-01-23
12348   2018-01-20  2018-01-23
12349   2018-01-21  2018-01-24

df2:

id      date        web mob app
12345   2018-01-17  7   17  10
12345   2018-01-18  9   18  7
12345   2018-01-19  3   19  15
12345   2018-01-20  6   17  8
12345   2018-01-21  8   9   13
12345   2018-01-22  4   15  12
12345   2018-01-23  8   11  13
12345   2018-01-24  9   16  14
12346   2018-01-17  3   17  12
12346   2018-01-18  4   19  4
12346   2018-01-19  6   13  10
12346   2018-01-20  1   15  6
12346   2018-01-21  4   12  11
12346   2018-01-22  5   20  12
12346   2018-01-23  8   13  14
12346   2018-01-24  6   18  8

This for loop will populate the 'web' column:

column = []

for i in df1.index:
    column.append(df2[(df2['date'] >= df1['start'].iloc[i])
        & (df2['date'] <= df1['end'].iloc[i])
        & (df2['id'] == df1['id'].iloc[i])]['web'].sum())

df1['web'] = column

I want to be able to populate all 3 columns with one for loop, rather than doing 3 separate loops.

I have a feeling that appending the result of something like

.agg({'web':'sum', 'mob':'sum', 'app':'sum'})

for each row to a two-dimensional list could be the answer.

Also... is there a more efficient way to do this than using for loops? Maybe by using numpy.where? I'm finding that running multiple for loops over large data sets can be very very slow.
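That hunch does work: collecting one `.agg` result per row needs only a single loop. A sketch on a cut-down version of the frames above (contents abbreviated, dates assumed already parsed):

```python
import pandas as pd

# Abbreviated versions of df1 and df2 from the question
df1 = pd.DataFrame({
    'id': [12345, 12346],
    'start': pd.to_datetime(['2018-01-17', '2018-01-19']),
    'end':   pd.to_datetime(['2018-01-20', '2018-01-22']),
})
df2 = pd.DataFrame({
    'id': [12345] * 4 + [12346] * 4,
    'date': pd.to_datetime(['2018-01-17', '2018-01-18',
                            '2018-01-19', '2018-01-20'] * 2),
    'web': [7, 9, 3, 6, 3, 4, 6, 1],
    'mob': [17, 18, 19, 17, 17, 19, 13, 15],
    'app': [10, 7, 15, 8, 12, 4, 10, 6],
})

rows = []
for i in df1.index:
    mask = ((df2['date'] >= df1['start'].iloc[i])
            & (df2['date'] <= df1['end'].iloc[i])
            & (df2['id'] == df1['id'].iloc[i]))
    # .agg returns one Series per df1 row, indexed 'web', 'mob', 'app'
    rows.append(df2.loc[mask].agg({'web': 'sum', 'mob': 'sum', 'app': 'sum'}))

df1[['web', 'mob', 'app']] = pd.DataFrame(rows, index=df1.index)
```

This is still a Python-level loop over df1, though, so it only fixes the triplication, not the speed.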

2 Answers


IIUC

s = df1.merge(df2, on='id', how='left')
output = s[(s.start <= s.date) & (s.end >= s.date)].groupby('id')[['web', 'mob', 'app']].sum()
output
Out[991]: 
        web   mob   app
id                     
12345  25.0  71.0  40.0
12346  16.0  60.0  39.0

Then we use merge again:

df1.merge(output.reset_index(),how='left').fillna(0)
Out[995]: 
      id      start        end   web   mob   app
0  12345 2018-01-17 2018-01-20  25.0  71.0  40.0
1  12346 2018-01-19 2018-01-22  16.0  60.0  39.0
2  12347 2018-01-20 2018-01-23   0.0   0.0   0.0
3  12348 2018-01-20 2018-01-23   0.0   0.0   0.0
4  12349 2018-01-21 2018-01-24   0.0   0.0   0.0
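As a small aside (not part of the answer above): when each id occurs only once in df1, the second merge can be swapped for a reindex on the grouped result, which also fills in ids with no rows in df2:

```python
import pandas as pd

# Abbreviated df1/df2; id 12347 has no rows in df2 at all
df1 = pd.DataFrame({
    'id': [12345, 12346, 12347],
    'start': pd.to_datetime(['2018-01-17', '2018-01-19', '2018-01-20']),
    'end':   pd.to_datetime(['2018-01-20', '2018-01-22', '2018-01-23']),
})
df2 = pd.DataFrame({
    'id': [12345] * 4 + [12346] * 4,
    'date': pd.to_datetime(['2018-01-17', '2018-01-18',
                            '2018-01-19', '2018-01-20'] * 2),
    'web': [7, 9, 3, 6, 3, 4, 6, 1],
    'mob': [17, 18, 19, 17, 17, 19, 13, 15],
    'app': [10, 7, 15, 8, 12, 4, 10, 6],
})

s = df1.merge(df2, on='id', how='left')
output = (s[(s.start <= s.date) & (s.end >= s.date)]
          .groupby('id')[['web', 'mob', 'app']].sum())

# reindex lines the sums up with df1's ids; absent ids become NaN, then 0
df1[['web', 'mob', 'app']] = output.reindex(df1['id']).fillna(0).to_numpy()
```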

1 Comment

Perfect! That's much better. I was overcomplicating things with the loops.

This is one way, but it is not "pandonic". It assumes your date columns are already converted to datetime. But use @Wen's vectorised solution.

def filtersum(row):

    result = [(w, m, a) for i, w, m, a, d  in \
              zip(df2.id, df2.web, df2.mob, df2.app, df2.date) \
              if i == row['id'] and (row['start'] <= d <= row['end'])]

    return [sum(i) for i in (zip(*result))] if result else [0, 0, 0]

df1[['web', 'mob', 'app']] = df1.apply(filtersum, axis=1, result_type='expand')

#       id      start        end  web  mob  app
# 0  12345 2018-01-17 2018-01-20   25   71   40
# 1  12346 2018-01-19 2018-01-22   16   60   39
# 2  12347 2018-01-20 2018-01-23    0    0    0
# 3  12348 2018-01-20 2018-01-23    0    0    0
# 4  12349 2018-01-21 2018-01-24    0    0    0
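On the question's numpy idea: rather than numpy.where, one vectorised route is a broadcasted boolean match matrix collapsed with a matrix product. A sketch with the question's data; note the mask is len(df1) × len(df2), so this assumes both frames fit comfortably in memory:

```python
import numpy as np
import pandas as pd

# The question's frames (dates already parsed)
df1 = pd.DataFrame({
    'id': [12345, 12346, 12347, 12348, 12349],
    'start': pd.to_datetime(['2018-01-17', '2018-01-19', '2018-01-20',
                             '2018-01-20', '2018-01-21']),
    'end':   pd.to_datetime(['2018-01-20', '2018-01-22', '2018-01-23',
                             '2018-01-23', '2018-01-24']),
})
dates = pd.date_range('2018-01-17', '2018-01-24')
df2 = pd.DataFrame({
    'id': [12345] * 8 + [12346] * 8,
    'date': list(dates) * 2,
    'web': [7, 9, 3, 6, 8, 4, 8, 9, 3, 4, 6, 1, 4, 5, 8, 6],
    'mob': [17, 18, 19, 17, 9, 15, 11, 16, 17, 19, 13, 15, 12, 20, 13, 18],
    'app': [10, 7, 15, 8, 13, 12, 13, 14, 12, 4, 10, 6, 11, 12, 14, 8],
})

# (len(df1), len(df2)) boolean matrix: entry [i, j] is True when
# df2 row j matches df1 row i's id and falls in its date window
match = ((df1['id'].to_numpy()[:, None] == df2['id'].to_numpy())
         & (df1['start'].to_numpy()[:, None] <= df2['date'].to_numpy())
         & (df1['end'].to_numpy()[:, None] >= df2['date'].to_numpy()))

# The matrix product sums the matching df2 rows for every df1 row at once
df1[['web', 'mob', 'app']] = match.astype(np.int64) @ df2[['web', 'mob', 'app']].to_numpy()
```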
