I want to populate the empty columns 'web' 'mob 'app' by summing for each of the relevant dates in df2
df1:
id start end web mob app
12345 2018-01-17 2018-01-20
12346 2018-01-19 2018-01-22
12347 2018-01-20 2018-01-23
12348 2018-01-20 2018-01-23
12349 2018-01-21 2018-01-24
df2:
id date web mob app
12345 2018-01-17 7 17 10
12345 2018-01-18 9 18 7
12345 2018-01-19 3 19 15
12345 2018-01-20 6 17 8
12345 2018-01-21 8 9 13
12345 2018-01-22 4 15 12
12345 2018-01-23 8 11 13
12345 2018-01-24 9 16 14
12346 2018-01-17 3 17 12
12346 2018-01-18 4 19 4
12346 2018-01-19 6 13 10
12346 2018-01-20 1 15 6
12346 2018-01-21 4 12 11
12346 2018-01-22 5 20 12
12346 2018-01-23 8 13 14
12346 2018-01-24 6 18 8
This for loop will populate the 'web' column:
column = []
for i in df1.index:
column.append(df2[(df2['date'] >= df1['start'].iloc[i])
& (df2['date'] <= df1['end'].iloc[i])
& (df2['id'] == df1['id'].iloc[i])].sum()['web'])
df1['web'] = column
I want to be able to populate all 3 columns with one for loop, rather than doing 3 separate loops.
I have a feeling that using something like appending this
.agg({'web':'sum', 'mob':'sum', 'app':'sum'})
to a 2 dimensional list could be the answer.
Also... is there a more efficient way to do this than using for loops? Maybe by using numpy.where? I'm finding that running multiple for loops over large data sets can be very very slow.