I'm working with a large dataset and having trouble coding the conditions for the following task.
The following is an example similar to my own problem. I'm trying to calculate how quickly a substance travels through a medium. Every year, a substance is inserted into each medium (identified by its id). The goal is to calculate the "year of arrival" for each insertion. The distance travelled within each medium has been calculated as a percentage [%] of the medium for every year.
My dataset looks similar to the following:
import pandas as pd
ids = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
year = [2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005]
traveldistance = [120, 70, 37, 40, 50, 110, 140, 100, 90, 5, 52, 80, 60, 40, 70, 60, 50, 110]
# The id column is named "medium id" so that it matches the printed output
dictex = {"medium id": ids, "year of insertion": year, "travel distance [%]": traveldistance}
dfex = pd.DataFrame(dictex)
print(dfex)
    medium id  year of insertion  travel distance [%]
0           1               2000                  120
1           1               2001                   70
2           1               2002                   37
3           1               2003                   40
4           1               2004                   50
5           1               2005                  110
6           2               2000                  140
7           2               2001                  100
8           2               2002                   90
9           2               2003                    5
10          2               2004                   52
11          2               2005                   80
12          3               2000                   60
13          3               2001                   40
14          3               2002                   70
15          3               2003                   60
16          3               2004                   50
17          3               2005                  110
There are several conditions to be considered:
- A substance can't begin to travel in the year of its insertion into the medium (i.e. a substance inserted in year 2000 can only begin to travel in year 2001). Thus, in this example a substance inserted in year 2005 cannot reach the destination within the observed timeframe.
- The year of arrival is calculated by summing the travel distance [%] of the years following the insertion. Once the cumulative travel distance reaches >= 100%, the substance has arrived at its destination on the other side of the medium. That year is the year of arrival and should be added in a new column.
- If the cumulative travel distance doesn't reach 100% during the observed timespan, the result should be NaN.
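One possible way to code these three conditions, as a sketch only: take the cumulative sum of the travel distance per medium, then for each insertion look up the first later year in which the cumulative sum has grown by at least 100. The helper name `arrival_years` and its `threshold` parameter are made up for illustration, the column name "medium id" is taken from the printed output, and the lookup assumes that the rows of each medium are sorted by year and that no travel distance is negative.

```python
import numpy as np
import pandas as pd


def arrival_years(years, distances, threshold=100):
    """Year of arrival for every insertion within ONE medium.

    Hypothetical helper (not from the original post). Assumes the rows are
    sorted by year and that all distances are non-negative, so the running
    total never decreases.
    """
    years = np.asarray(years)
    csum = np.cumsum(distances)
    # A substance inserted at position i has travelled csum[j] - csum[i] by
    # position j > i (its own insertion year is excluded), so it arrives at
    # the first j with csum[j] >= csum[i] + threshold.
    first_hit = np.searchsorted(csum, csum + threshold, side="left")
    arrived = first_hit < len(years)
    # Insertions that never reach the threshold stay NaN.
    result = np.full(len(years), np.nan)
    result[arrived] = years[first_hit[arrived]]
    return result


# Same example data as in the question.
dfex = pd.DataFrame({
    "medium id": [1] * 6 + [2] * 6 + [3] * 6,
    "year of insertion": list(range(2000, 2006)) * 3,
    "travel distance [%]": [120, 70, 37, 40, 50, 110,
                            140, 100, 90, 5, 52, 80,
                            60, 40, 70, 60, 50, 110],
})

dfex = dfex.sort_values(["medium id", "year of insertion"])
dfex["Year of arrival"] = np.nan
for _, group in dfex.groupby("medium id"):
    # Fill in the arrival years one medium at a time.
    dfex.loc[group.index, "Year of arrival"] = arrival_years(
        group["year of insertion"].to_numpy(),
        group["travel distance [%]"].to_numpy(),
    )
print(dfex)
```

The cumulative-sum lookup avoids a nested loop over all pairs of years, which may matter for a large dataset. If negative distances can occur, the running total is no longer sorted and a plain loop per insertion would be the safer choice.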
Example:
a) For medium id == 1, the first year of insertion is year 2000. The substance then begins to travel in year 2001 and travels through 70 % of the medium. In year 2002 it travels another 37 %: 70+37 = 107% >= 100%, therefore the year of arrival for the first substance is year 2002.
b) In year 2001, the second substance is inserted in medium id == 1. It begins to travel in year 2002 and travels through 37 % of the medium. In year 2003 it travels through another 40 %: 37+40 = 77 % < 100%. In year 2004 the substance travels through 50 % of the medium: 37+40+50 = 127% >= 100%, therefore the year of arrival for the second substance is year 2004.
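The arithmetic of examples a) and b) can be traced step by step with a few lines of plain Python. This is only an illustration for medium id == 1; the function name `trace` and its `threshold` parameter are made up for this sketch.

```python
# Travel distance [%] per year for medium id == 1 (from the example data).
years = [2000, 2001, 2002, 2003, 2004, 2005]
distance = [120, 70, 37, 40, 50, 110]


def trace(insert_year, threshold=100):
    """Return (year of arrival or None, running totals per year).

    Hypothetical helper: sums the distances of the years AFTER the
    insertion year and stops once the total reaches the threshold.
    """
    start = years.index(insert_year) + 1  # travel starts the following year
    total = 0
    steps = []
    for year, travelled in zip(years[start:], distance[start:]):
        total += travelled
        steps.append((year, total))
        if total >= threshold:
            return year, steps
    return None, steps  # never arrived within the observed years


print(trace(2000))  # (2002, [(2001, 70), (2002, 107)])
print(trace(2001))  # (2004, [(2002, 37), (2003, 77), (2004, 127)])
print(trace(2005))  # (None, [])
```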
The result should look like this:
    medium id  year of insertion  travel distance [%]  Year of arrival
0           1               2000                  120           2002.0
1           1               2001                   70           2004.0
2           1               2002                   37           2005.0
3           1               2003                   40           2005.0
4           1               2004                   50           2005.0
5           1               2005                  110              NaN
6           2               2000                  140           2001.0
7           2               2001                  100           2004.0
8           2               2002                   90           2005.0
9           2               2003                    5           2005.0
10          2               2004                   52              NaN
11          2               2005                   80              NaN
12          3               2000                   60           2002.0
13          3               2001                   40           2003.0
14          3               2002                   70           2004.0
15          3               2003                   60           2005.0
16          3               2004                   50           2005.0
17          3               2005                  110              NaN
Any help would be much appreciated!
traveldistance = [120,70,37,40,20,... but the dataframe you print has 2004: 50 instead of that last 20. Check the code before you post/edit a question.