
I'm working with a large dataset and am having trouble coding the conditions for the following task:

The following is an example similar to my own problem. I'm trying to calculate how quickly a substance travels through a medium. Each year and for each id, a substance is inserted into the medium. The goal is to calculate the "year of arrival" for each insertion. The travel distance of the substances within each medium has been calculated in [%] for every year.

My dataset looks similar to the following:

import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,2000,2001,2002,2003,2004,2005,2000,2001,2002,2003,2004,2005]
traveldistance = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]

dictex = {"id": ids, "year of insertion": year, "travel distance [%]": traveldistance}
dfex = pd.DataFrame(dictex)

print(dfex)
    medium id  year of insertion  travel distance [%]
0           1               2000                  120
1           1               2001                   70
2           1               2002                   37
3           1               2003                   40
4           1               2004                   50
5           1               2005                  110
6           2               2000                  140
7           2               2001                  100
8           2               2002                   90
9           2               2003                    5
10          2               2004                   52
11          2               2005                   80
12          3               2000                   60
13          3               2001                   40
14          3               2002                   70
15          3               2003                   60
16          3               2004                   50
17          3               2005                  110

There are several conditions to be considered:

  1. A substance can't begin to travel in the year of its insertion into the medium (i.e. a substance inserted in year 2000 can only begin to travel in year 2001). Thus, in this example a substance inserted in year 2005 cannot reach the destination within the observed timeframe.
  2. The year of arrival is calculated by adding up the travel distance [%] of the following years. Once a cumulative travel distance >= 100% is reached, the substance has arrived at its destination on the other side of the medium. That year is the year of arrival and should be added in a new column.
  3. If the travel distance doesn't reach 100% during the observed timespan, the result should be [NaN].
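
For reference, the three conditions can be written as a plain double loop over a single medium. This is only a minimal sketch to pin down the expected behaviour, not a fast solution for a large dataset; the function name `arrival_years` and the `cutoff` parameter are made up for illustration.

```python
import math

def arrival_years(years, distances, cutoff=100):
    """Arrival year for each insertion in one medium (NaN if never reached)."""
    result = []
    for i in range(len(years)):
        total = 0
        arrival = math.nan  # condition 3: stays NaN if the cutoff is never reached
        # Condition 1: travel only starts in the year after insertion.
        for j in range(i + 1, len(years)):
            total += distances[j]
            # Condition 2: the first year the running total reaches the cutoff.
            if total >= cutoff:
                arrival = years[j]
                break
        result.append(arrival)
    return result

years = [2000, 2001, 2002, 2003, 2004, 2005]
distances = [120, 70, 37, 40, 50, 110]  # medium id == 1
result = arrival_years(years, distances)
print(result)
```

For medium id == 1 this gives 2002, 2004, 2005, 2005, 2005 and NaN for the last insertion.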

Example:

a) For medium id == 1, the first year of insertion is year 2000. The substance then begins to travel in year 2001 and travels through 70 % of the medium. In year 2002 it travels another 37 %: 70+37 = 107% >= 100%, therefore the year of arrival for the first substance is year 2002.

b) In year 2001, the second substance is inserted in medium id == 1. It begins to travel in year 2002 and travels through 37 % of the medium. In year 2003 it travels through another 40 %: 37+40 = 77 % < 100%. In year 2004 the substance travels through 50 % of the medium: 37+40+50 = 127% >= 100%, therefore the year of arrival for the second substance is year 2004.
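
The arithmetic in examples a) and b) can be checked with running totals that start the year after insertion. A small sketch using only the standard library; `running_totals` is a hypothetical helper name.

```python
from itertools import accumulate

# Travel distance [%] per year for medium id == 1
distances = {2000: 120, 2001: 70, 2002: 37, 2003: 40, 2004: 50, 2005: 110}

def running_totals(insertion_year):
    """(year, cumulative distance) pairs, starting the year after insertion."""
    later = [(y, d) for y, d in distances.items() if y > insertion_year]
    totals = accumulate(d for _, d in later)
    return [(y, t) for (y, _), t in zip(later, totals)]

example_a = running_totals(2000)  # first pair with total >= 100 is year 2002
example_b = running_totals(2001)  # first pair with total >= 100 is year 2004
print(example_a)
print(example_b)
```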

The result should look like this:

     medium id  year of insertion  travel distance [%]  Year of arrival
0           1               2000                  120           2002.0
1           1               2001                   70           2004.0
2           1               2002                   37           2005.0
3           1               2003                   40           2005.0
4           1               2004                   50           2005.0
5           1               2005                  110              NaN
6           2               2000                  140           2001.0
7           2               2001                  100           2004.0
8           2               2002                   90           2005.0
9           2               2003                    5           2005.0
10          2               2004                   52              NaN
11          2               2005                   80              NaN
12          3               2000                   60           2002.0
13          3               2001                   40           2003.0
14          3               2002                   70           2004.0
15          3               2003                   60           2005.0
16          3               2004                   50           2005.0
17          3               2005                  110              NaN

Any help would be much appreciated!

  • Can you elaborate on point (2)? How is year of arrival calculated? Steps with examples will help Commented Sep 11, 2020 at 15:43
  • Thanks for the feedback, i added an example for clarification! Commented Sep 11, 2020 at 16:08
  • notice your sample data has traveldistance = [120,70,37,40,20,... but the dataframe you print has 2004: 50 instead of that last 20, check the code before you post/edit a question Commented Sep 15, 2020 at 12:50
  • @RichieV Thank you for your answer! I edited the example once and must have missed that value. Going to try out your solution now, thank you for taking the time and effort to reply! Commented Sep 15, 2020 at 12:57
  • no worries, I like challenges as a training exercise, we all gain in the end Commented Sep 15, 2020 at 13:00

1 Answer

I'm not aware of any pandas built-in method that targets this specific case. But here's a solution with apply and some numpy handling.

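import numpy as np

def rolling_fwd_idx_over(df, group_by_cols, value_col, target_col, cutoff=100):
    def find_cross(group):
        travel = group[value_col].to_numpy()
        # Repeat the group's values as the rows of an n x n matrix
        travel = np.broadcast_to(travel, (travel.size, travel.size))
        # Row i keeps only the values after position i, then accumulates them
        travel = np.triu(travel, 1).cumsum(axis=1)
        # First column where the running total reaches the cutoff (0 if none does)
        idx = np.argmax(travel >= cutoff, axis=1)
        # Keep the target value only where the cutoff was actually reached
        out = np.where(
            travel[range(travel.shape[0]), idx] >= cutoff,
            group[target_col].to_numpy()[idx],
            np.nan
        )
        return out

    # Assumes df has a default RangeIndex and its rows are already sorted by the group key
    df['result'] = (
        df.groupby(group_by_cols).apply(find_cross).explode()
            .reset_index(drop=True)
    )
    return df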
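
To see what the matrix steps inside find_cross do, here they are unrolled for medium id == 1 only. This is just an illustration of the intermediate arrays; the variable names are mine and not part of the function above.

```python
import numpy as np

travel = np.array([120, 70, 37, 40, 50, 110])  # medium id == 1
years = np.arange(2000, 2006)

square = np.broadcast_to(travel, (travel.size, travel.size))
upper = np.triu(square, 1)      # row i keeps only the years after insertion i
running = upper.cumsum(axis=1)  # row i: cumulative distance from year i + 1 onwards
first_hit = np.argmax(running >= 100, axis=1)  # falls back to 0 when no column qualifies
reached = running[np.arange(travel.size), first_hit] >= 100
arrival = np.where(reached, years[first_hit], np.nan)

print(running[0])  # running totals for the substance inserted in 2000
print(arrival)
```

The last row never reaches the cutoff, so argmax returns 0 for it and the `reached` mask turns that entry into NaN. Note that the matrix has n x n entries per group, which may matter if a single id covers many years.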

Use it as

dfex = rolling_fwd_idx_over(
    dfex, 'id', 'travel distance [%]', 'year of insertion')
dfex.rename(columns={'result': 'Year of arrival'}, inplace=True)

Output

    id  year of insertion  travel distance [%] Year of arrival
0    1               2000                  120            2002
1    1               2001                   70            2004
2    1               2002                   37            2005
3    1               2003                   40            2005
4    1               2004                   50            2005
5    1               2005                  110             NaN
6    2               2000                  140            2001
7    2               2001                  100            2004
8    2               2002                   90            2005
9    2               2003                    5            2005
10   2               2004                   52             NaN
11   2               2005                   80             NaN
12   3               2000                   60            2002
13   3               2001                   40            2003
14   3               2002                   70            2004
15   3               2003                   60            2005
16   3               2004                   50            2005
17   3               2005                  110             NaN
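
Since the question mentions a large dataset, here is a possible alternative that avoids the n x n matrix by running np.searchsorted on each group's cumulative sum. Treat it as a sketch under assumptions: it is a different technique from the function above, it assumes travel distances are never negative (so the cumulative sum is non-decreasing), a positive cutoff, and rows ordered by year within each id. The helper name `arrival_by_searchsorted` is hypothetical.

```python
import numpy as np
import pandas as pd

def arrival_by_searchsorted(group, value_col, year_col, cutoff=100):
    """Arrival year per row of one group, NaN where the cutoff is never reached."""
    travel = group[value_col].to_numpy()
    years = group[year_col].to_numpy()
    csum = travel.cumsum()
    # Distance travelled after row i up to row j is csum[j] - csum[i], so we need
    # the first j with csum[j] >= csum[i] + cutoff (assumes non-negative distances).
    pos = np.searchsorted(csum, csum + cutoff, side="left")
    out = np.full(len(group), np.nan)
    found = pos < len(group)  # pos == len(group) means the cutoff is never reached
    out[found] = years[pos[found]]
    return pd.Series(out, index=group.index)

df = pd.DataFrame({
    "id": [1] * 6 + [2] * 6 + [3] * 6,
    "year of insertion": list(range(2000, 2006)) * 3,
    "travel distance [%]": [120, 70, 37, 40, 50, 110,
                            140, 100, 90, 5, 52, 80,
                            60, 40, 70, 60, 50, 110],
})
# Concatenating the per-group Series keeps the original index, so assignment aligns by row.
df["Year of arrival"] = pd.concat(
    [arrival_by_searchsorted(g, "travel distance [%]", "year of insertion")
     for _, g in df.groupby("id")]
)
print(df)
```

Because each per-group result carries the group's original index, this variant does not depend on the frame having a default index.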