I have a dataframe which contains two columns. One column shows distance, the other column contains unique 'trackIds' that are associated with a set of distances.

Example:

    trackId       distance
    
      2.           17.452
      2.            8.650
      2.           10.392
      2.           11.667
      2.           23.551
      2.            9.881
      3.            6.052
      3.            7.241
      3.            8.459
      3.           22.644
      3.          126.890
      3.           12.442
      3.            5.891
      4.           44.781
      4.            7.657
      4.           36.781
      4.          224.001

What I am trying to do is eliminate any trackIds that contain a large spike in distance, meaning a spike that is > 75. In this example, trackIds 3 and 4 (and all their associated distances) would be removed from the dataframe because they contain spikes in distance greater than 75. We would be left with a dataframe containing only trackId 2 and all of its associated distance values.

Here is my code:

    i = 0
    k = 1
    length = len(dataframe)
    while i < length: 
        if (dataframe.distance[k] - dataframe.distance[i]) > 75: 
        bad_id = dataframe.trackId[k]
        condition = dataframe.trackid != bad_id
        df2 = dataframe[condition]
    i+=1

I tried to use a while loop that goes through all the different trackIds, subtracts the distance values, and checks whether the result is > 75. If it is, the program assigns that trackId to the variable 'bad_id' and uses it as a condition to filter the dataframe so that it only contains trackIds that are not equal to the bad_id(s).

I keep getting NameErrors because I'm unsure how to structure the loop properly, and in general I'm not sure whether this approach works at all.

1 Answer

We can use diff to compute the difference between each row and the one before it, then use groupby with transform('any') to flag every row of a group that contains at least one difference greater than 75. Negating that mask keeps only the groups without any such spike:

m = ~(df['distance'].diff().gt(75).groupby(df['trackId']).transform('any'))
filtered_df = df.loc[m, df.columns]

filtered_df:

    trackId  distance
0       2.0    17.452
1       2.0     8.650
2       2.0    10.392
3       2.0    11.667
4       2.0    23.551
5       2.0     9.881
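
To reproduce the result above end to end, here is a minimal self-contained sketch. It rebuilds the example data from the question and wraps the same diff / groupby / transform logic in a function. The name drop_spiky_tracks and its threshold parameter are hypothetical, introduced here only for illustration.

```python
import pandas as pd

# Example data copied from the question (17 rows, trackIds 2, 3 and 4).
df = pd.DataFrame({
    "trackId": [2.0] * 6 + [3.0] * 7 + [4.0] * 4,
    "distance": [
        17.452, 8.650, 10.392, 11.667, 23.551, 9.881,
        6.052, 7.241, 8.459, 22.644, 126.890, 12.442, 5.891,
        44.781, 7.657, 36.781, 224.001,
    ],
})


def drop_spiky_tracks(frame, threshold=75):
    """Hypothetical helper: drop every trackId that has a row-to-row
    increase in distance greater than `threshold`."""
    # True for each row whose distance rose by more than `threshold`
    # relative to the previous row, then broadcast to the whole group.
    has_spike = (
        frame["distance"].diff().gt(threshold)
        .groupby(frame["trackId"]).transform("any")
    )
    # Keep only rows belonging to groups without a spike.
    return frame[~has_spike]


filtered_df = drop_spiky_tracks(df)
print(filtered_df)
```

Indexing with frame[~has_spike] selects the same rows as df.loc[m, df.columns] in the answer; it is just a shorter spelling.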

Breakdown of steps as a DataFrame:

breakdown = pd.DataFrame({'diff': df['distance'].diff()})
breakdown['gt 75'] = breakdown['diff'].gt(75)
breakdown['groupby any'] = (
    breakdown['gt 75'].groupby(df['trackId']).transform('any')
)
breakdown['negation'] = ~breakdown['groupby any']
print(breakdown)

breakdown:

       diff  gt 75  groupby any  negation
0       NaN  False        False      True
1    -8.802  False        False      True
2     1.742  False        False      True
3     1.275  False        False      True
4    11.884  False        False      True
5   -13.670  False        False      True
6    -3.829  False         True     False
7     1.189  False         True     False
8     1.218  False         True     False
9    14.185  False         True     False
10  104.246   True         True     False  # Spike of more than 75
11 -114.448  False         True     False
12   -6.551  False         True     False
13   38.890  False         True     False
14  -37.124  False         True     False
15   29.124  False         True     False
16  187.220   True         True     False  # Spike of more than 75
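
One thing the breakdown makes visible: rows 6 and 13 hold differences taken across a trackId boundary (the first distance of one track minus the last distance of the previous track). With the example data this does no harm, but a large jump between two tracks would flag the second track even though nothing spiked inside it. If that matters for your data, a possible variant (not part of the answer above, and assuming a spike should only be measured within a single track) is to compute diff per group. The function name and the small demo frame below are made up for illustration.

```python
import pandas as pd


def drop_spiky_tracks_per_group(frame, threshold=75):
    """Hypothetical variant: only compare rows that share a trackId."""
    # diff restarts at NaN on the first row of every trackId.
    per_track_diff = frame.groupby("trackId")["distance"].diff()
    has_spike = (
        per_track_diff.gt(threshold)
        .groupby(frame["trackId"]).transform("any")
    )
    return frame[~has_spike]


# Made-up data: no spike inside either track, but a jump of 94
# between the last row of track 1 and the first row of track 2.
demo = pd.DataFrame({
    "trackId": [1, 1, 2, 2],
    "distance": [5.0, 6.0, 100.0, 101.0],
})

# Frame-wide diff, as in the answer: track 2 is flagged by the boundary jump.
global_mask = (
    demo["distance"].diff().gt(75)
    .groupby(demo["trackId"]).transform("any")
)
kept_global = demo[~global_mask]

# Per-group diff: both tracks survive.
kept_per_group = drop_spiky_tracks_per_group(demo)

print(kept_global["trackId"].unique())
print(kept_per_group["trackId"].unique())
```

Both versions only catch upward jumps, because gt(75) ignores negative differences. That matches the answer as written; whether a drop of more than 75 should also count is a decision about your data.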