
I am fairly new to Pandas and desperately need help to modify my dataframe by comparing consecutive rows within multiple group pairings:

Example dataframe:

idData  idStation  idCast  Pressure
1       1          1       1505
2       1          1       1506
3       1          1       1507
4       1          1       1508
5       1          2       1505
6       1          2       1506
7       1          2       1503
8       1          2       1504
9       2          1       1505
10      2          1       1506
11      2          1       1507

etc.

I want to delete any rows where the Pressure value is less than the values above it, within each idStation and idCast pair (i.e. the rows where idData = 7 and 8 need to be deleted). I don't want to compare across different idStation & idCast pairs - i.e. the first iteration of the loop would look through the Pressure record for idStation = 1, idCast = 1, the second iteration would look through the record for idStation = 1, idCast = 2, and so on. I have tried grouping by idStation and idCast, then looping over the groups and comparing row by row, but this only modifies a copy of each group, not the original dataframe, so the changes are lost.

import numpy as np

Stn_Cast_Group = Dataframe.groupby(['idStation', 'idCast'])
for name, group in Stn_Cast_Group:
    j = 0
    for i in range(1, len(group['Pressure'])):
        if group['Pressure'].iloc[i] < j:
            group['Pressure'].iloc[i] = np.nan
        else:
            j = group['Pressure'].iloc[i]

This labels the correct pressure values as NaN (as I am unsure of how to delete the rows), but the changes only appear in the group view, not in the original dataframe.

How would you create a copy of the dataframe (so as to have access to original and modified versions) and then delete the rows as mentioned above?

1 Answer

You could use boolean indexing with the condition that for each group the Pressure value is not below the .cummax() for that group.

  1. Use .groupby() to work on unique subsets of ['idStation', 'idCast'].
  2. For each of these groups, check whether the current 'Pressure' value is below the running maximum (.cummax()), and invert the boolean (i.e. False if the value is below). The result is returned from groupby() as a flat array via .values.
  3. Use the result of 2. in boolean indexing to drop the rows that met the condition.
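To see what step 2 computes, here is a small sketch on the sample data from the question (the dataframe is reconstructed from the table above), looking at just the idStation = 1, idCast = 2 group:

```python
import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    'idData':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'idStation': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'idCast':    [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1],
    'Pressure':  [1505, 1506, 1503, 1508, 1505, 1506, 1503, 1504,
                  1505, 1506, 1507],
})
# Correct the third value back to the question's data (1507, not 1503).
df.loc[2, 'Pressure'] = 1507

# One group: Pressure values are [1505, 1506, 1503, 1504].
group = df[(df.idStation == 1) & (df.idCast == 2)]

# Running maximum within the group.
print(group.Pressure.cummax().tolist())
# [1505, 1506, 1506, 1506]

# True where the value dropped below the running max (these rows get removed).
print((group.Pressure < group.Pressure.cummax()).tolist())
# [False, False, True, True]
```

The last two rows of this group (idData = 7 and 8) are the ones flagged for deletion, matching the question.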

In pandas code, this looks like:

df[~df.groupby(['idStation', 'idCast']).apply(lambda x: x.Pressure < x.Pressure.cummax()).values]

to get:

    idData  idStation  idCast  Pressure
0        1          1       1      1505
1        2          1       1      1506
2        3          1       1      1507
3        4          1       1      1508
4        5          1       2      1505
5        6          1       2      1506
8        9          2       1      1505
9       10          2       1      1506
10      11          2       1      1507

You can also eliminate the .apply() step as suggested by @MaxU for a ~40% speed increase for the toy data:

df[df.Pressure >= df.groupby(['idStation','idCast'])['Pressure'].cummax()]
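To also answer the "copy" part of the question: assigning the filtered result to a new variable and calling .copy() gives you an independent modified dataframe while leaving the original untouched. A minimal sketch, assuming the same sample data as above:

```python
import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    'idData':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'idStation': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'idCast':    [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1],
    'Pressure':  [1505, 1506, 1507, 1508, 1505, 1506, 1503, 1504,
                  1505, 1506, 1507],
})

# Keep rows whose Pressure has not dropped below the running max of its group.
mask = df.Pressure >= df.groupby(['idStation', 'idCast'])['Pressure'].cummax()

# .copy() makes df_clean an independent dataframe, not a view of df.
df_clean = df[mask].copy()

print(len(df))        # 11 - original is unchanged
print(len(df_clean))  # 9  - rows idData 7 and 8 removed
```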

3 Comments

I think this version (without apply) should be faster: df[df.Pressure >= df.groupby(['idStation','idCast'])['Pressure'].cummax()]
Amazing! That worked like a charm. How does the first half of the condition 'df.Pressure' specify that you are talking about each idStation/ idCast group - it looks as though you are just specifying the entire Pressure column?
The .groupby() does this. Added some explanation above.
