
I am fairly new to Pandas and desperately need help to modify my dataframe by comparing consecutive rows within multiple group pairings:

Example dataframe:

idData  idStation  idCast  Pressure
1       1          1       1505
2       1          1       1506
3       1          1       1507
4       1          1       1508
5       1          2       1505
6       1          2       1506
7       1          2       1503
8       1          2       1504
9       2          1       1505
10      2          1       1506
11      2          1       1507

etc.

I want to delete any rows where the Pressure value is less than the values above it, within each idStation and idCast pair (i.e. the rows where idData = 7 and 8 need to be deleted). I don't want to compare across different idStation & idCast pairs - i.e. the first iteration of the loop would look through the Pressure record for idStation = 1, idCast = 1, the second iteration would look through the record for idStation = 1, idCast = 2, and so on. I have tried grouping by idStation and idCast, then looping over the groups and comparing row by row, but this only modifies a copy of each group, not the original dataframe, so the changes are lost.

import numpy as np

Stn_Cast_Group = Dataframe.groupby(['idStation', 'idCast'])
for name, group in Stn_Cast_Group:
    j = 0
    for i in range(1, len(group['Pressure'])):
        if group['Pressure'].iloc[i] < j:
            group['Pressure'].iloc[i] = np.nan
        else:
            j = group['Pressure'].iloc[i]

This labels the correct pressure values as NaN (as I am unsure of how to delete the rows), but the changes only appear in the group view, not in the original dataframe.

How would you create a copy of the dataframe (so as to have access to original and modified versions) and then delete the rows as mentioned above?

1 Answer

You could use boolean indexing with the condition that for each group the Pressure value is not below the .cummax() for that group.

  1. Use .groupby() to work on unique subsets of ['idStation', 'idCast'].
  2. For each of these groups, check whether the current 'Pressure' value is below the running maximum (.cummax()), and invert the boolean (i.e. False if the value is below). The result is returned from groupby() as a flat array via .values.
  3. Use the result of 2. in boolean indexing to drop the rows that met the condition.
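To see what step 2 computes, here is a small sketch on the sample data from the question (the dataframe is reconstructed from the table above), looking at just the idStation = 1, idCast = 2 group:

```python
import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    'idData':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'idStation': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'idCast':    [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1],
    'Pressure':  [1505, 1506, 1503, 1508, 1505, 1506, 1503, 1504,
                  1505, 1506, 1507],
})
# Correct the third value back to the question's data (1507, not 1503).
df.loc[2, 'Pressure'] = 1507

# One group: Pressure values are [1505, 1506, 1503, 1504].
group = df[(df.idStation == 1) & (df.idCast == 2)]

# Running maximum within the group.
print(group.Pressure.cummax().tolist())
# [1505, 1506, 1506, 1506]

# True where the value dropped below the running max (these rows get removed).
print((group.Pressure < group.Pressure.cummax()).tolist())
# [False, False, True, True]
```

The last two rows of this group (idData = 7 and 8) are the ones flagged for deletion, matching the question.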

In pandas code, this looks like:

df[~df.groupby(['idStation', 'idCast']).apply(lambda x: x.Pressure < x.Pressure.cummax()).values]

to get:

    idData  idStation  idCast  Pressure
0        1          1       1      1505
1        2          1       1      1506
2        3          1       1      1507
3        4          1       1      1508
4        5          1       2      1505
5        6          1       2      1506
8        9          2       1      1505
9       10          2       1      1506
10      11          2       1      1507

You can also eliminate the .apply() step as suggested by @MaxU for a ~40% speed increase for the toy data:

df[df.Pressure >= df.groupby(['idStation','idCast'])['Pressure'].cummax()]
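To also answer the "copy" part of the question: assigning the filtered result to a new variable and calling .copy() gives you an independent modified dataframe while leaving the original untouched. A minimal sketch, assuming the same sample data as above:

```python
import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    'idData':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'idStation': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'idCast':    [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1],
    'Pressure':  [1505, 1506, 1507, 1508, 1505, 1506, 1503, 1504,
                  1505, 1506, 1507],
})

# Keep rows whose Pressure has not dropped below the running max of its group.
mask = df.Pressure >= df.groupby(['idStation', 'idCast'])['Pressure'].cummax()

# .copy() makes df_clean an independent dataframe, not a view of df.
df_clean = df[mask].copy()

print(len(df))        # 11 - original is unchanged
print(len(df_clean))  # 9  - rows idData 7 and 8 removed
```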

3 Comments

I think this version (without apply) should be faster: df[df.Pressure >= df.groupby(['idStation','idCast'])['Pressure'].cummax()]
Amazing! That worked like a charm. How does the first half of the condition 'df.Pressure' specify that you are talking about each idStation/ idCast group - it looks as though you are just specifying the entire Pressure column?
The .groupby() does this. Added some explanation above.
