I'm trying to perform an action on grouped data in Pandas. For each group, I want to loop through the rows and compare them to the first row in the group. If conditions are met, then I want to print out the row details. My data looks like this:

Orig  Dest  Route  Vol    Per   VolPct
ORD   ICN   A      2,251  0.64  0.78
ORD   ICN   B      366    0.97  0.13
ORD   ICN   C      142    0.14  0.05
DCA   FRA   A      9,059  0.71  0.85
DCA   FRA   B      1,348  0.92  0.13
DCA   FRA   C      281    0.8   0.03

My groups are Orig, Dest pairs. If a row in the group other than the first row has a Per greater than the first row and a VolPct greater than .1, I want to output the grouped pair and the route. In this example, the output would be:

ORD ICN B
DCA FRA B

My attempted code is as follows:

for lane in otp.groupby(otp['Orig','Dest']):
    X = lane.first(['Per'])
    for row in lane:
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row['Orig','Dest','Route'])

However, this isn't working so I'm obviously not doing something right. I'm also not sure how to tell Python to ignore the first row when in the "row in lane" loop. Any ideas? Thanks!

1 Answer

You are pretty close as it is.

First, you are calling groupby incorrectly. You should just pass a list of the column names instead of a DataFrame object. So, instead of otp.groupby(otp['Orig','Dest']) you should use otp.groupby(['Orig','Dest']).

Once you are looping through the groups you will hit more issues. A group in a groupby object is actually a tuple. The first item in that tuple is the grouping key and the second is the grouped data. For example your first group would be the following tuple:

(('DCA', 'FRA'),   Orig Dest Route    Vol   Per  VolPct
 3  DCA  FRA     A  9,059  0.71    0.85
 4  DCA  FRA     B  1,348  0.92    0.13
 5  DCA  FRA     C    281  0.80    0.03)
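
To see the (key, group) tuple structure for yourself, here is a small sketch. It rebuilds the sample data from the question by hand; storing Vol as plain integers is an assumption made for the example, the other values are copied from the table above.

```python
import pandas as pd

# Sample data from the question (Vol stored as integers: an assumption).
otp = pd.DataFrame({
    "Orig": ["ORD", "ORD", "ORD", "DCA", "DCA", "DCA"],
    "Dest": ["ICN", "ICN", "ICN", "FRA", "FRA", "FRA"],
    "Route": ["A", "B", "C", "A", "B", "C"],
    "Vol": [2251, 366, 142, 9059, 1348, 281],
    "Per": [0.64, 0.97, 0.14, 0.71, 0.92, 0.80],
    "VolPct": [0.78, 0.13, 0.05, 0.85, 0.13, 0.03],
})

# Iterating a groupby object yields (key, group) tuples.
groups = list(otp.groupby(["Orig", "Dest"]))

# By default the groups are sorted by key, so DCA/FRA comes first.
first_key, first_group = groups[0]
print(first_key)    # the grouping key, a tuple of (Orig, Dest)
print(first_group)  # the rows belonging to that key, as a DataFrame
```

The key is a tuple here because the grouping uses two columns.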

You will need to change the way you set X to reflect this. For example, X = lane.first(['Per']) should become X = lane[1].iloc[0].Per, or simply X = lane.iloc[0].Per if you unpack the tuple in the for statement. After that you only have a few minor errors in the way you iterate through the rows and access multiple columns in a row. To wrap it all up, your loop should look something like this:

for key, lane in otp.groupby(['Orig','Dest']):
    X = lane.iloc[0].Per
    for idx, row in lane.iterrows():
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row[['Orig','Dest','Route']])

Note that I use iterrows to iterate through the rows, and I use double brackets when accessing multiple columns in a DataFrame.
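
For anyone who wants to run this end to end, here is a self-contained sketch of the same loop with the sample data from the question typed in by hand. Two things are my own additions rather than part of the original code: sort=False, which keeps the groups in the order they first appear so the output order matches the question, and the matches list, which is only there to collect the results.

```python
import pandas as pd

# Sample data from the question (Vol stored as integers: an assumption).
otp = pd.DataFrame({
    "Orig": ["ORD", "ORD", "ORD", "DCA", "DCA", "DCA"],
    "Dest": ["ICN", "ICN", "ICN", "FRA", "FRA", "FRA"],
    "Route": ["A", "B", "C", "A", "B", "C"],
    "Vol": [2251, 366, 142, 9059, 1348, 281],
    "Per": [0.64, 0.97, 0.14, 0.71, 0.92, 0.80],
    "VolPct": [0.78, 0.13, 0.05, 0.85, 0.13, 0.03],
})

matches = []  # hypothetical helper list, collects (Orig, Dest, Route)

# sort=False keeps groups in order of first appearance (ORD/ICN first).
for (orig, dest), lane in otp.groupby(["Orig", "Dest"], sort=False):
    first_per = lane.iloc[0]["Per"]  # Per of the first row in the group
    for _, row in lane.iterrows():
        # Strict ">" means the first row can never match itself.
        if row["Per"] > first_per and row["VolPct"] > 0.1:
            matches.append((orig, dest, row["Route"]))
            print(orig, dest, row["Route"])
```

Printing the three values directly gives one line per match, whereas print(row[['Orig','Dest','Route']]) prints a small Series for each matching row.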

You don't really need to tell pandas to ignore the first row in each group: its Per is never strictly greater than itself, so it can never trigger your if statement. If you did want to skip it anyway, you could use lane[1:].iterrows(), or more explicitly lane.iloc[1:].iterrows().
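
If the data gets large, the row-by-row loop can be slow. As an alternative (this is not the approach above, just a sketch of a vectorized variant), you could broadcast each group's first Per back onto its rows with transform('first') and build a boolean mask. The not_first mask built from cumcount() explicitly excludes the first row of each group; the variable names are my own.

```python
import pandas as pd

# Sample data from the question (Vol stored as integers: an assumption).
otp = pd.DataFrame({
    "Orig": ["ORD", "ORD", "ORD", "DCA", "DCA", "DCA"],
    "Dest": ["ICN", "ICN", "ICN", "FRA", "FRA", "FRA"],
    "Route": ["A", "B", "C", "A", "B", "C"],
    "Vol": [2251, 366, 142, 9059, 1348, 281],
    "Per": [0.64, 0.97, 0.14, 0.71, 0.92, 0.80],
    "VolPct": [0.78, 0.13, 0.05, 0.85, 0.13, 0.03],
})

grouped = otp.groupby(["Orig", "Dest"], sort=False)

# Per of the first row in each group, repeated for every row of that group.
first_per = grouped["Per"].transform("first")

# cumcount() numbers rows within each group from 0, so > 0 skips the first.
not_first = grouped.cumcount() > 0

mask = not_first & (otp["Per"] > first_per) & (otp["VolPct"] > 0.1)
result = otp.loc[mask, ["Orig", "Dest", "Route"]]
print(result)
```

The result keeps the original row order, so no sorting option is needed to get ORD/ICN before DCA/FRA.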

2 Comments

This is great! Thank you for your help and the explanation. Why do you need to specify "key" and "idx" in the two for loops? Also when printing the columns for the row, why do you need to use double brackets?
No problem. key and idx aren't necessary, as you don't use them in this case. You could just as easily replace them with _ (a commonly used name for a value you ignore) or anything else. Both groupby and iterrows yield tuples as you iterate, and those names are placeholders for the first element of each tuple. In the case of groupby the tuple is (key, group), so I unpack it as such. If I had used your original syntax, lane in groupby(...), I would have to access the group element of the lane tuple with lane[1]. For me descriptive is better, so I go with the former.
