Using a pandas DataFrame, how to group by multiple columns and add a new column when data is missing

Question

I want to group a 7-column dataframe by the values in the first 3 columns (A, B, C), and then add a new column H that holds, for every row of a group, the value of the last column (G) from the row where the 4th column (D) equals 0.

The original dataframe looks like this:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  0  1546430400  25   24
 8    11018  20190102  1200  1  1546434000  21    3
 9    11018  20190102  1200  2  1546437600  13    4
 10   11018  20190102  1200  3  1546441200   7    3
 11   11018  20190102  1200  4  1546444800   2    1
 12   11018  20190102  1200  5  1546448400  -3    6
 13   11018  20190102  1200  6  1546452000  -7    2
 14   11035  20190103     0  0  1546473600 -15 -14
 15   11035  20190103     0  1  1546477200 -17 -11
 16   11035  20190103     0  2  1546480800 -20 -12
 17   11035  20190103     0  3  1546484400 -23 -16
 18   11035  20190103     0  4  1546488000 -26 -11
 19   11035  20190103     0  5  1546491600 -28 -11
 20   11035  20190103     0  6  1546495200 -27 -12
 21   11031  20190103  1100  0  1546516800   0   1
 22   11031  20190103  1100  1  1546520400   4  -7
 23   11031  20190103  1100  2  1546524000   5  -6
 24   11031  20190103  1100  3  1546527600   2 -16
 25   11031  20190103  1100  4  1546531200  -3 -14
 26   11031  20190103  1100  5  1546534800  -8 -12
 27   11031  20190103  1100  6  1546538400 -12 -14
 .
 .
 .
 .

And the new dataframe should be:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  0  1546430400  25   24   24
 8    11018  20190102  1200  1  1546434000  21    3   24
 9    11018  20190102  1200  2  1546437600  13    4   24
 10   11018  20190102  1200  3  1546441200   7    3   24
 11   11018  20190102  1200  4  1546444800   2    1   24
 12   11018  20190102  1200  5  1546448400  -3    6   24
 13   11018  20190102  1200  6  1546452000  -7    2   24
 14   11035  20190103     0  0  1546473600 -15 -14   -14
 15   11035  20190103     0  1  1546477200 -17 -11   -14
 16   11035  20190103     0  2  1546480800 -20 -12   -14
 17   11035  20190103     0  3  1546484400 -23 -16   -14
 18   11035  20190103     0  4  1546488000 -26 -11   -14
 19   11035  20190103     0  5  1546491600 -28 -11   -14
 20   11035  20190103     0  6  1546495200 -27 -12   -14
 21   11031  20190103  1100  0  1546516800   0   1     1
 22   11031  20190103  1100  1  1546520400   4  -7     1
 23   11031  20190103  1100  2  1546524000   5  -6     1
 24   11031  20190103  1100  3  1546527600   2 -16     1
 25   11031  20190103  1100  4  1546531200  -3 -14     1
 26   11031  20190103  1100  5  1546534800  -8 -12     1
 27   11031  20190103  1100  6  1546538400 -12 -14     1
 .
 .
 .
 .

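I already have a solution for the case where every group contains a row with D == 0:

def col_6(df):
    # Take G from the row where D == 0 and broadcast it to the whole group.
    # Raises IndexError when the group has no row with D == 0.
    df['H'] = df[df['D'] == 0]['G'].values[0]
    return df

df.groupby(['A','B','C']).apply(col_6)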

However, in some groups the row where the 4th column (D) equals 0 is missing. In that case, the new column H should be NaN for the remaining rows of the group (those with D = 1, 2, ...).

For example, given this original frame:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  1  1546434000  21    3
 8    11018  20190102  1200  2  1546437600  13    4
 9    11018  20190102  1200  3  1546441200   7    3
 10   11018  20190102  1200  4  1546444800   2    1
 11   11018  20190102  1200  5  1546448400  -3    6
 12   11018  20190102  1200  6  1546452000  -7    2

The final frame should then look like this:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  1  1546434000  21    3   nan
 8    11018  20190102  1200  2  1546437600  13    4   nan
 9    11018  20190102  1200  3  1546441200   7    3   nan
 10   11018  20190102  1200  4  1546444800   2    1   nan
 11   11018  20190102  1200  5  1546448400  -3    6   nan
 12   11018  20190102  1200  6  1546452000  -7    2   nan

Is there an efficient way to handle the missing rows (ideally based on the general solution above)?

Thanks a lot for help!

Is it always sequences of 0 to 6, where sometimes the 0 is missing? Are there other NaN values in column D?
– Josh Friedlander, Commented Jan 25, 2019 at 13:52
Yes, it is possible that other values in column D are missing as well.
– akann, Commented Jan 25, 2019 at 14:01
So this is getting quite confusing. Are the rows always ordered like in your example, so that you basically want the first row in the group each time (as jezrael suggested)? Maybe you first want to fill NaNs based on some rule? It might be helpful to know some of the thinking behind why you want this configuration.
– Josh Friedlander, Commented Jan 27, 2019 at 14:04

jezrael · Accepted Answer · 2019-01-25 09:53:59Z

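First keep only the rows where D == 0 and take the first G value per group, then add that value as a new column with DataFrame.join. Groups without a D == 0 row have no match in the join and therefore get NaN:

# One G value per (A, B, C) group, taken from the rows where D == 0
s = df[df['D'] == 0].groupby(['A','B','C'])['G'].first().rename('H')
# Join on the key columns; unmatched groups get NaN in H
df = df.join(s, on=['A','B','C'])
print(df)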
        A         B     C  D           E   F   G     H
0   11018  20190102     0  0  1546387200  37  34  34.0
1   11018  20190102     0  1  1546390800  33  36  34.0
2   11018  20190102     0  2  1546394400  19  19  34.0
3   11018  20190102     0  3  1546398000  17  26  34.0
4   11018  20190102     0  4  1546401600  16  26  34.0
5   11018  20190102     0  5  1546405200  13  23  34.0
6   11018  20190102     0  6  1546408800  11  15  34.0
7   11018  20190102  1200  1  1546434000  21   3   NaN
8   11018  20190102  1200  2  1546437600  13   4   NaN
9   11018  20190102  1200  3  1546441200   7   3   NaN
10  11018  20190102  1200  4  1546444800   2   1   NaN
11  11018  20190102  1200  5  1546448400  -3   6   NaN
12  11018  20190102  1200  6  1546452000  -7   2   NaN
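
For anyone who wants to try the approach above without the full data, here is a minimal self-contained sketch. The small frame and the variable names `keys`, `h`, and `result` are made up for illustration; only the filter, `groupby(...).first()`, and `join` steps come from the answer.

```python
import pandas as pd

# Hypothetical miniature of the question's data:
# group (1, 10, 0) has a D == 0 row, group (1, 10, 1200) does not.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [10, 10, 10, 10, 10],
    "C": [0, 0, 0, 1200, 1200],
    "D": [0, 1, 2, 1, 2],
    "G": [34, 36, 19, 3, 4],
})

keys = ["A", "B", "C"]

# One G value per group, taken only from rows where D == 0.
h = df.loc[df["D"] == 0].groupby(keys)["G"].first().rename("H")

# join is a left join by default, so groups absent from h get NaN.
result = df.join(h, on=keys)
print(result)
```

In this sketch H is 34.0 for the first group and NaN for the second. The column becomes float because NaN cannot be stored in a plain integer column, which matches the `34.0` values in the printed output above.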


3 Comments

akann Over a year ago

Yes, this works. But what if rows other than the one with D=0 are missing?

jezrael Over a year ago

@akann - Then you get NaNs. What do you need to happen in that situation?

akann Over a year ago

Yes, NaNs are needed in that case as well.
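
To make the case from these comments concrete, here is a small check, assuming the rule stays "H comes from the D == 0 row of the group". It uses a different but equivalent technique, `Series.where` followed by `groupby(...).transform('first')`, rather than the join from the answer. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: group A=1 lacks the D=1 and D=2 rows but has D=0;
# group A=2 has D=1..3 but lacks the D=0 row.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2],
    "B": [10, 10, 10, 10, 10],
    "C": [0, 0, 0, 0, 0],
    "D": [0, 3, 1, 2, 3],
    "G": [34, 26, 3, 4, 3],
})

# Keep G only where D == 0; every other row becomes NaN.
masked = df["G"].where(df["D"].eq(0))

# 'first' returns the first non-null value per group, broadcast to all
# rows of that group; a group with no D == 0 row stays NaN.
df["H"] = masked.groupby([df["A"], df["B"], df["C"]]).transform("first")
print(df)
```

Under that assumption, missing rows with D other than 0 do not affect H: the rows that are present in group A=1 still get 34.0, while every row of group A=2 gets NaN.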
