I want to group a dataframe by the values in its first 3 columns (A, B, C), and then add a new column H that holds, for every row of a group, the value of the last column (G) from the row where the 4th column (D) equals 0.

The original dataframe looks like this:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  0  1546430400  25   24
 8    11018  20190102  1200  1  1546434000  21    3
 9    11018  20190102  1200  2  1546437600  13    4
 10   11018  20190102  1200  3  1546441200   7    3
 11   11018  20190102  1200  4  1546444800   2    1
 12   11018  20190102  1200  5  1546448400  -3    6
 13   11018  20190102  1200  6  1546452000  -7    2
 14   11035  20190103     0  0  1546473600 -15 -14
 15   11035  20190103     0  1  1546477200 -17 -11
 16   11035  20190103     0  2  1546480800 -20 -12
 17   11035  20190103     0  3  1546484400 -23 -16
 18   11035  20190103     0  4  1546488000 -26 -11
 19   11035  20190103     0  5  1546491600 -28 -11
 20   11035  20190103     0  6  1546495200 -27 -12
 21   11031  20190103  1100  0  1546516800   0   1
 22   11031  20190103  1100  1  1546520400   4  -7
 23   11031  20190103  1100  2  1546524000   5  -6
 24   11031  20190103  1100  3  1546527600   2 -16
 25   11031  20190103  1100  4  1546531200  -3 -14
 26   11031  20190103  1100  5  1546534800  -8 -12
 27   11031  20190103  1100  6  1546538400 -12 -14
 .
 .
 .
 .

And the new dataframe should be:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  0  1546430400  25   24   24
 8    11018  20190102  1200  1  1546434000  21    3   24
 9    11018  20190102  1200  2  1546437600  13    4   24
 10   11018  20190102  1200  3  1546441200   7    3   24
 11   11018  20190102  1200  4  1546444800   2    1   24
 12   11018  20190102  1200  5  1546448400  -3    6   24
 13   11018  20190102  1200  6  1546452000  -7    2   24
 14   11035  20190103     0  0  1546473600 -15 -14   -14
 15   11035  20190103     0  1  1546477200 -17 -11   -14
 16   11035  20190103     0  2  1546480800 -20 -12   -14
 17   11035  20190103     0  3  1546484400 -23 -16   -14
 18   11035  20190103     0  4  1546488000 -26 -11   -14
 19   11035  20190103     0  5  1546491600 -28 -11   -14
 20   11035  20190103     0  6  1546495200 -27 -12   -14
 21   11031  20190103  1100  0  1546516800   0   1     1
 22   11031  20190103  1100  1  1546520400   4  -7     1
 23   11031  20190103  1100  2  1546524000   5  -6     1
 24   11031  20190103  1100  3  1546527600   2 -16     1
 25   11031  20190103  1100  4  1546531200  -3 -14     1
 26   11031  20190103  1100  5  1546534800  -8 -12     1
 27   11031  20190103  1100  6  1546538400 -12 -14     1
 .
 .
 .
 .

For the case where every group has a row with D = 0, I already have a solution:

def col_6(df):
    df['H'] = df[df['D'] == 0]['G'].values[0]
    return df

df.groupby(['A','B','C']).apply(col_6)

However, in some cases the row where the 4th column (D) equals 0 is missing. In such cases, H should be NaN for the other rows of that group (those with D = 1, 2, ...).

For example, with this original frame:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  1  1546434000  21    3
 8    11018  20190102  1200  2  1546437600  13    4
 9    11018  20190102  1200  3  1546441200   7    3
 10   11018  20190102  1200  4  1546444800   2    1
 11   11018  20190102  1200  5  1546448400  -3    6
 12   11018  20190102  1200  6  1546452000  -7    2

The final frame should then look like this:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  1  1546434000  21    3   nan
 8    11018  20190102  1200  2  1546437600  13    4   nan
 9    11018  20190102  1200  3  1546441200   7    3   nan
 10   11018  20190102  1200  4  1546444800   2    1   nan
 11   11018  20190102  1200  5  1546448400  -3    6   nan
 12   11018  20190102  1200  6  1546452000  -7    2   nan

Is there an efficient way to handle the missing rows (building on the general solution above)?

Thanks a lot for help!

  • Is it always sequences of 0 to 6, where sometimes the 0 is missing? Are there other NaN values in column D? Commented Jan 25, 2019 at 13:52
  • Yes, it is possible that also other values in column D are missing. Commented Jan 25, 2019 at 14:01
  • So this is getting quite confusing. Are the rows always ordered like in your example, so that you basically want the first row in each group (as jezrael suggested)? Maybe you first want to fill NaNs based on some rule? It might be helpful to know some of the thinking behind why you want this configuration. Commented Jan 27, 2019 at 14:04

1 Answer

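First keep only the rows where D == 0 and take the first G value per group, then add the result as a new column H with DataFrame.join. Groups without a D == 0 row have no match in the join and get NaN: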

s = (df[df['D'] == 0].groupby(['A','B','C'])['G'].first()).rename('H')
df = df.join(s, on=['A','B','C'])
print (df)
        A         B     C  D           E   F   G     H
0   11018  20190102     0  0  1546387200  37  34  34.0
1   11018  20190102     0  1  1546390800  33  36  34.0
2   11018  20190102     0  2  1546394400  19  19  34.0
3   11018  20190102     0  3  1546398000  17  26  34.0
4   11018  20190102     0  4  1546401600  16  26  34.0
5   11018  20190102     0  5  1546405200  13  23  34.0
6   11018  20190102     0  6  1546408800  11  15  34.0
7   11018  20190102  1200  1  1546434000  21   3   NaN
8   11018  20190102  1200  2  1546437600  13   4   NaN
9   11018  20190102  1200  3  1546441200   7   3   NaN
10  11018  20190102  1200  4  1546444800   2   1   NaN
11  11018  20190102  1200  5  1546448400  -3   6   NaN
12  11018  20190102  1200  6  1546452000  -7   2   NaN
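
For reference, here is a self-contained sketch of the same idea on a reduced frame. The frame below keeps only columns A, B, C, D and G and a handful of rows taken from the question, purely for illustration. The second variant (where + transform('first')) is an alternative that should give the same H without a join; it is not part of the answer above, and the names keys, joined and h_alt are made up for this example.

```python
import pandas as pd

# Reduced illustration frame (subset of the question's columns and rows).
# Group C=0 has a D == 0 row, group C=1200 does not.
df = pd.DataFrame({
    'A': [11018] * 5,
    'B': [20190102] * 5,
    'C': [0, 0, 0, 1200, 1200],
    'D': [0, 1, 2, 1, 2],
    'G': [34, 36, 19, 3, 4],
})

keys = ['A', 'B', 'C']

# Approach from the answer: one G value per group, taken from the D == 0 row,
# then joined back onto every row of that group.
s = df[df['D'] == 0].groupby(keys)['G'].first().rename('H')
joined = df.join(s, on=keys)

# Alternative sketch: replace G with NaN wherever D != 0, then broadcast the
# first non-missing value of each group onto all of its rows.
h_alt = (df['G'].where(df['D'] == 0)
         .groupby([df[k] for k in keys])
         .transform('first'))

print(joined)
print(h_alt)
```

Both variants leave H as a float column, because NaN cannot be stored in a plain integer column.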

3 Comments

Yes, this works. But what if rows other than the one with D=0 are missing?
@akann - Then you get NaNs. What do you need to happen in that situation?
Yes, NaNs are also needed in that case.
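
To make the case discussed in these comments concrete, here is a small sketch on a made-up frame (rows and values chosen only for illustration). As far as I can tell, the join approach only looks at whether a group has a D == 0 row: a group that lacks some other D value still gets H filled in, no rows are added for the absent D values, and only a group without D == 0 ends up with NaN.

```python
import pandas as pd

# Made-up frame: group C=0 has D == 0 but lacks D == 2,
# group C=1200 lacks D == 0 entirely.
df = pd.DataFrame({
    'A': [11018] * 5,
    'B': [20190102] * 5,
    'C': [0, 0, 0, 1200, 1200],
    'D': [0, 1, 3, 1, 2],
    'G': [34, 36, 26, 3, 4],
})

keys = ['A', 'B', 'C']

# Same steps as in the answer.
s = df[df['D'] == 0].groupby(keys)['G'].first().rename('H')
out = df.join(s, on=keys)

print(out)
```

If NaN is wanted for the missing D values themselves, the frame would first have to be expanded to contain those rows, which this sketch does not attempt.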
