Updating column in a dataframe based on multiple columns

Question

I have a column named "age" with a few NaN; crude logic of deriving the value of the age is finding the mean of age using 2 key categorical variables - job, gender

df = pd.DataFrame([[1,2,1,2,3,4,11,12,13,12,11,1,10], [19,23,np.nan,29,np.nan,32,27,48,39,70,29,51,np.nan], 
            ['a','b','c','d','e','a','b','c','d','e','a','b','c'],['M','F','M','F','M','F','M','F','M','M','F','F','F']]).T
df.columns = ['col1','age','job','gender']

df = df.astype({"col1": int, "age": float})
df['job'] = df.job.astype('category')
df['gender'] = df.gender.astype('category')
df

col1    age job gender
0   1   19.0    a   M
1   2   23.0    b   F
2   1   NaN     c   M
3   2   29.0    d   F
4   3   NaN     e   M
5   4   32.0    a   F
6   11  27.0    b   M
7   12  48.0    c   F
8   13  39.0    d   M
9   12  70.0    e   M
10  11  29.0    a   F
11  1   51.0    b   F
12  10  NaN     c   M

df.groupby(['job','gender']).mean().reset_index()

   job  gender  col1    age
0   a   F   7.500000    30.5
1   a   M   1.000000    19.0
2   b   F   1.500000    37.0
3   b   M   11.000000   27.0
4   c   F   NaN NaN
5   c   M   7.666667    48.0
6   d   F   7.500000    34.0
7   d   M   NaN NaN
8   e   F   NaN NaN
9   e   M   7.500000    70.0

I want to update the age to the derived value from above. What is the optimal way of doing it? Should I store it in another dataframe and loop it through for updation?

Resultant output should look like this:

col1    age job gender
0   1   19.0    a   M
1   2   23.0    b   F
2   1   48.0    c   M
3   2   29.0    d   F
4   3   70.0    e   M
5   4   32.0    a   F
6   11  27.0    b   M
7   12  48.0    c   F
8   13  39.0    d   M
9   12  70.0    e   M
10  11  29.0    a   F
11  1   51.0    b   F
12  10  70.0    c   M

Thanks.

jezrael · Accepted Answer · 2021-06-08 08:16:35Z

2

Use Series.fillna with GroupBy.transform, but because in sample data are not data for combination c, M there is NaN:

df['age'] = df['age'].fillna(df.groupby(['job','gender'])['age'].transform('mean'))
print (df)
    col1   age job gender
0      1  19.0   a      M
1      2  23.0   b      F
2      1   NaN   c      M
3      2  29.0   d      F
4      3  70.0   e      M
5      4  32.0   a      F
6     11  27.0   b      M
7     12  48.0   c      F
8     13  39.0   d      M
9     12  70.0   e      M
10    11  29.0   a      F
11     1  51.0   b      F
12    10  48.0   c      F

If need also replace NaN by groiping only by id add another fillna:

avg1 = df.groupby(['job','gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')

df['age'] = df['age'].fillna(avg1).fillna(avg2)
print (df)
    col1   age job gender
0      1  19.0   a      M
1      2  23.0   b      F
2      1  48.0   c      M
3      2  29.0   d      F
4      3  70.0   e      M
5      4  32.0   a      F
6     11  27.0   b      M
7     12  48.0   c      F
8     13  39.0   d      M
9     12  70.0   e      M
10    11  29.0   a      F
11     1  51.0   b      F
12    10  48.0   c      F

answered Jun 8, 2021 at 8:16

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

kkulkarn Over a year ago

The dataset actually has 1000's of records and hence there is a good possibility of finding the age. In case age is not available then we will have to work on a different logic but for now, this works - thanks!

Collectives™ on Stack Overflow

Updating column in a dataframe based on multiple columns

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related