2

I have a column named "age" with a few NaN; crude logic of deriving the value of the age is finding the mean of age using 2 key categorical variables - job, gender

df = pd.DataFrame([[1,2,1,2,3,4,11,12,13,12,11,1,10], [19,23,np.nan,29,np.nan,32,27,48,39,70,29,51,np.nan], 
            ['a','b','c','d','e','a','b','c','d','e','a','b','c'],['M','F','M','F','M','F','M','F','M','M','F','F','F']]).T
df.columns = ['col1','age','job','gender']

df = df.astype({"col1": int, "age": float})
df['job'] = df.job.astype('category')
df['gender'] = df.gender.astype('category')
df

col1    age job gender
0   1   19.0    a   M
1   2   23.0    b   F
2   1   NaN     c   M
3   2   29.0    d   F
4   3   NaN     e   M
5   4   32.0    a   F
6   11  27.0    b   M
7   12  48.0    c   F
8   13  39.0    d   M
9   12  70.0    e   M
10  11  29.0    a   F
11  1   51.0    b   F
12  10  NaN     c   M

df.groupby(['job','gender']).mean().reset_index()

   job  gender  col1    age
0   a   F   7.500000    30.5
1   a   M   1.000000    19.0
2   b   F   1.500000    37.0
3   b   M   11.000000   27.0
4   c   F   NaN NaN
5   c   M   7.666667    48.0
6   d   F   7.500000    34.0
7   d   M   NaN NaN
8   e   F   NaN NaN
9   e   M   7.500000    70.0

I want to update the age to the derived value from above. What is the optimal way of doing it? Should I store it in another dataframe and loop it through for updation?

Resultant output should look like this:

col1    age job gender
0   1   19.0    a   M
1   2   23.0    b   F
2   1   48.0    c   M
3   2   29.0    d   F
4   3   70.0    e   M
5   4   32.0    a   F
6   11  27.0    b   M
7   12  48.0    c   F
8   13  39.0    d   M
9   12  70.0    e   M
10  11  29.0    a   F
11  1   51.0    b   F
12  10  70.0    c   M

Thanks.

1 Answer 1

2

Use Series.fillna with GroupBy.transform, but because in sample data are not data for combination c, M there is NaN:

df['age'] = df['age'].fillna(df.groupby(['job','gender'])['age'].transform('mean'))
print (df)
    col1   age job gender
0      1  19.0   a      M
1      2  23.0   b      F
2      1   NaN   c      M
3      2  29.0   d      F
4      3  70.0   e      M
5      4  32.0   a      F
6     11  27.0   b      M
7     12  48.0   c      F
8     13  39.0   d      M
9     12  70.0   e      M
10    11  29.0   a      F
11     1  51.0   b      F
12    10  48.0   c      F

If need also replace NaN by groiping only by id add another fillna:

avg1 = df.groupby(['job','gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')

df['age'] = df['age'].fillna(avg1).fillna(avg2)
print (df)
    col1   age job gender
0      1  19.0   a      M
1      2  23.0   b      F
2      1  48.0   c      M
3      2  29.0   d      F
4      3  70.0   e      M
5      4  32.0   a      F
6     11  27.0   b      M
7     12  48.0   c      F
8     13  39.0   d      M
9     12  70.0   e      M
10    11  29.0   a      F
11     1  51.0   b      F
12    10  48.0   c      F
Sign up to request clarification or add additional context in comments.

1 Comment

The dataset actually has 1000's of records and hence there is a good possibility of finding the age. In case age is not available then we will have to work on a different logic but for now, this works - thanks!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.