
I have a DataFrame with 3 columns:

FLAG CLASS   CATEGORY
yes 'Sci'   'Alpha'
yes 'Sci'   'undefined'
yes 'math'  'Beta'
yes 'math'  'undefined'
yes 'eng'   'Gamma'
yes 'math'  'Beta'
yes 'eng'   'Gamma'
yes 'eng'   'Omega'
yes 'eng'   'Omega'
yes 'eng'   'undefined'
yes 'Geog'  'Lambda'
yes 'Art'   'undefined'
yes 'Art'   'undefined'
yes 'Art'   'undefined'

I want to fill the 'undefined' values in the CATEGORY column with the other category value (if any) that the class has. E.g. the 'Sci' class will fill its 'undefined' category with 'Alpha', and the 'math' class will fill its 'undefined' category with 'Beta'.

If a class has 2 or more distinct categories, leave it as is. E.g. the 'eng' class has two categories, 'Gamma' and 'Omega', so the 'undefined' category for 'eng' stays 'undefined'.

If all the categories for a class are 'undefined', leave as 'undefined'.

Result

FLAG CLASS   CATEGORY
yes 'Sci'   'Alpha'
yes 'Sci'   'Alpha'
yes 'math'  'Beta'
yes 'math'  'Beta'
yes 'eng'   'Gamma'
yes 'math'  'Beta'
yes 'eng'   'Gamma'
yes 'eng'   'Omega'
yes 'eng'   'Omega'
yes 'eng'   'undefined'
yes 'Geog'  'Lambda'
yes 'Art'   'undefined'
yes 'Art'   'undefined'
yes 'Art'   'undefined'

This needs to generalize: I have many classes in the CLASS column and cannot hard-code values such as 'Sci' or 'eng'.

I have tried this with several chained np.where calls but had no luck.

4 Answers

2

I would use ffill and bfill within a groupby:

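# Treat 'undefined' as missing (NaN)
s = df['CATEGORY'].mask(df['CATEGORY'].eq('undefined'))
# Number of distinct defined categories per class, repeated on every row
s2 = s.groupby(df['CLASS']).transform('nunique')
# Fill only classes with exactly one defined category; transform keeps the original index
df.loc[s2.eq(1) & s.isna(), 'CATEGORY'] = s.groupby(df['CLASS']).transform(lambda x: x.ffill().bfill())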
df
Out[388]: 
   FLAG CLASS   CATEGORY
0   yes   Sci      Alpha
1   yes   Sci      Alpha
2   yes  math       Beta
3   yes  math       Beta
4   yes   eng      Gamma
5   yes  math       Beta
6   yes   eng      Gamma
7   yes   eng      Omega
8   yes   eng      Omega
9   yes   eng  undefined
10  yes  Geog     Lambda
11  yes   Art  undefined
12  yes   Art  undefined
13  yes   Art  undefined
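
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The sample frame is rebuilt from the question; the function name `fill_single_category` and its `placeholder` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "FLAG": ["yes"] * 14,
    "CLASS": ["Sci", "Sci", "math", "math", "eng", "math", "eng",
              "eng", "eng", "eng", "Geog", "Art", "Art", "Art"],
    "CATEGORY": ["Alpha", "undefined", "Beta", "undefined", "Gamma", "Beta",
                 "Gamma", "Omega", "Omega", "undefined", "Lambda",
                 "undefined", "undefined", "undefined"],
})


def fill_single_category(frame, placeholder="undefined"):
    """Hypothetical helper: fill placeholders only in classes that have
    exactly one distinct defined CATEGORY. Returns a new frame."""
    out = frame.copy()
    # Treat the placeholder as missing
    s = out["CATEGORY"].mask(out["CATEGORY"].eq(placeholder))
    # Distinct defined categories per class, broadcast to every row
    n_defined = s.groupby(out["CLASS"]).transform("nunique")
    # Per-class forward then backward fill, aligned on the original index
    filled = s.groupby(out["CLASS"]).transform(lambda x: x.ffill().bfill())
    rows = n_defined.eq(1) & s.isna()
    out.loc[rows, "CATEGORY"] = filled[rows]
    return out


result = fill_single_category(df)
print(result)
```

Because the class names never appear in the function, it should carry over to any number of classes; only the column names and the placeholder string are assumed.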

3 Comments

Would you change your code to be agnostic to the class values? I have many classes and cannot afford encoding the classes.
@Kaisar In WenYoBen's answer 'eng' gets filled, but your expected output leaves it as 'undefined'. Which one do you want?
@AndyL. Thank you for noticing. The expected output in the post is the one to follow.
1

Try below:

# Assumes `import numpy as np`. Fill only classes with exactly one defined category (nunique ignores NaN)
df['CATEGORY'] = df['CATEGORY'].replace('undefined', np.nan).groupby(df['CLASS']).transform(lambda x: x.fillna(x.mode()[0]) if x.nunique() == 1 else x).fillna('undefined')


1

Edit:
I added another solution that uses isin to restrict the update to the qualifying classes (both their defined and 'undefined' rows), and then updates exactly that slice of df.

Steps:
Build m, the CLASS values that have exactly one distinct defined (not 'undefined') CATEGORY. Use isin to select the rows of those classes and where to turn 'undefined' into NaN. Finally, group these rows by CLASS, ffill and bfill within each group to fill the NaN, and assign the result back to df.

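# CLASS values with exactly one distinct defined CATEGORY
m = df.query('CATEGORY != "undefined"').drop_duplicates(['CLASS', 'CATEGORY']).CLASS.drop_duplicates(keep=False)
# For rows of those classes, turn 'undefined' into NaN and fill within each CLASS (both directions)
df.loc[df.CLASS.isin(m), 'CATEGORY'] = df.CATEGORY.where(df.CATEGORY != 'undefined').groupby(df.CLASS).transform(lambda x: x.ffill().bfill())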

This solution looks cleaner, but I don't know whether it is slower than the original solution, since it uses groupby.
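
One way to settle the speed question is to time both approaches on synthetic data. The sketch below is only a rough harness under stated assumptions: the functions `fill_groupby` and `fill_map` are illustrative re-implementations of the groupby-based and map-based ideas, not the exact code in this answer, and the synthetic frame (10,000 rows, 1,000 classes, about 30% 'undefined') is invented for the test. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def fill_groupby(df):
    """Groupby-based fill: ffill/bfill within classes that have one defined category."""
    out = df.copy()
    s = out["CATEGORY"].where(out["CATEGORY"] != "undefined")
    single = s.groupby(out["CLASS"]).transform("nunique").eq(1)
    filled = s.groupby(out["CLASS"]).transform(lambda x: x.ffill().bfill())
    rows = single & s.isna()
    out.loc[rows, "CATEGORY"] = filled[rows]
    return out


def fill_map(df):
    """Map-based fill: look up each class's single defined category."""
    out = df.copy()
    defined = out.loc[out["CATEGORY"] != "undefined", ["CLASS", "CATEGORY"]].drop_duplicates()
    # keep=False drops every class that has more than one distinct category
    lookup = defined.drop_duplicates("CLASS", keep=False).set_index("CLASS")["CATEGORY"]
    undefined = out["CATEGORY"] == "undefined"
    replacement = out.loc[undefined, "CLASS"].map(lookup).dropna()
    out.loc[replacement.index, "CATEGORY"] = replacement
    return out


# Synthetic data (an assumption for benchmarking, not from the question)
rng = np.random.default_rng(0)
n_rows, n_classes = 10_000, 1_000
cls = rng.integers(0, n_classes, n_rows)
# Classes divisible by 3 may carry two categories, all others exactly one
variant = np.where(cls % 3 == 0, rng.integers(0, 2, n_rows), 0)
cat = pd.Series(cls).astype(str) + "_" + pd.Series(variant).astype(str)
cat = cat.mask(rng.random(n_rows) < 0.3, "undefined")
big = pd.DataFrame({"FLAG": "yes", "CLASS": "c" + pd.Series(cls).astype(str), "CATEGORY": cat})

# Both approaches should agree before their timings are compared
same = fill_groupby(big).equals(fill_map(big))
print("results match:", same)

for fn in (fill_groupby, fill_map):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds / 3:.3f} s per call")
```

The per-group lambda in `fill_groupby` runs once per class, so its cost should grow with the number of classes, while `fill_map` stays vectorized; the harness lets you check that on your own data.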


Original:
My solution fills each 'undefined' by mapping its CLASS to that class's single defined CATEGORY:

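# CLASS values with exactly one distinct defined CATEGORY (the index points at the defining row)
m = df.query('CATEGORY != "undefined"').drop_duplicates(['CLASS', 'CATEGORY']).CLASS.drop_duplicates(keep=False)
# Map the CLASS of each 'undefined' row to that category (NaN if the class has none or several)
t = df.query('CATEGORY == "undefined"').CLASS.map(df.loc[m.index].set_index('CLASS').CATEGORY)
# Overwrite only where a replacement exists; assigning back avoids relying on an in-place update
df['CATEGORY'] = t.combine_first(df['CATEGORY'])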

Out[553]:
   FLAG CLASS   CATEGORY
0   yes   Sci      Alpha
1   yes   Sci      Alpha
2   yes  math       Beta
3   yes  math       Beta
4   yes   eng      Gamma
5   yes  math       Beta
6   yes   eng      Gamma
7   yes   eng      Omega
8   yes   eng      Omega
9   yes   eng  undefined
10  yes  Geog     Lambda
11  yes   Art  undefined
12  yes   Art  undefined
13  yes   Art  undefined


0

You can do this using boolean indexing:

df.loc[(df['CLASS'] == 'Sci') & (df['CATEGORY'] == 'undefined'), 'CATEGORY'] = 'Alpha'
df.loc[(df['CLASS'] == 'math') & (df['CATEGORY'] == 'undefined'), 'CATEGORY'] = 'Beta'

2 Comments

This is a vanilla example. I have thousands of class values in the CLASS column, so I doubt I can write a line of code per class.
I'd appreciate it if you could rewrite the solution to be more general.
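
A generalized form of the boolean-indexing idea might look like the sketch below: instead of typing one line per class, the class-to-category pairs are derived from the data and the same `.loc` assignment is generated in a loop. The names `defined`, `counts`, `single`, and `lookup` are illustrative, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "FLAG": ["yes"] * 14,
    "CLASS": ["Sci", "Sci", "math", "math", "eng", "math", "eng",
              "eng", "eng", "eng", "Geog", "Art", "Art", "Art"],
    "CATEGORY": ["Alpha", "undefined", "Beta", "undefined", "Gamma", "Beta",
                 "Gamma", "Omega", "Omega", "undefined", "Lambda",
                 "undefined", "undefined", "undefined"],
})

# Rows whose category is actually defined
defined = df[df["CATEGORY"] != "undefined"]
# Number of distinct defined categories per class
counts = defined.groupby("CLASS")["CATEGORY"].nunique()
# Classes with exactly one defined category, and that category
single = counts[counts == 1].index
lookup = defined[defined["CLASS"].isin(single)].groupby("CLASS")["CATEGORY"].first()

# Same boolean indexing as above, generated per class instead of written by hand
for cls, cat in lookup.items():
    df.loc[(df["CLASS"] == cls) & (df["CATEGORY"] == "undefined"), "CATEGORY"] = cat

print(df)
```

The loop scans the frame once per qualifying class, so with thousands of classes a vectorized lookup such as mapping CLASS through `lookup` is likely to scale better; the loop is kept here only to stay close to the original answer.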
