3

I have a Pandas dataframe which contains students and percentages of marks obtained by them. There are some students whose marks are shown as greater than 100%. Obviously these values are incorrect and I would like to replace all percentage values which are greater than 100% by NaN.

I have tried some code, but I am not quite able to get the result I want.

import numpy as np
import pandas as pd

new_DF = pd.DataFrame({'Student' : ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages' : [85, 70, 101, 55, 120]})

#  Percentages  Student
#0          85       S1
#1          70       S2
#2         101       S3
#3          55       S4
#4         120       S5

new_DF[(new_DF.iloc[:, 0] > 100)] = np.NaN

#  Percentages  Student
#0        85.0       S1
#1        70.0       S2
#2         NaN      NaN
#3        55.0       S4
#4         NaN      NaN

As you can see, the code kind of works, but it replaces every value in each row where Percentages is greater than 100 with NaN. I would like to replace only the value in the Percentages column with NaN where it is greater than 100. Is there any way to do that?

4 Answers

3

Try and use np.where:

new_DF.Percentages = np.where(new_DF.Percentages.gt(100), np.nan, new_DF.Percentages)

or

new_DF.loc[new_DF.Percentages.gt(100), 'Percentages'] = np.nan

print(new_DF)

  Student  Percentages
0      S1         85.0
1      S2         70.0
2      S3          NaN
3      S4         55.0
4      S5          NaN
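
A related option, not part of the original answer, is Series.mask, which replaces values where a condition is True (the inverse of Series.where). A minimal sketch, assuming the same new_DF as in the question:

```python
import pandas as pd

new_DF = pd.DataFrame({'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages': [85, 70, 101, 55, 120]})

# mask() replaces values where the condition is True; the default replacement is NaN
new_DF['Percentages'] = new_DF['Percentages'].mask(new_DF['Percentages'] > 100)

print(new_DF)
```

Only the Percentages column is touched, so the Student column keeps all of its values.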

2 Comments

@JohnE Yes, though it also depends on the size of the df, I think? For larger dfs, shouldn't np.where be faster? BTW, uncommented now. :) Thanks
Yeah, I think you are right. Generally np.where is very fast.
2

Also,

new_DF.Percentages = new_DF.Percentages.apply(lambda x: np.nan if x > 100 else x)

or,

new_DF.Percentages = new_DF.Percentages.where(new_DF.Percentages <= 100, np.nan)
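
Note that Series.where keeps values where the condition is True and replaces the rest, so the choice between < and <= decides what happens to a mark of exactly 100. A small sketch with hypothetical data (the Series s is not from the question):

```python
import pandas as pd

# Hypothetical marks, including one that is exactly 100
s = pd.Series([85, 100, 101])

strict = s.where(s < 100)      # replaces the valid 100 as well as 101
inclusive = s.where(s <= 100)  # keeps 100, replaces only 101

print(strict.tolist())
print(inclusive.tolist())
```

Since the question asks to replace only values greater than 100, the inclusive condition is the one that matches it.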

2 Comments

This will work too. :) However, avoid apply when you can; it's slow.
Agree with @anky_91, try to avoid .apply when it's not needed.
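
To check the speed claim on your own machine, a rough sketch along these lines could be used. The data is hypothetical (100,000 random integers), and the timings will vary by hardware and pandas version, so none are quoted here:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random percentages between 0 and 150
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 151, size=100_000))

def with_apply():
    # Calls the Python lambda once per element
    return s.apply(lambda x: np.nan if x > 100 else x)

def with_where():
    # Vectorized: a single comparison over the whole Series
    return s.where(s <= 100)

# Both approaches should mark the same positions as missing
assert with_apply().astype(float).equals(with_where().astype(float))

print("apply:", timeit.timeit(with_apply, number=5))
print("where:", timeit.timeit(with_where, number=5))
```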
1

You can use .loc:

new_DF.loc[new_DF['Percentages'] > 100, 'Percentages'] = np.nan

Output:

  Student  Percentages
0      S1         85.0
1      S2         70.0
2      S3          NaN
3      S4         55.0
4      S5          NaN
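
In the output above the column has become float (85.0 rather than 85), because NaN is a floating-point value. If integer marks matter, one possible alternative, not mentioned in the answers, is pandas' nullable integer dtype "Int64" together with pd.NA. A minimal sketch, assuming the same new_DF as in the question:

```python
import pandas as pd

new_DF = pd.DataFrame({'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages': [85, 70, 101, 55, 120]})

# Opt in to the nullable integer dtype (note the capital "I")
new_DF['Percentages'] = new_DF['Percentages'].astype('Int64')

# Assign pd.NA instead of np.nan so the column stays integer-typed
new_DF.loc[new_DF['Percentages'] > 100, 'Percentages'] = pd.NA

print(new_DF)
print(new_DF['Percentages'].dtype)
```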

2 Comments

This is already in my solution (check the commented part); I'm not sure how this is any different.
Understood now :)
0
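import numpy as np
import pandas as pd

new_DF = pd.DataFrame({'Student' : ['S1', 'S2', 'S3', 'S4', 'S5'],
                      'Percentages' : [85, 70, 101, 55, 120]})

# Cast to float first so the column can hold NaN
new_DF['Percentages'] = new_DF['Percentages'].astype(float)

# Walk the column and replace any value above 100
for index, value in new_DF['Percentages'].items():
    if value > 100:
        # .loc avoids chained assignment; np.nan is a real missing value, not the string "nan"
        new_DF.loc[index, 'Percentages'] = np.nan

print(new_DF)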
