I have a dataframe consisting of two columns, Age and Salary:
Age Salary
21 25000
22 30000
22 Fresher
23 2,50,000
24 25 LPA
35 400000
45 10,00,000
How can I handle the outliers in the Salary column and replace them with an integer?
If you need to replace non-numeric values, use to_numeric with the parameter errors='coerce':
df['new'] = (pd.to_numeric(df.Salary.astype(str).str.replace(',', ''), errors='coerce')
               .fillna(0)
               .astype(int))
print(df)
Age Salary new
0 21 25000 25000
1 22 30000 30000
2 22 Fresher 0
3 23 2,50,000 250000
4 24 25 LPA 0
5 35 400000 400000
6 45 10,00,000 1000000
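For anyone who wants to run the above end to end, here is a minimal self-contained sketch. The DataFrame construction is an assumption based on the sample data in the question (plain ints mixed with strings); the variable names cleaned, coerced, and raised are only illustrative. It also shows that the default errors='raise' fails on this data, which is why errors='coerce' is used.

```python
import pandas as pd

# Sample data from the question; Salary mixes ints and strings (assumed layout).
df = pd.DataFrame({
    "Age": [21, 22, 22, 23, 24, 35, 45],
    "Salary": [25000, 30000, "Fresher", "2,50,000", "25 LPA", 400000, "10,00,000"],
})

# Cast to str first so .str works on the mixed column, then drop the commas.
cleaned = df["Salary"].astype(str).str.replace(",", "", regex=False)

# errors='coerce' turns anything unparseable ("Fresher", "25 LPA") into NaN.
coerced = pd.to_numeric(cleaned, errors="coerce")
df["new"] = coerced.fillna(0).astype(int)

# The default errors='raise' stops at the first unparseable value.
try:
    pd.to_numeric(cleaned)
    raised = False
except ValueError:
    raised = True

print(df)
print("default errors='raise' raised ValueError:", raised)
```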
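The errors parameter controls what happens to values that cannot be parsed: raise (the default) throws an error when it encounters non-numeric characters, coerce returns NaN for them, and ignore returns the original input when it can't be converted to numeric (ignore is deprecated since pandas 2.2).
The astype(str) cast is needed before .str because the column has mixed content: int and str values.
Use numpy.where to find non-digit values and replace them with '0'. Note that this also turns values such as 2,50,000 into '0', since a comma is not a digit, and that the result is a column of strings:
s = df.Salary.astype(str)
df['New'] = np.where(s.str.isdigit(), s, '0')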
The following checks each value's type with isinstance, which behaves the same across Python versions. However, I would not replace missing or inconsistent values with 0; it is better to replace them with None. But let's say you want to replace string values (outliers or inconsistent values) with 0. Note that comma-formatted numbers such as 2,50,000 are strings too, so they also become 0:
df['Salary'] = df['Salary'].apply(lambda x: 0 if isinstance(x, str) else x)
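If you would rather keep the inconsistent values as missing instead of 0, as suggested above, one possible sketch uses pandas' nullable Int64 dtype so the valid salaries stay integers. The Series below is an assumption rebuilt from the question's sample data, and the names numeric and salary_int are only illustrative.

```python
import pandas as pd

# Salary column rebuilt from the question's sample (assumed mixed int/str).
salary = pd.Series([25000, 30000, "Fresher", "2,50,000", "25 LPA", 400000, "10,00,000"])

# Strip commas, then coerce unparseable entries to NaN.
numeric = pd.to_numeric(
    salary.astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Nullable integer dtype: whole numbers stay ints, gaps become <NA>.
salary_int = numeric.astype("Int64")
print(salary_int)
```

Missing entries can then be filled later with whatever makes sense for the analysis (for example a median), rather than being fixed at 0 up front.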
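Another option is to strip every non-digit character with a regular expression. Pass regex=True explicitly, because since pandas 2.0 str.replace treats the pattern as a literal string by default:
df['col'] = df['col'].str.replace(r"[^0-9]", '', regex=True)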
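A quick sketch of what the regex approach does to the question's sample data, assuming the same mixed int/str Salary values (the names digits and as_int are illustrative). It is worth checking the result, because stripping non-digits silently keeps partial numbers: "25 LPA" becomes 25, which is probably not the intended salary.

```python
import pandas as pd

# Salary column rebuilt from the question's sample (assumed mixed int/str).
salary = pd.Series([25000, 30000, "Fresher", "2,50,000", "25 LPA", 400000, "10,00,000"])

# Cast to str so .str works on every row, then drop all non-digit characters.
digits = salary.astype(str).str.replace(r"[^0-9]", "", regex=True)

# "Fresher" is now an empty string and "25 LPA" is "25";
# coerce the empty string to NaN, then fall back to 0.
as_int = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(as_int.tolist())
```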