I am working with a huge DataFrame that has hundreds of columns, any of which may contain missing values. Here is a sample:
import pandas as pd
import numpy as np
data = {
    'a': [1, 1, 0, 1, 1],
    'b': ['a', 'b', np.nan, 'c', np.nan],
    'c': ['b1', 'b2', np.nan, 'c1', np.nan],
    'd': [1, 1, 1, 2, np.nan],
    'e': [4, 4, 4, 3, np.nan],
}
df = pd.DataFrame(data)
print(df)
   a    b    c    d    e
0  1    a   b1  1.0  4.0
1  1    b   b2  1.0  4.0
2  0  NaN  NaN  1.0  4.0
3  1    c   c1  2.0  3.0
4  1  NaN  NaN  NaN  NaN
To handle the missing values in one pass, I am doing something like the following: if a missing value is in column a, b, or c, I replace it with a specific value for that column.
df = df.fillna({'a': 0, 'b': 'other', 'c': -1})
print(df)
   a      b   c    d    e
0  1      a  b1  1.0  4.0
1  1      b  b2  1.0  4.0
2  0  other  -1  1.0  4.0
3  1      c  c1  2.0  3.0
4  1  other  -1  NaN  NaN
What I would like to do next: if a missing value is in any column other than those three, replace it with the value that appears most often in that column. For example, in column d the value 1 is the most frequent, so the missing value there should become 1.0.
And what should happen when a column has a tie, e.g. [1, 2, 1, 2, NaN]?
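One possible way to express the "most frequent value" fill is sketched below. It is not necessarily the best fit for a very wide frame, and it rests on a few assumptions: the names `special`, `rest`, `tie` and `tie_filled` are made up for illustration, and "most frequent" is taken to mean the first row of `DataFrame.mode()`, which on a tie is the smallest of the tied values. The step only touches the columns outside a, b and c, so it should be usable before or after the `fillna` call with the dictionary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 0, 1, 1],
    'b': ['a', 'b', np.nan, 'c', np.nan],
    'c': ['b1', 'b2', np.nan, 'c1', np.nan],
    'd': [1, 1, 1, 2, np.nan],
    'e': [4, 4, 4, 3, np.nan],
})

# Columns that get hand-picked fill values (handled separately).
special = ['a', 'b', 'c']

# Every other column is filled with its own most frequent value.
rest = df.columns.difference(special)

# mode() ignores NaN by default and returns one row per mode, sorted ascending;
# iloc[0] keeps the first mode of each column as a Series indexed by column name.
# Assumption: at least one of these columns has a non-missing value,
# otherwise mode() is empty and iloc[0] raises IndexError.
df[rest] = df[rest].fillna(df[rest].mode().iloc[0])
print(df)

# Tie case: both 1 and 2 appear twice, so mode() returns [1.0, 2.0]
# and iloc[0] picks 1.0 (the smaller value).
tie = pd.Series([1, 2, 1, 2, np.nan])
tie_filled = tie.fillna(tie.mode().iloc[0])
print(tie_filled.tolist())
```

If a different tie rule is wanted (for example the largest tied value), `iloc[-1]` on a single column's `mode()` would select that instead. Applied to the whole-frame `mode()` it would not work as directly, because columns with fewer modes are padded with NaN in the later rows.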