
I have a large CSV file, but for simplicity I have removed many rows and columns. It looks like this:

col1        col2  col3
?           27    13000
?           27    13000
validvalue  30
#           26    14000
validvalue  25

I want to detect missing values in this CSV file. For example, missing values in col1 are indicated by ? and #, and in col3 by empty cells. Things would have been easier if the data set had empty cells for all missing values; in that case I could have used the isnull function of the pandas DataFrame. But the question is how to identify missing values in a column when they are represented by something other than an empty cell.

Approach if the CSV has a low number of records:

import pandas as pd

df = pd.read_csv('test.csv')
for e in df.columns:
    print(df[e].unique())

This gives all the unique values in each column, but I don't find it efficient.

Is there any other way to detect missing values that are denoted by special characters (such as ?, # or *) in the CSV file?

  • Use the replace function on the DataFrame to map the values in ['?', '#', '*'] to null. Now all your missing values are null (a short sketch of this idea follows the comments). Commented Jan 8, 2022 at 0:18
  • That's correct, but in that case I must know how the missing values are represented in the CSV file. It could be any other character as well ($, @, etc.). I want to detect how missing values are represented in the CSV file first; then I can replace them. Commented Jan 8, 2022 at 0:21
  • You either need to know what's garbage or what's not garbage. Commented Jan 8, 2022 at 0:22
  • Does valid data follow some pattern? Does invalid data follow some pattern? Commented Jan 8, 2022 at 0:23
  • Ok, got it. I think there is no way to find the garbage values other than using the "unique" function. Commented Jan 8, 2022 at 0:25
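
A minimal sketch of the replace idea from the first comment, assuming the markers ('?', '#', '*') are already known and the file name 'test.csv' from the question:

import numpy as np
import pandas as pd

df = pd.read_csv('test.csv')              # file name taken from the question
df = df.replace(['?', '#', '*'], np.nan)  # map the known markers to NaN
print(df.isnull().sum())                  # isnull now counts them as missing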

1 Answer


As you already stated,

there is no way to find the garbage values other than using the "unique" function.

But if the number of possible values is large, you might help yourself by using .isalnum() to limit the check to non-alphanumeric strings. For example:

import pandas as pd

df = pd.DataFrame({"col1": ['?', '?', 'validvalue', '$', 'validvalue'],
                   "col2": [27, 27, 30, 26, 25],
                   "col3": [13000, 13000, None, 14000, None]})

df[~df['col1'].str.isalnum()]['col1'].value_counts()

#Output:
#?    2
#$    1
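
The same idea can be extended to every string column at once; a small sketch, assuming the example DataFrame above:

for col in df.select_dtypes(include='object').columns:
    # keep only entries that are not purely alphanumeric
    suspicious = df.loc[~df[col].astype(str).str.isalnum(), col]
    print(col, suspicious.unique())

#Output:
#col1 ['?' '$']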

Once you have found all possible NA values, you can use mask on each column (if the missing-value markers differ from column to column) or on the whole dataset, for example:

na_values = ('?', '#')
df.mask(df.isin(na_values))
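
Once the markers are known, they can also be handled at load time; a short follow-up sketch, assuming the same 'test.csv' from the question (na_values is a standard pandas read_csv parameter):

import pandas as pd

df = pd.read_csv('test.csv', na_values=['?', '#'])  # markers become NaN on load
print(df.isnull().sum())                            # per-column count of missing values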