
I have a large CSV file, but for simplicity I have removed many rows and columns. It looks like this:

col1        col2  col3
?           27    13000
?           27    13000
validvalue  30
#           26    14000
validvalue  25

I want to detect missing values in this CSV file. For example, missing values in col1 are indicated by ? and #, and in col3 by empty cells. Things would have been easier if the data set had empty cells for all missing values; in that case I could have used the isnull function of the pandas DataFrame. But the question is how to identify missing values in a column when they are represented by something other than an empty cell.

Approach if the CSV has a low number of records:

import pandas as pd

df = pd.read_csv('test.csv')
for e in df.columns:
    print(df[e].unique())

This gives all the unique values in each column, but I don't find it efficient.

Is there any other way to detect missing values that are denoted by special characters (such as ?, # or *) in the CSV file?

  • Use the replace function on the DataFrame to map the values in ['?', '#', '*'] to null. Now all your missing values are null (a short sketch of this idea follows the comments). Commented Jan 8, 2022 at 0:18
  • That's correct, but in that case I must know how the missing values are represented in the CSV file. It could be any other character as well ($, @, etc.). I want to detect how missing values are represented in the CSV file first; then I can replace them. Commented Jan 8, 2022 at 0:21
  • You either need to know what's garbage or what's not garbage. Commented Jan 8, 2022 at 0:22
  • Does valid data follow some pattern? Does invalid data follow some pattern? Commented Jan 8, 2022 at 0:23
  • Ok, got it. I think there is no way to find the garbage values other than using the "unique" function. Commented Jan 8, 2022 at 0:25
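
A minimal sketch of the replace idea from the first comment, assuming the markers ('?', '#', '*') are already known and the file name 'test.csv' from the question:

import numpy as np
import pandas as pd

df = pd.read_csv('test.csv')              # file name taken from the question
df = df.replace(['?', '#', '*'], np.nan)  # map the known markers to NaN
print(df.isnull().sum())                  # isnull now counts them as missing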

1 Answer


As you already stated,

there is no way to find the garbage values other than using the "unique" function.

But if the number of possible values is large, you might help yourself by using .isalnum() to limit the check to non-alphanumeric strings. For example:

import pandas as pd

df = pd.DataFrame({"col1": ['?', '?', 'validvalue', '$', 'validvalue'],
                   "col2": [27, 27, 30, 26, 25],
                   "col3": [13000, 13000, None, 14000, None]})

df[~df['col1'].str.isalnum()]['col1'].value_counts()

#Output:
#?    2
#$    1
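
The same idea can be extended to every string column at once; a small sketch, assuming the example DataFrame above:

for col in df.select_dtypes(include='object').columns:
    # keep only entries that are not purely alphanumeric
    suspicious = df.loc[~df[col].astype(str).str.isalnum(), col]
    print(col, suspicious.unique())

#Output:
#col1 ['?' '$']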

Once you have found all possible NA values, you can use mask on each column (if the missing-value markers differ from column to column) or on the whole dataset, for example:

na_values = ('?', '#')
df.mask(df.isin(na_values))
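
Once the markers are known, they can also be handled at load time; a short follow-up sketch, assuming the same 'test.csv' from the question (na_values is a standard pandas read_csv parameter):

import pandas as pd

df = pd.read_csv('test.csv', na_values=['?', '#'])  # markers become NaN on load
print(df.isnull().sum())                            # per-column count of missing values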