
The data is adult income data from a census. Rows look like:

31, Private, 84154, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 38, NaN, >50K
48, Self-emp-not-inc, 265477, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K

I'm trying to remove all rows with NaNs from a DataFrame loaded from a CSV file in pandas.

>>> import pandas as pd
>>> income = pd.read_csv('income.data')
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object)
>>> income.dropna(how='any') # should drop all rows with NaNs
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object) # what??
>>> income = income.dropna(how='any') # ok, maybe reassignment will work?
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object) # what??

I tried with a smaller example.csv:

label,age,sex
1,43,M
-1,NaN,F
1,65,NaN

And dropna() worked just fine here for both categorical and numerical NaNs. What is going on? I'm new to Pandas, just learning the ropes.

  • Try assigning the line income.dropna(how='any') to a variable and check the values on that. dropna() is not inplace by default (I think the inplace option may have been added after .12). Commented Nov 18, 2013 at 17:13
  • @TomAugspurger: No, that doesn't work either. Commented Nov 18, 2013 at 17:18
  • Have you tried df.dropna(thresh=1)? More info about your data would be good. Commented Nov 18, 2013 at 17:22
  • That didn't work either, both with and without reassignment. Commented Nov 18, 2013 at 17:24
  • I just copy-pasted your data from above into a blank CSV and imported it into pandas. It looks like the "NaN" is read as a string with a leading whitespace, " NaN". Use na_values=" NaN" in the CSV import, then dropna works fine. Commented Nov 18, 2013 at 17:40

2 Answers


As I wrote in the comment: the "NaN" has a leading whitespace (at least in the data you provided), so pandas reads it as the string " NaN" rather than as a missing value. Therefore, you need to specify the na_values parameter in the read_csv function.

Try this one:

df = pd.read_csv("income.csv", header=None, na_values=" NaN")

This is also why your second example works: there is no leading whitespace in that file.
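To see the effect for yourself, here is a minimal sketch. It assumes two shortened rows in the same comma-plus-space layout as the question's data, and the column names are made up for the example (the original file has none).

```python
import io
import pandas as pd

# Two shortened rows in the same ", " layout as the question's data.
# The column names are hypothetical, chosen only for this example.
raw = (
    "31, Private, 38, NaN, >50K\n"
    "48, Self-emp-not-inc, 40, United-States, <=50K\n"
)
names = ["age", "type", "hours", "country", "income"]

# Default parsing: " NaN" (with the leading space) stays an ordinary string,
# so dropna() finds nothing to drop.
plain = pd.read_csv(io.StringIO(raw), header=None, names=names)
print(repr(plain["country"].iloc[0]))
print(len(plain.dropna()))

# Declaring " NaN" as a missing-value marker lets dropna() remove the row.
fixed = pd.read_csv(io.StringIO(raw), header=None, names=names, na_values=" NaN")
print(len(fixed.dropna()))
```

Printing the repr of a suspicious value is a quick way to spot this kind of stray whitespace.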


2 Comments

ah yep...that does it. is there a way to make pandas strip elements in CSVs? that would seem like a fairly common task (one I just expected to be built in).
No, not by default, I guess (in some cases, whitespace may be useful). But you could use pd.read_csv(StringIO(data), skipinitialspace=True) (i.e. the skipinitialspace option), or you could try using ", " or a regular expression as a custom separator.
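A rough sketch of the skipinitialspace route, again on made-up rows in the question's layout. Once the space after each comma is skipped, the field is plain "NaN", which pandas treats as missing by default.

```python
import io
import pandas as pd

# Hypothetical rows: comma followed by a space, as in the question's file.
raw = "31, Private, NaN\n48, Self-emp-not-inc, United-States\n"

# skipinitialspace=True drops the whitespace that follows each delimiter.
df = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(df[2].isna().tolist())   # which rows have a missing third field
print(repr(df[1].iloc[0]))     # string fields lose the leading space too
print(len(df.dropna()))        # rows left after dropping missing values
```

Unlike na_values=" NaN", this also cleans the other string columns, so values such as " Private" become "Private".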

Drop all rows with NaN values

df2=df.dropna()
df2=df.dropna(axis=0)

Reset index after drop

df2=df.dropna().reset_index(drop=True)

Drop rows in which all values are NaN

df2=df.dropna(how='all')

Drop rows that have NaN values in selected columns

df2=df.dropna(subset=['length','Height'])
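As a quick check of what each variant does, here is a small sketch on a made-up frame (the 'label' column and all values are invented for illustration). Note that these calls only act on real missing values; a string such as " NaN" is left alone, and each call returns a new frame rather than modifying df.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 misses 'length', row 2 is entirely missing,
# row 3 misses only 'label', row 1 is complete.
df = pd.DataFrame({
    "length": [np.nan, 2.0, np.nan, 4.0],
    "Height": [10.0, 20.0, np.nan, 40.0],
    "label": ["a", "b", None, None],
})

any_missing = df.dropna()                               # keeps only row 1
reindexed = df.dropna().reset_index(drop=True)          # same row, index restarts at 0
all_missing = df.dropna(how="all")                      # drops only row 2
by_subset = df.dropna(subset=["length", "Height"])      # keeps rows 1 and 3

print(list(any_missing.index), list(reindexed.index))
print(list(all_missing.index), list(by_subset.index))
print(len(df))  # the original frame is unchanged
```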

1 Comment

Try your code before posting your answer. Your code will not remove NaN values.
