I have read a CSV file with 8 columns using pandas read_csv. Each column may contain int, string, or float values. I want to remove the rows that contain string values and return a data frame with only numeric values in it. A sample of the CSV is attached below.
I tried running the following code:

import pandas as pd
import numpy as np  
df = pd.read_csv('new200_with_errors.csv',dtype={'Geo_Level_1' : int,'Geo_Level_2' : int,'Geo_Level_3' : int,'Product_Level_1' : int,'Product_Level_2' : int,'Product_Level_3' : int,'Total_Sale' : float})
print(df)

but I get the following error:

TypeError: unorderable types: NoneType() > int()

I am running Python 3.4.1. Here is the sample CSV.

Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,  
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5
  • I count only 5 columns. Where are Geo_Level_1..3 ? Commented Oct 27, 2014 at 8:05
  • You'll have to post raw data of your complete df, you either have to clean up the csv before or after reading it into pandas Commented Oct 27, 2014 at 8:17
  • The sample I gave at first only showed the columns that contain these errors, but the sample data consists of all 8 columns. @fredtantini Commented Oct 27, 2014 at 8:50
  • I have just edited the main sample CSV file, i.e. the complete df. @EdChum Commented Oct 27, 2014 at 9:04
  • What do you want to do with the empty date row? Commented Oct 27, 2014 at 9:45

1 Answer

The way I would approach this is to convert the columns to int using a user-defined function with a try/except that handles values which cannot be coerced to an int; these get set to NaN. Then drop the row where the date is empty. When I tested this with your data the empty value actually had a length of 1, most likely because the field holds the single space that follows the comma, so I filter on a length greater than 1; it may work for you using len 0.

In [42]:
# simple function to try to convert the type, returns NaN if the value cannot be coerced
import numpy as np
def func(x):
    try:
        return int(x)
    except ValueError:
        # NaN is not a built-in name, so use numpy's constant
        return np.nan
# assign multiple columns 
df['Pro_L_1'], df['Pro_L_3'], df['Sale'] = df['Pro_L_1'].apply(func), df['Pro_L_3'].apply(func), df['Sale'].apply(func)
# drop the 'empty' date row, take a copy() so we don't get a warning
df = df.loc[df['Date'].str.len() > 1].copy()
# convert the string to a datetime, if we didn't drop the row it would set the empty row to today's date
df['Date']= pd.to_datetime(df['Date'])
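# now convert all the columns that hold numeric values to a numeric dtype
# (convert_objects was later deprecated and removed in pandas 1.0; on current versions use pd.to_numeric with errors='coerce' per column)
df = df.convert_objects(convert_numeric=True)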
# check the dtypes
df.dtypes

Out[42]:
Geo_L_1             int64
Geo_L_2             int64
Geo_L_3             int64
Pro_L_1           float64
Pro_L_2           float64
Pro_L_3           float64
Date       datetime64[ns]
Sale              float64
dtype: object
In [43]:
# display the current situation
df
Out[43]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3       Date  Sale
0         1        2        3      129        1  5193316745 2012-01-01     9
1         1        2        3      129        1  5193316745 2013-01-01   NaN
3         1        2        3      129      NaN  5193316745 2012-01-10    10
4         1        2        3      129        1  5193316745 2013-01-10     4
5         1        2        3      NaN        1  5193316745 2014-01-10     6
6         1        2        3      129        1  5193316745 2012-01-11     4
7         1        2        3      129        1         NaN 2013-01-11     2
8         1        2        3      129        1  5193316745 2014-01-11     6
9         1        2        3      129        1  5193316745 2012-01-12   NaN
10        1        2        3      129        1  5193316745 2013-01-12     5
In [44]:
# drop the rows
df.dropna()
Out[44]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3       Date  Sale
0         1        2        3      129        1  5193316745 2012-01-01     9
4         1        2        3      129        1  5193316745 2013-01-10     4
6         1        2        3      129        1  5193316745 2012-01-11     4
8         1        2        3      129        1  5193316745 2014-01-11     6
10        1        2        3      129        1  5193316745 2013-01-12     5

For the last step, assign the result back, because dropna() returns a new data frame: df = df.dropna()
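
On current pandas versions, where convert_objects is no longer available, the same clean-up can be sketched with pd.to_numeric and errors='coerce'. This is a minimal sketch and not the code from the answer above: the sample rows are inlined from the question, the names rows, numeric_cols and clean are made up for illustration, and the month/day/year date format is an assumption based on the dates in the output shown earlier.

```python
import io
import pandas as pd

# Sample rows copied from the question, inlined so the sketch runs on its own.
rows = [
    "Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale",
    "1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9",
    "1 ,2, 3, 129, 1, 5193316745, 1/1/2013,  ",
    "1, 2, 3, 129, 1, 5193316745, , 8",
    "1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10",
    "1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4",
    "1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6",
    "1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4",
    "1, 2, 3, 129, 1, ghgj, 1/11/2013, 2",
    "1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6",
    "1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj",
    "1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5",
]

# Read every field as text so the conversion below is the only place types change.
df = pd.read_csv(io.StringIO("\n".join(rows)), dtype=str)

# Every column except Date is expected to be numeric.
numeric_cols = [c for c in df.columns if c != "Date"]
for col in numeric_cols:
    # Strip the stray spaces around values; anything non-numeric becomes NaN.
    df[col] = pd.to_numeric(df[col].str.strip(), errors="coerce")

# Assumed format: month/day/year. Blank or unparseable dates become NaT.
df["Date"] = pd.to_datetime(df["Date"].str.strip(), format="%m/%d/%Y", errors="coerce")

# Drop every row with a NaN/NaT, then restore integer dtypes for the numeric columns.
clean = df.dropna().astype({col: "int64" for col in numeric_cols})
print(clean)
```

With the sample above this keeps rows 0, 4, 6, 8 and 10, the same rows as the dropna() output in the answer. Reading everything as text first avoids the dtype argument to read_csv, which fails as soon as a column contains a value that cannot be converted.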
