read csv-data with missing values into python using pandas

Question

I have a CSV-file looking like this:

"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row2","",6
"Row3","5",7
"Row4","5",8
"Row5",,9
"Row6","nan",
"Row7","nan",
"Row8","nan",0
"Row9","nan",3
"Row10","nan",

All quoted entries are strings. Non-quoted entries are numerical. Empty fields are missing values (NaN), but quoted empty fields should still be treated as empty strings. I tried to read the file with pandas read_csv, but I cannot get it to work the way I want: it still treats both ,"", and ,, as NaN, which is wrong for the first one.

d = pd.read_csv(csv_filename, sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)

Can anybody help? Is it possible at all?

AnandViswanathan89 · Accepted Answer · 2014-12-01 14:57:36Z

1

You can try numpy.genfromtxt and specify its missing_values parameter:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

edited Dec 1, 2014 at 14:57

answered Dec 1, 2014 at 14:16


1 Comment

Antje Janosch Over a year ago

Can you help me with that? I tried: d = np.genfromtxt('test.csv', delimiter = ',', missing_values = [], names = True, dtype=[('row_ID', np.dtype(str)), ('label', np.dtype(str)), ('val', np.dtype(float))]) but it returns empty strings for all (!) string column values. I don't know what is wrong...

Moritz · Accepted Answer · 2014-12-01 13:59:54Z

0

Maybe something like:

import pandas as pd
import csv
import numpy as np
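# Read the file with the same options as in the question
d = pd.read_csv('test.txt', sep=',', keep_default_na=False, na_values=[''], quoting=csv.QUOTE_NONNUMERIC)
# Select the rows whose label is the literal string 'nan' ...
mask = d['label'] == 'nan'
# ... and set them to NaN via .loc, which avoids chained assignment
d.loc[mask, 'label'] = np.nan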

answered Dec 1, 2014 at 13:59


1 Comment

Antje Janosch Over a year ago

But I want to keep 'nan' and '' as strings, not as missing values.

Antje Janosch · Accepted Answer · 2014-12-03 09:29:17Z

0

I found a way to get it more or less working. I just don't know why I need to specify dtype=type(None) to make it work... Comments on this piece of code are very welcome!

import re
import pandas as pd
import numpy as np

# clear quoting characters
def filterTheField(s):
    m = re.match(r'^"?(.*)?"$', s.strip())
    if m:
        return m.group(1)
    else:
        return np.nan

file = 'test.csv'

y = np.genfromtxt(file, delimiter = ',', filling_values = np.nan, names = True, dtype = type(None), converters = {'row_ID': filterTheField, 'label': filterTheField,'val': float})

d = pd.DataFrame(y)

print(d)
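
For comparison, here is a minimal sketch of the same idea without genfromtxt. It uses only the standard library and records, for every field, whether it was quoted. This is not a pandas or NumPy feature: split_fields, convert and read_rows are hypothetical helper names made up for this illustration, and the sample data is a shortened copy of the file from the question. It assumes that no field contains a line break.

```python
import io
import math

# Shortened copy of the CSV from the question (assumption: no embedded newlines).
SAMPLE = '''"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row5",,9
"Row6","nan",
'''

def split_fields(line):
    """Split one CSV line into (text, was_quoted) pairs."""
    fields, buf = [], []
    in_quotes = False
    was_quoted = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')  # doubled quote inside a quoted field
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True
            was_quoted = True
        elif ch == ',':
            fields.append((''.join(buf), was_quoted))
            buf, was_quoted = [], False
        else:
            buf.append(ch)
        i += 1
    fields.append((''.join(buf), was_quoted))
    return fields

def convert(text, was_quoted):
    """Quoted -> string (even if empty); unquoted empty -> NaN; else float."""
    if was_quoted:
        return text
    if text.strip() == '':
        return float('nan')
    return float(text)

def read_rows(stream):
    """Return a list of dicts, one per data line, keyed by the header names."""
    lines = [ln.rstrip('\r\n') for ln in stream if ln.strip()]
    header = [text for text, _ in split_fields(lines[0])]
    return [
        dict(zip(header, (convert(t, q) for t, q in split_fields(ln))))
        for ln in lines[1:]
    ]

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) if a DataFrame is needed. The label column then holds a mix of strings and NaN, so pandas will store it with object dtype.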

answered Dec 3, 2014 at 9:29

