I have a CSV file that looks like this:

"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row2","",6
"Row3","5",7
"Row4","5",8
"Row5",,9
"Row6","nan",
"Row7","nan",
"Row8","nan",0
"Row9","nan",3
"Row10","nan",

All quoted entries are strings. Non-quoted entries are numerical. Empty fields are missing values (NaN), but quoted empty fields should still be treated as empty strings. I tried to read the file with pandas read_csv, but I cannot get it to work the way I want: it still treats both ,"", and ,, as NaN, which is wrong for the first one.

d = pd.read_csv(csv_filename, sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)

Can anybody help? Is it possible at all?

3 Answers

You can try numpy.genfromtxt and specify the missing_values parameter:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

1 Comment

Can you help me with that? I tried: d = np.genfromtxt('test.csv', delimiter = ',', missing_values = [], names = True, dtype=[('row_ID', np.dtype(str)), ('label', np.dtype(str)), ('val', np.dtype(float))]) but it returns empty strings for all (!) string column values. I don't know what is wrong...

Maybe something like:

import pandas as pd
import csv
import numpy as np
d = pd.read_csv('test.txt', sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)
mask = d['label'] == 'nan'
# Use .loc for the assignment to avoid chained indexing, which may write to a copy
d.loc[mask, 'label'] = np.nan

1 Comment

But I want to keep 'nan' and '' as strings, not as missing values.

I found a way to get it more or less working. I just don't know why I need to specify dtype=type(None) for it to work... Comments on this piece of code are very welcome!

import re
import pandas as pd
import numpy as np

# Strip the surrounding quotes from a field; a field without a closing quote yields NaN
def filterTheField(s):
    m = re.match(r'^"?(.*)?"$', s.strip())
    if m:
        return m.group(1)
    else:
        return np.nan

file = 'test.csv'

y = np.genfromtxt(file, delimiter = ',', filling_values = np.nan, names = True, dtype = type(None), converters = {'row_ID': filterTheField, 'label': filterTheField,'val': float})

d = pd.DataFrame(y)

print(d)
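
For comparison, here is a minimal sketch of the same idea without genfromtxt: split each line by hand and remember whether each field was quoted. The helper names (split_fields, convert, parse_csv) and the shortened SAMPLE data are made up for illustration, and the sketch assumes comma-separated fields with no line breaks inside quoted values.

```python
import math

# Shortened sample in the same layout as the file from the question.
SAMPLE = '''"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row5",,9
"Row6","nan",
'''

def split_fields(line):
    """Split one CSV line into (text, was_quoted) pairs."""
    fields, buf = [], []
    quoted = in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"' and line[i + 1:i + 2] == '"':
                buf.append('"')  # doubled quote inside a quoted field
                i += 1
            elif ch == '"':
                in_quotes = False
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = quoted = True
        elif ch == ',':
            fields.append((''.join(buf), quoted))
            buf, quoted = [], False
        else:
            buf.append(ch)
        i += 1
    fields.append((''.join(buf), quoted))
    return fields

def convert(text, was_quoted):
    """Quoted -> str (even if empty); bare empty -> NaN; bare non-empty -> float."""
    if was_quoted:
        return text
    text = text.strip()
    return float(text) if text else math.nan

def parse_csv(text):
    """Return (header, rows) for CSV text whose first line is the header."""
    lines = text.splitlines()
    header = [name for name, _ in split_fields(lines[0])]
    rows = [[convert(t, q) for t, q in split_fields(line)]
            for line in lines[1:] if line]
    return header, rows

header, rows = parse_csv(SAMPLE)
for row in rows:
    print(row)
```

If a DataFrame is needed, the result can be passed on with pd.DataFrame(rows, columns=header). The label column then holds the strings '5', '' and 'nan' next to a real NaN for the unquoted empty field.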
