
I am attempting to read a csv file where some rows may be missing chunks of data.

This seems to cause a problem with the pandas read_csv function when you specify the dtype: to convert from str to whatever the dtype specifies, pandas just tries to cast the value directly. Therefore, if a value is missing, things break down.

An MWE follows (this MWE uses StringIO in place of a true file; however, the issue also happens with a real file):

import pandas as pd
import numpy as np
import io

datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")

names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.int, np.float, np.float]

dform = {name: dtypes[ind] for ind, name in enumerate(names)}

colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}

df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
                   index_col=0, names=names, na_values=' ')

The error I get when I run this is

Traceback (most recent call last):
  File "pandas/parser.pyx", line 1084, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12580)
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aliounis/Repos/stellarpy/source/mwe.py", line 15, in <module>
    index_col=0, names=names, na_values=' ')
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
  File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
  File "pandas/parser.pyx", line 1090, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12656)
ValueError: invalid literal for int() with base 10: '   '

Is there some way I can fix this? I looked through the documentation but didn't see anything that directly addresses this problem. Is this just a bug that needs to be reported to pandas?


2 Answers


Try this:

import pandas as pd
import numpy as np
import io

datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")

names  = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.str, np.float, np.float] 
dform  = {name: dtypes[ind] for ind, name in enumerate(names)}

colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}

df     = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None, na_values=' ')
df.columns = names

Edit: to convert dtypes after the import:

df["number"] = df["number"].astype('float')  # 'int' would fail on the NaN in row 2
df["data"]   = df["data"].astype('float')

Your data has a mix of blanks (str) and numbers.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
id        2 non-null object
flag      2 non-null object
number    2 non-null object
data      2 non-null object
data2     2 non-null float64
dtypes: float64(1), object(4)
memory usage: 152.0+ bytes

If you look at data, it is np.float but gets converted to object, and data2 stays np.float until it hits a blank, at which point it too turns into object.
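A sketch of the read-as-string-then-convert approach that also tolerates the blank cells, using pd.to_numeric with errors='coerce' (my suggestion, not part of the original answer; the strip-everything converters are an assumption to mirror the question's whitespace-padded fields):

```python
import io
import numpy as np
import pandas as pd

datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']

# Read every column as a stripped string first; no dtype casting at parse time.
df = pd.read_table(datfile, sep='|', header=None, names=names,
                   converters={n: str.strip for n in names})

# errors='coerce' turns unparseable cells (here, the blanks) into NaN,
# so the numeric columns come out as float64 with NaN for missing values.
for col in ['number', 'data', 'data2']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
```

Note the columns end up float64 rather than int64, since NaN cannot live in a plain integer column.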


1 Comment

While this runs without error, it essentially just reads everything in as a string, because the names field is used to query the dtypes dictionary to see what to convert the values to.

So, as Merlin pointed out, the main problem is that NaNs can't be ints, which is probably why pandas acts this way to begin with. Unfortunately I didn't have a choice, so I had to make some changes to the pandas source code myself. I ended up having to change lines 1087-1096 of the file parser.pyx to

        na_count_old = na_count
        for ind, row in enumerate(col_res):
            k = kh_get_str(na_hashset, row.strip().encode())
            if k != na_hashset.n_buckets:
                col_res[ind] = np.nan
                na_count += 1
            else:
                col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)

        if na_count_old == na_count:
            # float -> int conversions can fail the above
            # even with no nans
            col_res_orig = col_res
            col_res = col_res.astype(col_dtype)
            if (col_res != col_res_orig).any():
                raise ValueError("cannot safely convert passed user dtype of "
                                 "{col_dtype} for {col_res} dtyped data in "
                                 "column {column}".format(col_dtype=col_dtype,
                                                          col_res=col_res_orig.dtype.name,
                                                          column=i))

which essentially goes through each element of a column and checks whether it is contained in the na list (note that we have to strip the value so that multi-space fields show up as being in the na list). If it is, that element is set to a double np.nan. If it is not in the na list, it is cast to the original dtype specified for that column (which means the column can end up holding multiple dtypes).

While this isn't a perfect fix (and is likely slow), it works for my needs, and maybe someone else with a similar problem will find it useful.
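The same per-element logic can be sketched in pure Python with the converters argument, without patching the pandas source (this is my restatement of the idea, not the original Cython patch; the helper names are my own, and like the patch it leaves the number column with mixed dtypes):

```python
import io
import numpy as np
import pandas as pd

def int_or_nan(s):
    # Strip first so multi-space fields count as missing, mirroring the patch.
    s = s.strip()
    return np.nan if s == '' else int(s)

def float_or_nan(s):
    s = s.strip()
    return np.nan if s == '' else float(s)

datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']

# Each converter decides per cell: NaN for blanks, otherwise the target type.
df = pd.read_table(datfile, sep='|', header=None, names=names, index_col=0,
                   converters={'id': str.strip, 'flag': str.strip,
                               'number': int_or_nan,
                               'data': float_or_nan, 'data2': float_or_nan})
```

This runs as plain Python per cell, so it will be slower than the Cython loop in the patch, but it survives pandas upgrades.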

2 Comments

To each his own. I have been using pandas since 0.6 and I haven't changed the source code. I would recommend not doing this too often: pandas is an evolving beast, and you will find yourself modifying code on upgrades anyway, and then tracking all the places you modified the source code to boot. But I have seen this done before. Good luck.
@Merlin After a weekend away from this I think I understand what your answer was: just read the thing in as a text file and then convert to the required dtype. I agree with you about not changing source code (in fact, this is the first time I have ever modified another person's source code). The only benefit I see to doing it this way over what you proposed is that this is a Cython implementation, so the loop should be faster. If you modify your answer to show how to do the dtype conversion after reading in the text, I will accept your answer instead, as that is more manageable.
