I am attempting to read a CSV file where some rows may be missing chunks of data.
This seems to cause a problem with the pandas read_csv function when a dtype is specified. The problem appears to be that, to convert from str to whatever the dtype specifies, pandas just tries to cast the value directly, so if a field is missing the cast fails.
An MWE follows (it uses StringIO in place of a real file; however, the issue also occurs with a real file):
import pandas as pd
import numpy as np
import io
datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [str, str, int, float, float]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
                   index_col=0, names=names, na_values=' ')
The error I get when I run this is:
Traceback (most recent call last):
  File "pandas/parser.pyx", line 1084, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12580)
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aliounis/Repos/stellarpy/source/mwe.py", line 15, in <module>
    index_col=0, names=names, na_values=' ')
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
  File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
  File "pandas/parser.pyx", line 1090, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12656)
ValueError: invalid literal for int() with base 10: ' '
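The final ValueError can be reproduced without pandas, which is consistent with a direct cast of the raw field. The sketch below is only an illustration: `raw_fields` is a name introduced here, holding the third field of each of the two sample rows, and nothing in it shows what pandas does internally.

```python
# Raw 'number' fields from the two sample rows, exactly as they sit between
# the '|' separators (hypothetical variable name, for illustration only).
raw_fields = [" 37", " "]

results = []
for field in raw_fields:
    try:
        # int() ignores surrounding whitespace but rejects a blank string.
        results.append(int(field))
    except ValueError as exc:
        # Keep the message so it can be compared with the traceback.
        results.append(str(exc))

print(results)
```

The first field converts to 37, while the blank field raises the same "invalid literal for int()" message that ends the traceback.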
Is there some way I can fix this? I looked through the documentation but didn't see anything that looked like it would directly address this problem. Is this just a bug that needs to be reported to pandas?
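To make the question concrete, here is a minimal sketch of one possible workaround: a tolerant converter that maps blank fields to NaN. `int_or_nan` is a hypothetical helper, not a pandas function, and the sketch assumes it would be supplied through the `converters` argument already used in the MWE, with the 'number' column left out of `dtype`.

```python
import math


def int_or_nan(field):
    """Hypothetical helper: parse an int, mapping blank fields to NaN."""
    text = field.strip()
    return int(text) if text else math.nan


# The 'number' column from the two sample rows in the MWE.
parsed = [int_or_nan(s) for s in [" 37", " "]]
print(parsed)  # [37, nan]

# Assumption: with pandas this would be passed as
# converters={'number': int_or_nan}, with 'number' removed from dtype.
```

Because NaN is a float, a column parsed this way can no longer be a plain int64 column, which is part of why a built-in way to handle the missing values would be preferable.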