Handling exceptions when reading data with pandas.read_csv()

Question

I would like to read a CSV file in pieces with pd.read_csv(path, chunksize = N) until the end of the file, in an elegant and efficient way. The problem is that once the reader moves past the end of the file, the following error is raised:

df.get_chunk()
Traceback (most recent call last):

File "<ipython-input-115-061ea8dbcbad>", line 1, in <module>
df.get_chunk()

File "C:\Users\fedel\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 784, in get_chunk
return self.read(nrows=size)

File "C:\Users\fedel\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 763, in read
ret = self._engine.read(nrows)

File "C:\Users\fedel\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 1213, in read
data = self._reader.read(nrows)

File "pandas\parser.pyx", line 766, in pandas.parser.TextReader.read (pandas\parser.c:7988)

File "pandas\parser.pyx", line 813, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8629)

StopIteration

and execution stops!

I believe a try/except statement would let me catch that error so that the code can carry on with the next steps. Say I have a pandas DataFrame like the one generated by the following lines of code:

import numpy as np
import pandas as pd

path = r"C:\Users\fedel\Desktop" + '\\fileName.csv'
pd.DataFrame(np.random.randn(30, 3), columns=list('abc')).to_csv(path, index=False)
df = pd.read_csv(path, chunksize=6)

I think a statement like the following could catch that error and let the code continue:

while True:
    try:
        df.get_chunk()
    except TypeOfError:
        funcyfunction()

Could you fix these exception-handling lines, please?

Try: df = pd.read_csv(path, chunksize = 6, error_bad_lines=False) to skip the lines causing errors.
– Nickil Maveli, Commented Aug 20, 2016 at 19:09
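
A note on the option suggested in this comment: `error_bad_lines` was deprecated in pandas 1.3 and removed in pandas 2.0; its replacement is `on_bad_lines='skip'`. It deals with malformed rows, not with the `StopIteration` raised at the end of the file. A minimal sketch, assuming pandas 1.3 or later and a made-up in-memory CSV (`csv_text` and `clean` are illustrative names, not from the question):

```python
import io

import pandas as pd

# Assumption: a small in-memory CSV whose third line has one field too many.
csv_text = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="skip" (pandas >= 1.3) drops rows that have too many fields
# instead of raising a ParserError.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, on_bad_lines="skip")

# Stitch the surviving rows from all chunks back together.
clean = pd.concat(reader, ignore_index=True)
print(clean)
```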

Nickil Maveli · Accepted Answer · 2016-08-20 20:05:06Z

1

You could try:

df = pd.read_csv(path, chunksize=6)
for chunk in df:
    print(chunk)
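
The `for` loop works because the object returned by `pd.read_csv(..., chunksize=...)` is an iterator, and `for` absorbs the `StopIteration` raised at the end of the file. If you prefer to call `get_chunk()` explicitly, as in the question, the exception to catch is `StopIteration`. A minimal sketch, assuming an in-memory CSV (via `io.StringIO`) in place of the file on disk; `reader` and `chunk_shapes` are hypothetical names used only for illustration:

```python
import io

import numpy as np
import pandas as pd

# Assumption: an in-memory CSV stands in for the file at `path`.
csv_text = pd.DataFrame(np.random.randn(30, 3), columns=list("abc")).to_csv(index=False)

reader = pd.read_csv(io.StringIO(csv_text), chunksize=6)

chunk_shapes = []  # hypothetical container, for illustration only
while True:
    try:
        chunk = reader.get_chunk()
    except StopIteration:
        # Raised once the reader has no rows left; leave the loop cleanly.
        break
    chunk_shapes.append(chunk.shape)

print(chunk_shapes)
```

Any code placed after the `while` loop then runs normally, which is the behaviour the question asks for.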

In case you want to carry out operations on each chunk, you can do:

df = pd.read_csv(path, chunksize=6)    # Re-create the reader: the previous loop exhausted it
for chunk in df:
    chunk['d'] = chunk[['a', 'b']].mean(axis=1)    # Average of columns 'a' and 'b'
    print(chunk)
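
If the per-chunk results need to be kept instead of printed, one option is to collect them in a list and join them with `pd.concat` once the loop ends. A rough sketch, again assuming an in-memory CSV in place of the file on disk; `row_means` and `ab_mean` are hypothetical names:

```python
import io

import numpy as np
import pandas as pd

# Assumption: an in-memory CSV stands in for the file at `path`.
csv_text = pd.DataFrame(np.random.randn(30, 3), columns=list("abc")).to_csv(index=False)

row_means = []  # one Series per chunk (hypothetical name)
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=6):
    # Row-wise average of columns 'a' and 'b' for this chunk only.
    row_means.append(chunk[["a", "b"]].mean(axis=1))

# Join the per-chunk Series into a single Series covering the whole file.
ab_mean = pd.concat(row_means)
print(ab_mean.head())
```

Only one chunk is held in memory at a time during the loop; the list stores just the reduced results.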

edited Aug 20, 2016 at 20:05

answered Aug 20, 2016 at 19:41

Nickil Maveli


4 Comments

Stefano Fedele Over a year ago

That way I can only see each chunk; I would also like to process them.

Nickil Maveli Over a year ago

Do you want to combine these chunks into a single dataframe object?

Stefano Fedele Over a year ago

Say that in each iteration I would like to select only columns 'a' and 'b' and get a third pandas.Series holding the row-wise average.

Nickil Maveli Over a year ago

I've added an example for that case.
