I would like to read a CSV file in pieces with pd.read_csv(path, chunksize=N) until the end of the file, in an elegant and efficient way. The problem is that once the reader runs past the end of the file, the following error is raised:

df.get_chunk()
Traceback (most recent call last):

File "<ipython-input-115-061ea8dbcbad>", line 1, in <module>
df.get_chunk()

File "C:\Users\fedel\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 784, in get_chunk
return self.read(nrows=size)

File "C:\Users\fedel\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 763, in read
ret = self._engine.read(nrows)

File "C:\Users\fedel\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 1213, in read
data = self._reader.read(nrows)

File "pandas\parser.pyx", line 766, in pandas.parser.TextReader.read (pandas\parser.c:7988)

File "pandas\parser.pyx", line 813, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8629)

StopIteration

and the code can't continue anymore!

I believe a try/except statement would let me catch that error so that the code can carry on with the following steps. Say I have a pandas DataFrame like the one generated by the following lines of code:

import numpy as np
import pandas as pd

path = r"C:\Users\fedel\Desktop" + '\\fileName.csv'
pd.DataFrame(np.random.randn(30, 3), columns=list('abc')).to_csv(path, index=False)
df = pd.read_csv(path, chunksize=6)  # with chunksize set, this is a chunk reader, not a DataFrame

I think something like the following could catch that error and let the code carry on with the next steps:

while True:
     try:
        df.get_chunk()
     except TypeOfError:
        funcyfunction()

Could you fix these last exception-handling lines of code, please?

  • Try: df = pd.read_csv(path, chunksize = 6, error_bad_lines=False) to skip the lines causing errors. Commented Aug 20, 2016 at 19:09
  • No, that doesn't work! It still gives me an error message. Commented Aug 20, 2016 at 19:12
  • Could you post the full traceback error? Commented Aug 20, 2016 at 19:13
  • Sure, I just added it to the main question. Commented Aug 20, 2016 at 19:17

1 Answer

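You could iterate over the reader directly; a for loop stops cleanly when the file is exhausted, so no exception handling is needed: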

df = pd.read_csv(path, chunksize=6)
for chunk in df:
    print(chunk)

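In case you want to carry out operations on each chunk, you can do the following (the reader is re-created first, because the loop above has already exhausted it):

df = pd.read_csv(path, chunksize=6)
for chunk in df:
    chunk['d'] = chunk[['a', 'b']].mean(axis=1)    # Average of columns 'a' and 'b'
    print(chunk)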
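
For completeness, the while loop from the question can also be made to work. As the traceback shows, get_chunk() raises StopIteration once the file is exhausted, so that is the exception to catch, and break then leaves the loop. A minimal sketch, assuming a temporary file in place of the Desktop path; the counters chunks_seen and rows_seen are illustrative names only:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small 30-row CSV like the one in the question.
# The temporary directory is an assumption, used so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "fileName.csv")
pd.DataFrame(np.random.randn(30, 3), columns=list("abc")).to_csv(path, index=False)

reader = pd.read_csv(path, chunksize=6)
chunks_seen = 0
rows_seen = 0

while True:
    try:
        chunk = reader.get_chunk()
    except StopIteration:
        # The reader is exhausted: leave the loop instead of crashing.
        break
    # Work on the chunk here; this sketch only counts chunks and rows.
    chunks_seen += 1
    rows_seen += len(chunk)

reader.close()
print(chunks_seen, rows_seen)
```

The for loop above does the same thing implicitly, which is why it is usually the simpler choice.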

4 Comments

That way I can only view each chunk; I would also like to process them.
Do you want to combine these chunks into a single dataframe object?
Say that on each iteration I would like to select only columns 'a' and 'b' and get a third pandas.Series that is the row-wise average.
I've added an example for that case.
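
If the per-chunk averages should end up in one pandas.Series rather than being printed, one possible approach is to collect them in a list and concatenate at the end. This is only a sketch: it reads the CSV from an in-memory string instead of a file, and the names csv_text, means, and row_means are illustrative, not taken from the question.

```python
import io

import numpy as np
import pandas as pd

# In-memory CSV standing in for the file from the question (an assumption).
csv_text = pd.DataFrame(np.random.randn(30, 3), columns=list("abc")).to_csv(index=False)

means = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=6):
    # Row-wise average of columns 'a' and 'b' for this chunk only.
    means.append(chunk[["a", "b"]].mean(axis=1))

# One Series covering every row of the file.
row_means = pd.concat(means)
print(len(row_means))
```

Only one chunk is held in memory at a time; the list keeps just the small per-chunk Series.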
