
I have a ~1.81GB CSV file which has ~49m rows. It has only one column that contains 38-character strings.

I am reading this file with read_csv on a DigitalOcean VPS (Ubuntu 12.04.4, Python 2.7, pandas 0.18.0, 512MB RAM), 5000 lines at a time. However, it started raising errors at skiprows=2800000. Here is the code I'm testing on a rebooted machine with a freshly started Python session:

>>> pd.read_csv(filename, skiprows=2800000, nrows=5000, header=None)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in __init__
    self.options, self.engine = self._clean_options(options, engine)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 731, in _clean_options
    skiprows = set() if skiprows is None else set(skiprows)
MemoryError

If I run it with skiprows=1000000 it works fine, but skiprows=1500000 raises the error again, which is strange because the error originally appeared only after reaching 2800000; the loop had gone through every multiple of 5000 before that without a problem. Any idea why this happens?

The code works fine on my personal computer:

df = pd.read_csv(filename, skiprows=2800000, nrows=5000, header=None)

df.memory_usage()
Out[25]: 
Index       72
0        40000
dtype: int64

Edit:

The original loop goes like this:

current_chunk = 560 
chnksize = 5000

for chunk in range(current_chunk, 1000):
    df = pd.read_csv(filename, skiprows=chnksize*chunk, nrows=chnksize, header=None)
    out = "chunk_" + format(chunk, "06d")
    short_ids = df[0].str.slice(-11)

It queries an API with the short_ids and appends the results to a file. But the snippet I posted at the top raises the error on its own.

4 Comments
  • can you show your loop? Commented Apr 26, 2016 at 20:26
  • Use the code formatting button for error messages, not the quote button. Commented Apr 26, 2016 at 20:30
  • @ayhan, you are misusing the skiprows parameter; you should be using chunksize Commented Apr 26, 2016 at 20:49
  • @ayhan, I know, it's not always easy to find in the pandas docs how to use some features ;) Commented Apr 26, 2016 at 21:00

2 Answers


Pandas is using a bafflingly memory-intensive way of implementing skiprows. In pandas.io.parsers.TextFileReader._clean_options:

if com.is_integer(skiprows):
    skiprows = lrange(skiprows)
skiprows = set() if skiprows is None else set(skiprows)

lrange(n) does list(range(n)), so this is basically doing skiprows = set(list(range(skiprows))). It's building two giant lists and a set, each containing 2.8 million integers! I suppose they never expected people to try to skip so many rows.
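
As a rough illustration (a back-of-the-envelope sketch, not pandas code; it assumes a 64-bit CPython 2.7 build like the one in the question), you can reproduce those allocations by hand:

import sys

n = 2800000  # the skiprows value from the question

r = range(n)   # on Python 2 this is already a full list of n ints
lst = list(r)  # lrange() copies it into a second list
s = set(lst)   # _clean_options() then builds a set on top of that

# sys.getsizeof reports only the container overhead, not the ~2.8 million
# int objects themselves (roughly 24 bytes each on 64-bit CPython 2.7).
# Even so, the containers alone come to well over 100MB, which on a 512MB
# VPS that is also running Python and pandas is enough to hit MemoryError.
for name, obj in [("range list", r), ("lrange copy", lst), ("set", s)]:
    print("%-11s %d bytes" % (name, sys.getsizeof(obj)))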


If you want to read through a file in chunks, calling read_csv repeatedly with different values of skiprows is an inefficient way to do it. You can pass read_csv a chunksize option and then iterate over the returned TextFileReader in chunks:

In [138]: reader = pd.read_table('tmp.sv', sep='|', chunksize=4)

In [139]: reader
Out[139]: <pandas.io.parsers.TextFileReader at 0x121159a50>

In [140]: for chunk in reader:
   .....:     print(chunk)
   .....: 
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
   Unnamed: 0         0         1         2         3
0           4 -0.424972  0.567020  0.276232 -1.087401
1           5 -0.673690  0.113648 -1.478427  0.524988
2           6  0.404705  0.577046 -1.715002 -1.039268
3           7 -0.370647 -1.157892 -1.344312  0.844885
   Unnamed: 0         0        1         2         3
0           8  1.075770 -0.10905  1.643563 -1.469388
1           9  0.357021 -0.67460 -1.776904 -0.968914

or pass iterator=True and use get_chunk to get chunks of specified sizes:

In [141]: reader = pd.read_table('tmp.sv', sep='|', iterator=True)

In [142]: reader.get_chunk(5)
Out[142]: 
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
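
For completeness, here is a sketch of how the loop from the question might be rewritten around chunksize. It assumes the single-column layout described above; the filename value and the API call are placeholders for the parts not shown in the question.

import itertools
import pandas as pd

filename = "ids.csv"   # placeholder for the actual file from the question
chunksize = 5000
start_chunk = 560      # resume point, as in the original loop

reader = pd.read_csv(filename, chunksize=chunksize, header=None)

# islice still parses the first 560 chunks, but only one chunk is ever held
# in memory at a time, so no giant skiprows list/set is built.
for i, df in enumerate(itertools.islice(reader, start_chunk, None), start_chunk):
    out = "chunk_" + format(i, "06d")
    short_ids = df[0].str.slice(-11)
    # ... query the API with short_ids and append the results to a file ...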

1 Comment

@ayhan: I'm not aware of any better way to do it. You could pass a larger chunk size, but that's likely to run into memory problems again.

The problem is that skiprows doesn't prevent the data from being loaded into memory, which is what gives you the MemoryError. For your problem, you should use the chunksize parameter of read_csv instead of nrows.

1 Comment

Has this been updated in newer versions?
