I have a ~1.81GB CSV file with ~49 million rows. It has a single column containing 38-character strings.
I am reading this file with read_csv on a Digital Ocean VPS (Ubuntu 12.04.4, Python 2.7, pandas 0.18.0, 512MB RAM). I read 5000 lines at a time, advancing skiprows on each call. However, it started raising a MemoryError at skiprows=2800000. Here is the code, tested on a freshly rebooted machine in a newly started Python session:
>>> pd.read_csv(filename, skiprows=2800000, nrows=5000, header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in __init__
    self.options, self.engine = self._clean_options(options, engine)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 731, in _clean_options
    skiprows = set() if skiprows is None else set(skiprows)
MemoryError
If I run it with skiprows=1000000, it works fine. If I try skiprows=1500000, it raises the error again, which is strange because the loop only started failing once it reached 2800000. It went through every multiple of 5000 before that with no problem. Any idea why this happens?
The code works fine on my personal computer:
df = pd.read_csv(filename, skiprows=2800000, nrows=5000, header=None)
df.memory_usage()
Out[25]:
Index 72
0 40000
dtype: int64
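The resulting frame is tiny (about 40 KB), so the memory is probably not going to the data that gets returned. The last traceback frame calls set(skiprows), which suggests that by that point the integer has been expanded into a collection of row indices; this is an inference from the traceback, not something confirmed against the pandas source here. Under that assumption, a rough sketch of how much memory such a set needs (the helper name skip_set_bytes is hypothetical, and the sketch is written for Python 3, so the figures will differ somewhat on Python 2.7):

```python
import sys


def skip_set_bytes(n_rows):
    """Approximate bytes needed for a set holding the row indices 0..n_rows-1."""
    rows = set(range(n_rows))
    table = sys.getsizeof(rows)                  # the set's hash table only
    ints = sum(sys.getsizeof(i) for i in rows)   # the integer objects it points to
    return table + ints


if __name__ == "__main__":
    # The two skiprows values mentioned in the question.
    for n in (1000000, 2800000):
        print(n, round(skip_set_bytes(n) / 1e6), "MB (approximate)")
```

On a 512MB machine an allocation of this size, plus any intermediate list built before the set, could plausibly fail or succeed depending on what else is resident at that moment, which might explain why the failures are not consistent.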
Edit:
The original loop goes like this:
current_chunk = 560
chnksize = 5000
for chunk in range(current_chunk, 1000):
    df = pd.read_csv(filename, skiprows=chnksize*chunk, nrows=chnksize, header=None)
    out = "chunk_" + format(chunk, "06d")
    short_ids = df[0].str.slice(-11)
The loop then queries each short_id from an API and appends the result to a file. However, the single read_csv call shown at the top raises the error on its own.
Instead of the skiprows parameter, you should be using chunksize.
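A minimal sketch of what the chunksize approach could look like, assuming the file has no header and a single string column as described in the question. The helper name iter_short_ids and its start_chunk parameter are hypothetical; start_chunk plays the role of current_chunk in the original loop, and dtype=str is added as a precaution so the .str accessor works even if the values look numeric.

```python
import pandas as pd


def iter_short_ids(path, chunksize=5000, start_chunk=0):
    """Yield (chunk_number, short_ids) pairs while reading the file once."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading everything (or a skiprows collection) up front.
    reader = pd.read_csv(path, header=None, dtype=str, chunksize=chunksize)
    for chunk_number, df in enumerate(reader):
        if chunk_number < start_chunk:
            continue  # earlier chunks are still parsed, but discarded one at a time
        # Column 0 is the only column; keep the last 11 characters of each string.
        yield chunk_number, df[0].str.slice(-11)
```

Usage would then mirror the original loop, for example:

for chunk, short_ids in iter_short_ids(filename, chunksize=5000, start_chunk=560):
    out = "chunk_" + format(chunk, "06d")

Resuming this way re-reads the skipped rows on each restart, which costs time but keeps memory use bounded by a single chunk.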