I'm using csv.DictReader in Python 3 to process a very large CSV file, and I ran into a strange memory usage issue.
The code looks like this:
import os
import csv

import psutil  # requires: pip install psutil

# the real CSV file is replaced with a call to this function
def generate_rows():
    for k in range(400000):
        yield ','.join(str(i * 10 + k) for i in range(35))

def memory_test():
    proc = psutil.Process(os.getpid())
    print('BEGIN', proc.memory_info().rss)
    fieldnames = ['field_' + str(i) for i in range(35)]
    reader = csv.DictReader(generate_rows(), fieldnames)
    result = []
    for row in reader:
        result.append(row)
    print(' END', proc.memory_info().rss)
    return result

if __name__ == '__main__':
    memory_test()
The program above prints its memory usage (RSS, via psutil) before and after reading all the rows, and the output looks like this:
BEGIN 12623872
END 2006462464
You can see that by the end of the run it takes nearly 2 GB of memory.
But if I copy each row, the memory usage becomes much lower:
def memory_test():
    proc = psutil.Process(os.getpid())
    print('BEGIN', proc.memory_info().rss)
    fieldnames = ['field_' + str(i) for i in range(35)]
    reader = csv.DictReader(generate_rows(), fieldnames)
    result = []
    for row in reader:
        # MAKE A COPY
        row_copy = {key: value for key, value in row.items()}
        result.append(row_copy)
    print(' END', proc.memory_info().rss)
    return result
The output is now:
BEGIN 12726272
END 1289912320
Now it takes only about 1.29 GB of memory, much less than before.
(I tested the code on 64-bit Ubuntu and got those results.)
Why does this happen? Is it proper to copy the rows from a DictReader?
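(For reference, I tried to narrow down where the per-row difference comes from. On the Python versions where I see this, csv.DictReader yields OrderedDict rows, while the dict-comprehension copy produces a plain dict; the per-object sizes can be compared like this. The field values below are made up just for the measurement.)

```python
import sys
from collections import OrderedDict

# Mimic one row: 35 fields, as in the question
keys = ['field_' + str(i) for i in range(35)]
values = [str(v) for v in range(35)]

od_row = OrderedDict(zip(keys, values))       # row type yielded by DictReader before Python 3.8
dict_row = {k: v for k, v in od_row.items()}  # what the copy in the loop produces

print('OrderedDict row:', sys.getsizeof(od_row), 'bytes')
print('plain dict row: ', sys.getsizeof(dict_row), 'bytes')
```

On my machine the OrderedDict comes out noticeably larger than the plain dict, which would be multiplied by 400000 rows.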
Comments:

"result = list(csv.DictReader(generate_rows(), fieldnames)) -- this avoids appending to a list several times, which causes CPython to keep reallocating memory to increase the size of the list."

"result = [change_row(row) for row in csv.DictReader(...)], if possible"
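(Regarding the list(...) suggestion: a list grown by repeated append is over-allocated, while one built in a single step from a sized iterable is allocated exactly. The effect can be observed with sys.getsizeof; this sketch is not from the original question.)

```python
import sys

# Grow a list one append at a time, as in the loop above
grown = []
for i in range(400000):
    grown.append(i)

# Build the same list in one step from a sized iterable
exact = list(range(400000))

print('append-grown:', sys.getsizeof(grown), 'bytes')
print('one-shot:    ', sys.getsizeof(exact), 'bytes')
```

The append-grown list typically reports a larger size because CPython over-allocates spare capacity on each resize; note this only measures the list object itself, not the row objects it references.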