Memory usage issue with Python csv.DictReader

Question

I'm using csv.DictReader in Python 3 to process a very large CSV file. And I found a strange memory usage issue.

The code is like this

import os
import csv
import psutil # require pip install

# the real CSV file is replaced with a call to this function
def generate_rows():
    for k in range(400000):
        yield ','.join(str(i * 10 + k) for i in range(35))


def memory_test():
    proc = psutil.Process(os.getpid())
    print('BEGIN', proc.memory_info().rss)

    fieldnames = ['field_' + str(i) for i in range(35)]
    reader = csv.DictReader(generate_rows(), fieldnames)
    result = []
    for row in reader:
        result.append(row)

    print('  END', proc.memory_info().rss)
    return result


if __name__ == '__main__':
    memory_test()

In the code above the program will print the memory usage (which requires psutil installed) and the result is like

BEGIN 12623872
  END 2006462464

You can see by the end of the process it would take nearly 2GB memory.

But if I copy each row, the memory usage become lower.

def memory_test():
    proc = psutil.Process(os.getpid())
    print('BEGIN', proc.memory_info().rss)

    fieldnames = ['field_' + str(i) for i in range(35)]
    reader = csv.DictReader(generate_rows(), fieldnames)
    result = []
    for row in reader:
        # MAKE A COPY
        row_copy = {key: value for key, value in row.items()}
        result.append(row_copy)

    print('  END', proc.memory_info().rss)
    return result

The result is like

BEGIN 12726272
  END 1289912320

It only takes about 1.29G memory, much less.

(I tested the code on 64-bit Ubuntu and got those results.)

Why does this happen? Is it proper to copy the rows from a DictReader?

Btw, if you want to make this code more efficient you'd write result = list(csv.DictReader(generate_rows(), fieldnames)) -- this avoids appending to a list several times, which causes CPython to keep reallocating memory to increase the size of the list. — Elias Dorneles
– Elias Dorneles, Commented Aug 8, 2018 at 9:08
@elias Indeed. I just kept the code the way it looks like because there're some changes to make to the row object in the real code. — neuront
– neuront, Commented Aug 8, 2018 at 23:51
It might still be worth to move those changes into a function and turn it into a list comprehension result = [change_row(row) for row in csv.DictReader(...)], if possible — Elias Dorneles
– Elias Dorneles, Commented Aug 9, 2018 at 7:59

Sraw · Accepted Answer · 2018-08-08 09:04:31Z

3

If you print(row), you will find that row is an OrderedDict. In your second example, you replace this OrderedDict with a normal dict. They are different.

You can get the same results by using OrderedDict in second example:

for row in reader:
    from collections import OrderedDict
    # MAKE A COPY
    row_copy = OrderedDict({key: value for key, value in row.items()})
    result.append(row_copy)

answered Aug 8, 2018 at 9:04

Sraw

20.6k11 gold badges61 silver badges93 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Memory usage issue with Python csv.DictReader

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related