I'm using csv.DictReader in Python 3 to process a very large CSV file, and I ran into a strange memory usage issue.

The code looks like this:

import os
import csv
import psutil  # third-party; install with: pip install psutil

# the real CSV file is replaced with a call to this function
def generate_rows():
    for k in range(400000):
        yield ','.join(str(i * 10 + k) for i in range(35))


def memory_test():
    proc = psutil.Process(os.getpid())
    print('BEGIN', proc.memory_info().rss)

    fieldnames = ['field_' + str(i) for i in range(35)]
    reader = csv.DictReader(generate_rows(), fieldnames)
    result = []
    for row in reader:
        result.append(row)

    print('  END', proc.memory_info().rss)
    return result


if __name__ == '__main__':
    memory_test()

The code above prints the memory usage of the process (this requires psutil to be installed). The output looks like this:

BEGIN 12623872
  END 2006462464

As you can see, by the end the process uses nearly 2 GB of memory.

But if I copy each row, the memory usage is lower:

def memory_test():
    proc = psutil.Process(os.getpid())
    print('BEGIN', proc.memory_info().rss)

    fieldnames = ['field_' + str(i) for i in range(35)]
    reader = csv.DictReader(generate_rows(), fieldnames)
    result = []
    for row in reader:
        # MAKE A COPY
        row_copy = {key: value for key, value in row.items()}
        result.append(row_copy)

    print('  END', proc.memory_info().rss)
    return result

The output looks like this:

BEGIN 12726272
  END 1289912320

It only takes about 1.29 GB of memory, which is much less.

(I tested the code on 64-bit Ubuntu and got those results.)

Why does this happen? Is it appropriate to copy the rows from a DictReader?

  • Btw, if you want to make this code more efficient you'd write result = list(csv.DictReader(generate_rows(), fieldnames)) -- this avoids appending to a list several times, which causes CPython to keep reallocating memory to increase the size of the list. Commented Aug 8, 2018 at 9:08
  • @elias Indeed. I kept the code this way because the real code makes some changes to the row object. Commented Aug 8, 2018 at 23:51
  • It might still be worth moving those changes into a function and turning the loop into a list comprehension, result = [change_row(row) for row in csv.DictReader(...)], if possible. Commented Aug 9, 2018 at 7:59
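
The list-comprehension suggestion in the last comment could look roughly like the sketch below. Here change_row is a hypothetical helper (the real per-row changes are not shown in the question), and the row generator is shrunk to three rows of three fields so the result is easy to inspect.

```python
import csv


def generate_rows(count=3, width=3):
    # Stand-in for the real CSV file, same pattern as in the question.
    for k in range(count):
        yield ','.join(str(i * 10 + k) for i in range(width))


def change_row(row):
    # Hypothetical per-row change: convert every value to int.
    # Building a new dict here also means each stored row is a plain dict.
    return {key: int(value) for key, value in row.items()}


fieldnames = ['field_' + str(i) for i in range(3)]
result = [change_row(row) for row in csv.DictReader(generate_rows(), fieldnames)]
print(result[0])
```

Because change_row returns a new dict, the rows kept in result are plain dicts regardless of the type DictReader yields.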

1 Answer

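If you print(row), you will find that row is an OrderedDict (this is what csv.DictReader returns on Python 3.6 and 3.7; since Python 3.8 it returns a plain dict). In your second example, you replace each OrderedDict with a plain dict. They are different types, and an OrderedDict carries extra bookkeeping for key order, so it takes more memory per row.

You can get the same result as in your first example by using an OrderedDict in the second example: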

from collections import OrderedDict  # import once, outside the loop

for row in reader:
    # MAKE A COPY, but keep the OrderedDict type
    row_copy = OrderedDict(row)
    result.append(row_copy)
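
To see the per-row difference directly, here is a minimal sketch that compares the container size of a dict and an OrderedDict holding the same 35 fields. It assumes CPython; row_sizes is a helper name made up for this example, and sys.getsizeof reports only the container itself, not the key and value strings, so the numbers will not match the RSS figures above exactly.

```python
import sys
from collections import OrderedDict


def row_sizes(num_fields=35):
    # Build one row shaped like those in the question.
    fieldnames = ['field_' + str(i) for i in range(num_fields)]
    values = [str(i * 10) for i in range(num_fields)]
    plain = dict(zip(fieldnames, values))
    ordered = OrderedDict(zip(fieldnames, values))
    # getsizeof covers only the container, not the strings it references.
    return sys.getsizeof(plain), sys.getsizeof(ordered)


plain_size, ordered_size = row_sizes()
print('dict       ', plain_size)
print('OrderedDict', ordered_size)
```

Multiplying the difference by 400000 rows gives a rough idea of how much of the gap comes from the container type alone.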