
I intend to read a file of about 500 MB into a dict, keyed by a field in each line. The code snippet is as follows:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])    ## convert [(1,2), (3,4)] to {1:2, 3:4}

When running on a machine with 4 GB of memory, Python raises a MemoryError. If I change the expression assigned to sample to [l for l in lines], it works fine.

At first, I thought it was the split method that was consuming lots of memory, so I adjusted my code to this:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

...

sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])

But it turned out the same.

A new discovery is that it runs normally, without running out of memory, as long as I remove the dict() conversion, regardless of the rest of the logic.

Could anyone give me some insight into this problem?

  • Somewhere on this site is a question about how much memory a dict takes, and it's much more than you would expect. Commented Apr 15, 2015 at 3:39
  • Could you give out the specific URL link related to what you've mentioned? Thanks. @MarkRansom Commented Apr 15, 2015 at 3:41
  • If I could remember it, I would have done so already. Sorry. Commented Apr 15, 2015 at 3:42
  • 1
    stackoverflow.com/questions/10264874/… Commented Apr 15, 2015 at 3:43
  • Also, are you reading tab-separated values? Commented Apr 15, 2015 at 3:48
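The overhead the first comment alludes to is easy to see with sys.getsizeof. This is a rough sketch, not an exact accounting: sizes vary by Python version, and getsizeof reports only the container itself, not the strings it holds.

```python
import sys

# Build the same data both ways: a plain list of lines, and a dict
# keyed on the third tab-separated field (as in the question).
lines = ["a\tb\tkey%d\trest" % i for i in range(100000)]

as_list = [l for l in lines]
as_dict = {l.split("\t")[2]: l for l in lines}

# The dict's hash table alone dwarfs the list's pointer array,
# and the split() calls create 100000 brand-new key strings on top.
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_dict))
```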

2 Answers


You're creating a list containing every line, which will continue to exist until lines goes out of scope, then creating another big list of entirely different strings based on it, then a dict from all of that before anything can be freed. Just build the dict in one step.

with open("ENST-NM-chr-name.txt") as f:
    sample = {}

    for l in f:
        l = l.strip()

        if l:
            sample[l.split("\t")[2].strip('"')] = l

You can achieve about the same effect by using a generator expression instead of a list comprehension, but it feels nicer (to me) not to strip twice.
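For reference, the generator-expression variant alluded to above might look like the following sketch; io.StringIO stands in for the question's 500 MB file, and note that strip() runs twice per line (once in the filter, once for the value).

```python
import io

# io.StringIO stands in for the question's tab-separated file.
f = io.StringIO('a\tb\t"NM_1"\tx\n\nc\td\t"NM_2"\ty\n')

# Generator expression: lines are consumed lazily, one at a time,
# so no list of all lines is ever held in memory.
lines = (l.strip() for l in f if l.strip())
sample = {l.split("\t")[2].strip('"'): l for l in lines}

print(sample)
```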


2 Comments

I removed the explicit r mode, because it is the default.
I agree, about not repeating strip. Another option is to do map(str.strip, f) (in Python 3), or itertools.imap(…) (in Python 2).
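The map(str.strip, f) option from that comment could be sketched like this (again with io.StringIO standing in for the real file):

```python
import io

f = io.StringIO('a\tb\t"K1"\tx\n\nc\td\t"K2"\ty\n')

sample = {}
for l in map(str.strip, f):  # each line stripped exactly once, lazily
    if l:                    # skip blank lines without a second strip()
        sample[l.split("\t")[2].strip('"')] = l

print(sample)
```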

What if you turn your list into a generator, and your dict into a lovely dictionary comprehension:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}

(Line 2 above originally read lines = (l.strip() for l in f2.readlines() if l.strip()); the readlines() call defeated the purpose of the generator, as the comments point out.)

Do a generator and a dict comprehension perhaps (somehow) alleviate the memory requirements?

6 Comments

This does not solve the (likely) problem of reading the whole file into memory (which is an unnecessary waste, in any case). minitech's answer avoids this.
Doesn't using a generator to read lines do exactly that? Since it doesn't build a list of lines in memory?
You were not using a generator with your f2.readlines(): this puts the whole file in memory, since it builds a list of all the lines. readlines() can be avoided most of the time. Even if you avoided it here by doing for l in f2 (where f2 is an iterator), you would still be storing all the lines in lines. Again, minitech's answer is the way to go (smallest possible memory footprint).
@EOL: lines = (l.strip() for l in f2 if l.strip()) would make it a lazily-evaluated generator. (Note parentheses instead of square brackets.)
Oops, I meant to remove readlines()... and do l in f2, as per minitech's comment.
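The distinction the comments are making, readlines() materialising every line up front versus iterating the file lazily, can be seen directly in a small sketch:

```python
import io

f = io.StringIO("one\ntwo\nthree\n")

eager = f.readlines()            # builds a list of all lines at once
print(type(eager).__name__)      # list

f.seek(0)
lazy = (l.strip() for l in f)    # yields lines one at a time on demand
print(type(lazy).__name__)       # generator
print(next(lazy))                # only the first line has been read
```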
