
I intend to read a file of about 500 MB into a dict, keyed by a field in each line. The code snippet is as follows:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])    ## convert [(1,2), (3,4)] to {1:2, 3:4}

When running on a machine with 4 GB of memory, Python raises a MemoryError. If I change the expression assigned to sample to [l for l in lines], it works fine.

At first, I thought it was the split method that was consuming lots of memory, so I adjusted my code to this:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

...

sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])

But it turned out the same.

A new discovery is that it runs normally, without running out of memory, as long as I remove the dict() conversion, regardless of the rest of the logic.

Could anyone give me some insight into this problem?

  • Somewhere on this site is a question about how much memory a dict takes, and it's much more than you would expect. Commented Apr 15, 2015 at 3:39
  • Could you give out the specific URL link related to what you've mentioned? Thanks. @MarkRansom Commented Apr 15, 2015 at 3:41
  • If I could remember it, I would have done so already. Sorry. Commented Apr 15, 2015 at 3:42
  • 1
    stackoverflow.com/questions/10264874/… Commented Apr 15, 2015 at 3:43
  • Also, are you reading tab-separated values? Commented Apr 15, 2015 at 3:48
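The overhead the first comment alludes to is easy to see with sys.getsizeof. This is a rough sketch, not an exact accounting: sizes vary by Python version, and getsizeof reports only the container itself, not the strings it holds.

```python
import sys

# Build the same data both ways: a plain list of lines, and a dict
# keyed on the third tab-separated field (as in the question).
lines = ["a\tb\tkey%d\trest" % i for i in range(100000)]

as_list = [l for l in lines]
as_dict = {l.split("\t")[2]: l for l in lines}

# The dict's hash table alone dwarfs the list's pointer array,
# and the split() calls create 100000 brand-new key strings on top.
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_dict))
```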

2 Answers


You're creating a list containing every line, which will continue to exist until lines goes out of scope, then creating another big list of entirely different strings based on it, then a dict from all of that before anything can be freed. Just build the dict in one step.

with open("ENST-NM-chr-name.txt") as f:
    sample = {}

    for l in f:
        l = l.strip()

        if l:
            sample[l.split("\t")[2].strip('"')] = l

You can achieve about the same effect by using a generator expression instead of a list comprehension, but it feels nicer (to me) not to strip twice.
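For reference, the generator-expression variant alluded to above might look like the following sketch; io.StringIO stands in for the question's 500 MB file, and note that strip() runs twice per line (once in the filter, once for the value).

```python
import io

# io.StringIO stands in for the question's tab-separated file.
f = io.StringIO('a\tb\t"NM_1"\tx\n\nc\td\t"NM_2"\ty\n')

# Generator expression: lines are consumed lazily, one at a time,
# so no list of all lines is ever held in memory.
lines = (l.strip() for l in f if l.strip())
sample = {l.split("\t")[2].strip('"'): l for l in lines}

print(sample)
```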


2 Comments

I removed the explicit r mode, because it is the default.
I agree, about not repeating strip. Another option is to do map(str.strip, f) (in Python 3), or itertools.imap(…) (in Python 2).
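The map(str.strip, f) option from that comment could be sketched like this (again with io.StringIO standing in for the real file):

```python
import io

f = io.StringIO('a\tb\t"K1"\tx\n\nc\td\t"K2"\ty\n')

sample = {}
for l in map(str.strip, f):  # each line stripped exactly once, lazily
    if l:                    # skip blank lines without a second strip()
        sample[l.split("\t")[2].strip('"')] = l

print(sample)
```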

What if you turn your list into a generator, and your dict into a lovely dictionary comprehension:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}

(Line 2 above originally read lines = (l.strip() for l in f2.readlines() if l.strip()); the readlines() call defeated the purpose of the generator, as the comments point out.)

Do a generator and a dict comprehension perhaps (somehow) alleviate the memory requirements?

6 Comments

This does not solve the (likely) problem of reading the whole file into memory (which is an unnecessary waste, in any case). minitech's answer avoids this.
Doesn't using a generator to read lines do exactly that? Since it doesn't build a list of lines in memory?
You were not using a generator with your f2.readlines(): this puts the whole file in memory, since it builds a list of all the lines. readlines() can be avoided most of the time. Even if you avoided it here by doing for l in f2 (where f2 is an iterator), you would still be storing all the lines in lines. Again, minitech's answer is the way to go (smallest possible memory footprint).
@EOL: lines = (l.strip() for l in f2 if l.strip()) would make it a lazily-evaluated generator. (Note parentheses instead of square brackets.)
Oops, I meant to remove readlines()... and do l in f2, as per minitech's comment.
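The distinction the comments are making, readlines() materialising every line up front versus iterating the file lazily, can be seen directly in a small sketch:

```python
import io

f = io.StringIO("one\ntwo\nthree\n")

eager = f.readlines()            # builds a list of all lines at once
print(type(eager).__name__)      # list

f.seek(0)
lazy = (l.strip() for l in f)    # yields lines one at a time on demand
print(type(lazy).__name__)       # generator
print(next(lazy))                # only the first line has been read
```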
