
I need to parse a file (~500 MB) and partially load it into a list; I don't need the entire file in memory.

I had a feeling that Python allocates much more memory for the list than the size of the data it contains.

I tried to use asizeof from Pympler to estimate the overhead, but it fails with a MemoryError, which seems strange to me: I thought that if the list is already in memory, asizeof would just walk over it, sum the sizes of all entries, and be done.

Then I took a chunk of the initial file, and I was shocked by the size asizeof reported: the list was three times bigger than the file.

So the questions are: is the size reported by asizeof correct? What is a more memory-efficient way to use a list in Python? And how can I check the size of a bigger list when asizeof fails with a MemoryError?
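For context, here is a minimal, self-contained snippet (not my real parser; the line format is made up) that reproduces the kind of overhead I mean: each short string carries a fixed object header on top of its characters, and the list itself only stores pointers.

import sys

# 1000 distinct short strings standing in for parsed file lines
lines = ["line %06d,foo,bar\n" % i for i in range(1000)]

raw_bytes = sum(len(line) for line in lines)               # what the file would hold
list_bytes = sys.getsizeof(lines)                          # the pointer array only
string_bytes = sum(sys.getsizeof(line) for line in lines)  # object headers included

print("raw data:    %d bytes" % raw_bytes)
print("list object: %d bytes" % list_bytes)
print("str objects: %d bytes (%.1fx the raw data)"
      % (string_bytes, string_bytes / raw_bytes))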

  • List of what? If it's line-based, just iterate over the open file object rather than loading all lines ... Commented Dec 30, 2013 at 16:03
  • What do you actually need to do with the file? If you only need one line at a time, iterate over the file. There's overhead that will make a list of the file's lines significantly bigger than the file itself if the lines are short, but that should be addressed by not loading much of the file at once. Commented Dec 30, 2013 at 16:06
  • @wim, I wish I could iterate, but I need the data from the file in memory, and I think ~500 MB is not an extremely huge file Commented Dec 30, 2013 at 16:07
  • "I need the data from file in memory" - yes, but what do you need to do with it? Commented Dec 30, 2013 at 16:11
  • @user2357112, count occurrences and co-occurrences, and based on those counts do more work with the data (see the streaming sketch after these comments) Commented Dec 30, 2013 at 16:24
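A minimal sketch of the streaming approach suggested in these comments: count token occurrences and adjacent-pair co-occurrences one line at a time, so only the counters stay in memory. The file name and whitespace tokenization are assumptions for illustration.

from collections import Counter

occurrences = Counter()
cooccurrences = Counter()

with open("data.txt") as f:        # hypothetical input file
    for line in f:                 # one line in memory at a time
        tokens = line.split()
        occurrences.update(tokens)
        cooccurrences.update(zip(tokens, tokens[1:]))  # adjacent pairs

print(occurrences.most_common(10))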

1 Answer


It would be helpful to see the code you use for reading/parsing the file and also how you invoke pympler.asizeof.

asizeof and all other facilities in Pympler work inside the profiled process (using Python's introspection facilities to navigate reference graphs). That means the profiling overhead itself might become a problem when sizing reference graphs with a large number of nodes (objects), especially if you are already tight on memory before you start profiling. Be sure to set all=False and code=False when calling asizeof. In any case, please file a bug on GitHub; maybe running out of memory can be avoided in this scenario.
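For example, a minimal invocation with those options might look like this (big_list is a stand-in for your parsed list, not a name from your code):

from pympler import asizeof

big_list = [["a", "b", str(i)] for i in range(1000)]  # stand-in for the parsed data

# restrict sizing to big_list itself and skip code objects, as suggested above
print("asizeof: %d bytes" % asizeof.asizeof(big_list, all=False, code=False))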

To the best of my knowledge, the sizes reported by asizeof are accurate as long as sys.getsizeof returns the correct size for the individual objects (assuming Python >= 2.6). You could set align=1 when calling asizeof and see if the numbers are more in line with what you expect.
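For a single object you can compare both measures directly; a small check along these lines (the string is just an example value):

import sys
from pympler import asizeof

line = "some,parsed,line\n"
print("sys.getsizeof:    %d bytes" % sys.getsizeof(line))
print("asizeof, align=1: %d bytes" % asizeof.asizeof(line, align=1))  # no alignment padding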

You could also check the virtual size of your process via your platform's tools or pympler.process:

from pympler.process import ProcessMemoryInfo
pmi = ProcessMemoryInfo()
print ("Process virtual size [Byte]: " + str(pmi.vsz)) 

This metric should always be higher than what asizeof reports when sizing objects.
