I'm trying to parse a large list of strings with comma-separated values into (a list of) lists. I'm trying to do this "in place", so as not to duplicate an already large object in memory.

Now, ideally, during and after parsing, the only additional memory required would be the overhead of representing the original strings as lists of strings. But what actually happens is much, much worse.

E.g. this list of strings occupies ~1.36 GB of memory:

import psutil

l = [f"[23873498uh3149ubn34, 59ubn23459un3459, un3459-un345, 9u3n45iu9n345, {i}]" for i in range(10_000_000)]
psutil.Process().memory_info().rss / 1024**3
1.3626747131347656

The desired end result of the parsing would take up somewhat more (~1.8 GB):

import psutil

l = [["23873498uh3149ubn34", "59ubn23459un3459", "un3459-un345", "9u3n45iu9n345", str(i)] for i in range(10_000_000)]
psutil.Process().memory_info().rss / 1024**3
1.7964096069335938

However, actually doing the parsing of the original strings requires a whopping 5 GB of memory, i.e. much more even than the initial strings plus the final lists together:

import psutil

l = [f"[23873498uh3149ubn34, 59ubn23459un3459, un3459-un345, 9u3n45iu9n345, {i}]" for i in range(10_000_000)]

for i, val in enumerate(l):
    l[i] = val.split(", ")
    
psutil.Process().memory_info().rss / 1024**3
4.988628387451172

Now, I understand that pure Python strings and lists are not terribly memory-efficient themselves, but I fail to understand the huge gap between the memory required for the final result (~1.8 GB) and what's used in the process of getting there (~5 GB).

Can anyone explain what exactly is going on, whether it is possible to actually modify lists "in place" (while actually freeing the memory of replaced values), and whether there is a better in-memory way to do this?

  • Where do you get those values from? Certainly not from a for loop but from a file or something. Why worry about performance of a scenario that doesn't even represent the real case? And: what's the problem with 5 GB of RAM? Doesn't your PC have that much? Commented Mar 11, 2022 at 17:58
  • numbers = list(range(10_000_000)); print(psutil.Process().memory_info().rss / 1024 ** 3) returned '1.6968574523925781' — is part of that overhead contributed by using enumerate? Commented Mar 11, 2022 at 18:25

3 Answers


Reformatting so it's readable, your estimate of the memory it "should take" is wildly optimistic:

l = [["23873498uh3149ubn34",
      "59ubn23459un3459",
      "un3459-un345",
      "9u3n45iu9n345",
      str(i)] for i in range(10_000_000)]

That's way off base. For example, there is only one string object created for "9u3n45iu9n345", which is shared across the 10 million lists.

>>> print(l[10][2] is l[56000][2]) # first indices are arbitrary
True

Your actual code creates 10 million distinct string objects at each index position.
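A small sketch (using two of the substrings from the question) makes the difference visible: a list comprehension built from string literals reuses one object per literal, while split() allocates a fresh string object every time:

```python
# Literals vs. split() results: object identity tells the story.
literal_lists = [["23873498uh3149ubn34", "59ubn23459un3459"] for _ in range(3)]
split_lists = [s.split(", ") for s in ["23873498uh3149ubn34, 59ubn23459un3459"] * 3]

# The literal is a single shared object across all three lists...
print(literal_lists[0][0] is literal_lists[2][0])  # True
# ...while each split() call creates distinct string objects.
print(split_lists[0][0] is split_lists[2][0])      # False
```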

Try running

import sys
sys._debugmallocstats()

from time to time to get more clues about what the memory is being used for.

Here under 3.10.1, on 64-bit Windows, some summary output at the end shows that allocated RAM is being used very efficiently (nearly all allocated blocks are in active use):

# arenas allocated total           =                6,601
# arenas reclaimed                 =                1,688
# arenas highwater mark            =                4,913
# arenas allocated current         =                4,913
4913 arenas * 1048576 bytes/arena  =        5,151,653,888

# bytes in allocated blocks        =        5,129,947,088
# bytes in available blocks        =              701,584
53 unused pools * 16384 bytes      =              868,352
# bytes lost to pool headers       =           15,090,192
# bytes lost to quantization       =            5,046,672
# bytes lost to arena alignment    =                    0
Total                              =        5,151,653,888

arena map counts
# arena map mid nodes              =                    1
# arena map bot nodes              =                   20

# bytes lost to arena map root     =                8,192
# bytes lost to arena map mid      =                8,192
# bytes lost to arena map bot      =               40,960
Total                              =               57,344

Note: an easy way to save about 800 million bytes:

    l[i] = tuple(val.split(", ")) # use this
    # l[i] = val.split(", ")      # instead of this

Lists "overallocate" so that a sequence of appends runs in amortized O(1) time. Tuples don't - they allocate only as much as needed to hold their initial content (which is also their final content, since tuples are immutable).


3 Comments

Yes, you're totally right, there is nothing weird going on, just the fact that in my second example (estimating how much memory the strings should occupy), the strings are being interned, while that's not the case when they're properly parsed. So it's simply Python being generally rather inefficient with strings.
"Interning" is the wrong word here, but I know what you mean ;-) In general, you can't create a string object on current 64-bit boxes that requires less than 64 bytes (due to the need to align memory allocations to 16-byte boundaries). So 10 million lists times 5 strings per list times at least 64 bytes per string is over 3 gigabytes right there. I was more surprised by how much RAM switching to tuples saved. I opened an issue report on that, because I think it's unreasonable: bugs.python.org/issue46990?
FWIW, while tuples reduce memory usage by about 750MB for the above 10M lists of strings (4.23GB in total), explicitly interning the strings saves another 2.4GB (1.8GB in total). Now, this is probably a (very) degenerate case, as most strings are in fact duplicates, and if this was known about the data some kind of dictionary-encoding would be appropriate, but still interesting: from sys import intern; ... l[i] = tuple(intern(x) for x in val.split(", "))
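A tiny sketch of what sys.intern buys here: it maps equal strings to one shared object, so each duplicate costs only a pointer rather than a full string allocation:

```python
from sys import intern

# Two rows that happen to repeat the same field text.
a = "9u3n45iu9n345, 1".split(", ")
b = "9u3n45iu9n345, 2".split(", ")

print(a[0] is b[0])                  # False: two distinct copies of the same text
print(intern(a[0]) is intern(b[0]))  # True: both now reference one shared object
```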

If you are not consuming the entire list at once, you can use a generator expression:

import psutil

data = (f"[23873498uh3149ubn34, 59ubn23459un3459, un3459-un345, 9u3n45iu9n345, {i}]" for i in range(10_000_000))
res = (x.split(', ') for x in data)

print(psutil.Process().memory_info().rss / 1024**3) # 0.08874893188476562

1 Comment

Yes, of course. But the question was specifically about what's going on with all that memory for strings. Also, the last statement in your example doesn't really measure anything, since you haven't generated or processed a single string yet (the 0.0887 GB is the same amount you would get measuring right after the import of psutil).
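For the measurement to mean anything, the pipeline has to be driven. A minimal sketch (with a smaller range, for brevity) that consumes one row at a time without ever materializing the full list:

```python
# Nothing is parsed until iteration pulls items through the pipeline.
data = (f"[23873498uh3149ubn34, 59ubn23459un3459, {i}]" for i in range(1000))
rows = (x.split(", ") for x in data)

# Only one row exists in memory at a time while iterating.
count = sum(len(row) for row in rows)
print(count)  # 3000: 1000 rows of 3 fields each
```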

Measuring RSS is wrong. RSS is that part of virtual memory that currently resides in physical RAM. If, for example, I have only 4 GB of physical RAM, I'll never measure more than 4 GB. So: if you want to keep that number small, remove RAM modules from your computer.

The data in the strings is about 650 MB.

A list needs 56 bytes in memory. You are creating 10,000,000 of them, so the lists account for 560 MB of additional memory.

A string takes 49 bytes in memory, plus the characters themselves. After splitting the long string into pieces, you have 50,000,000 strings, so they account for another 2450 MB.

Another problem is garbage collection. With psutil, you're asking the operating system how much memory it has given to Python. Python may have requested more than it currently needs because it recognized that you're dealing with a lot of data, and some of that memory can still be considered free internally, held by Python's allocator for reuse.

Other than that, I claim you should do real-world tests. A good CSV parser will not load the entire file into memory and then split the string. Instead, it will stream the data and split at each comma as it finds one.
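For example, Python's built-in csv module accepts any iterable of lines and parses them one at a time, so only the current row is held in memory. A sketch using an in-memory buffer in place of a real file:

```python
import csv
import io

# io.StringIO stands in for an open file here.
buf = io.StringIO("a,b,1\na,b,2\na,b,3\n")

total_rows = 0
for row in csv.reader(buf):  # yields one parsed row at a time
    total_rows += 1
print(total_rows)  # 3
```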

To get the overhead for the classes, I used

import sys
print(sys.getsizeof([]))
print(sys.getsizeof(""))

2 Comments

While less convenient, sys._debugmallocstats() (see my answer) is more accurate. For example, it's impossible for a list object to consume just 56 bytes these days. getsizeof() has no way to know this, but under current releases of CPython we have to align allocated memory blocks to 16-byte boundaries. So the least memory a list can consume now is 64 bytes (although 8 bytes sit unused). _debugmallocstats() gives an exact account of the RAM "small objects" consume, regardless of whether it goes to data, cyclic gc links, or alignment padding. ...
However you measure it, the amount of memory used is still far greater in the parsed case compared to my overly optimistic estimation (see Tim Peters's answer above). Also, regarding garbage collection etc., I know that some of the memory could indeed be reused by Python even though it's not going to be "returned" to the OS, but that doesn't change the fact that the process gets killed if it requests too much memory in the first place. Finally, I also know I could simply stream-process the strings, but the question was specifically about what happens in memory.
