
I'm working on an app that processes a lot of data.

… and it keeps running my computer out of memory. :(

Python objects carry a huge amount of memory overhead (as reported by sys.getsizeof()). A tuple holding a single integer takes 56 bytes, for example; an empty list, 64 bytes. Serious overhead.
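To illustrate the overhead (exact numbers vary by CPython version and platform; these are typical for a 64-bit build):

```python
import sys

# Per-object overhead reported by sys.getsizeof() on CPython 3.x, 64-bit.
# The payload is 8 bytes or less; the rest is object bookkeeping.
print(sys.getsizeof((1,)))   # one-element tuple: ~56 bytes
print(sys.getsizeof([]))     # empty list: ~56-64 bytes
print(sys.getsizeof(1.0))    # a lone float object: ~24 bytes
```

And that's before counting the per-element pointers a list of tuples would add on top.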

Numpy arrays are great for reducing overhead, but they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). The array module (https://docs.python.org/3/library/array.html) seems promising, but it's 1d. My data is 2d, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of 2 ints (ideally uint32) for the other. Obviously, spending ~80 bytes of Python structure to store 12 or 8 bytes of data per row is going to wreck my memory consumption.

Is the only realistic way to keep memory usage down in Python to "fake" 2d, i.e. by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)//WIDTH?
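For the record, here's a minimal sketch of that fake-2d approach on top of the array module (the helper names are just illustrative):

```python
from array import array

WIDTH = 3  # three float32 values per row

# Typecode 'f' is a C float (32-bit): ~4 bytes per value plus a single
# object's overhead, instead of ~80 bytes of structure per Python row.
arr = array('f')

def append_row(a, row):
    a.extend(row)              # amortized O(1) growth, like a list

def get(a, row, col):
    return a[row * WIDTH + col]

def num_rows(a):
    return len(a) // WIDTH

append_row(arr, (1.0, 2.0, 3.0))
append_row(arr, (4.0, 5.0, 6.0))
print(num_rows(arr))           # 2
print(get(arr, 1, 2))          # 6.0
```

It works, but the indexing arithmetic leaks into every caller, which is exactly what makes it feel like a workaround.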

3 Comments
  • There are a lot of ways to create arrays in numpy. Where is your data coming from? Are you computing it, reading it from a socket, pulling it from a CSV file, or ... ? Commented Jul 22, 2017 at 3:06
  • ints, floats, bytes, strings? Commented Jul 22, 2017 at 4:28
  • @Austin: I'm parsing it with regexes out of json files. Some points and lines get thrown away in processing, but the files are massive. wwii: I mentioned the datatypes in the question - ideally float32 and uint32. Commented Jul 22, 2017 at 11:14

1 Answer


Based on your comments, I'd suggest that you split your task into two parts:

1) In part 1, parse the JSON files with your regexes and write two CSV files in a simple format: no headers, no spaces, just numbers. This step should be fast and memory-safe: read text in, write text out, and don't keep anything in memory that you don't absolutely have to.
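A rough sketch of part 1, streaming line by line (the regex here is purely hypothetical — yours will depend on how the points actually appear in your JSON):

```python
import csv
import io
import re

# Hypothetical pattern: pretend each point appears as  "pt":[x,y,z]
# in the raw text. Substitute your real regexes here.
POINT_RE = re.compile(r'"pt":\[([\d.eE+-]+),([\d.eE+-]+),([\d.eE+-]+)\]')

def json_to_csv(src, dst):
    """Stream src line by line, writing one 'x,y,z' CSV row per match,
    so no more than one input line is ever held in memory."""
    writer = csv.writer(dst)
    for line in src:
        for m in POINT_RE.finditer(line):
            writer.writerow(m.groups())

# Demo with in-memory files; use open() on the real ones.
src = io.StringIO('{"pt":[1.0,2.0,3.0]}\n{"pt":[4.5,5.5,6.5]}\n')
dst = io.StringIO()
json_to_csv(src, dst)
```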

2) In part 2, use pandas' read_csv() function to slurp the CSV files in directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
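Something like this, forcing the compact dtypes you want (in-memory CSVs stand in for your real files):

```python
import io

import pandas as pd

# Headerless CSVs as produced in part 1. dtype keeps every value at
# 4 bytes instead of a full Python object.
points_csv = io.StringIO("1.0,2.0,3.0\n4.5,5.5,6.5\n")
points = pd.read_csv(points_csv, header=None, dtype='float32')

edges_csv = io.StringIO("0,1\n1,2\n")
edges = pd.read_csv(edges_csv, header=None, dtype='uint32')

print(points.values.dtype)   # float32
print(edges.values.dtype)    # uint32
```

And if you'd rather not keep pandas around afterwards, `.values` hands you the underlying 2-D NumPy array.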


2 Comments

Actually, that's a great idea; it'll save a lot of time when processing needs to be re-run (which happens often during initial debugging). Thanks! (As long as the data structure pandas returns is memory-efficient. :) )
For the record, I'm skipping pandas - after messing around with it for a while I came to the conclusion that numpy's genfromtxt is much better.
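For anyone following that route, a minimal genfromtxt sketch for the same headerless CSV (again with an in-memory file standing in for the real one):

```python
import io

import numpy as np

# genfromtxt parses the headerless CSV straight into a compact 2-D array.
data = np.genfromtxt(io.StringIO("1.0,2.0,3.0\n4.5,5.5,6.5\n"),
                     delimiter=',', dtype=np.float32)
print(data.shape)    # (2, 3)
print(data.nbytes)   # 24: six float32 values at 4 bytes each
```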
