
I'm working on an app that processes a lot of data.

… and it keeps running my computer out of memory. :(

Python objects carry a huge amount of memory overhead (as reported by sys.getsizeof()). A tuple holding a single integer takes 56 bytes, for example; an empty list, 64 bytes. Serious overhead.
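To illustrate the overhead (exact numbers vary by CPython version and platform; these are typical for a 64-bit build):

```python
import sys

# Per-object overhead reported by sys.getsizeof() on CPython 3.x, 64-bit.
# The payload is 8 bytes or less; the rest is object bookkeeping.
print(sys.getsizeof((1,)))   # one-element tuple: ~56 bytes
print(sys.getsizeof([]))     # empty list: ~56-64 bytes
print(sys.getsizeof(1.0))    # a lone float object: ~24 bytes
```

And that's before counting the per-element pointers a list of tuples would add on top.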

Numpy arrays are great for reducing overhead, but they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). The array module (https://docs.python.org/3/library/array.html) seems promising, but it's 1d. My data is 2d, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of 2 ints (ideally uint32) for the other. Obviously, spending ~80 bytes of Python structure to store 12 or 8 bytes of data per row is going to wreck my memory consumption.

Is the only realistic way to keep memory usage down in Python to "fake" 2d, i.e. by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)//WIDTH?
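For the record, here's a minimal sketch of that fake-2d approach on top of the array module (the helper names are just illustrative):

```python
from array import array

WIDTH = 3  # three float32 values per row

# Typecode 'f' is a C float (32-bit): ~4 bytes per value plus a single
# object's overhead, instead of ~80 bytes of structure per Python row.
arr = array('f')

def append_row(a, row):
    a.extend(row)              # amortized O(1) growth, like a list

def get(a, row, col):
    return a[row * WIDTH + col]

def num_rows(a):
    return len(a) // WIDTH

append_row(arr, (1.0, 2.0, 3.0))
append_row(arr, (4.0, 5.0, 6.0))
print(num_rows(arr))           # 2
print(get(arr, 1, 2))          # 6.0
```

It works, but the indexing arithmetic leaks into every caller, which is exactly what makes it feel like a workaround.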

3 Comments
  • There are a lot of ways to create arrays in numpy. Where is your data coming from? Are you computing it, reading it from a socket, pulling it from a CSV file, or ... ? Commented Jul 22, 2017 at 3:06
  • ints, floats, bytes, strings? Commented Jul 22, 2017 at 4:28
  • @Austin: I'm parsing it with regexes out of json files. Some points and lines get thrown away in processing, but the files are massive. wwii: I mentioned the datatypes in the question - ideally float32 and uint32. Commented Jul 22, 2017 at 11:14

1 Answer


Based on your comments, I'd suggest that you split your task into two parts:

1) In part 1, parse the JSON files with your regexes and write two CSV files in a simple format: no headers, no spaces, just numbers. This step should be fast and memory-safe: read text in, write text out, and don't keep anything in memory that you don't absolutely have to.
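A rough sketch of part 1, streaming line by line (the regex here is purely hypothetical — yours will depend on how the points actually appear in your JSON):

```python
import csv
import io
import re

# Hypothetical pattern: pretend each point appears as  "pt":[x,y,z]
# in the raw text. Substitute your real regexes here.
POINT_RE = re.compile(r'"pt":\[([\d.eE+-]+),([\d.eE+-]+),([\d.eE+-]+)\]')

def json_to_csv(src, dst):
    """Stream src line by line, writing one 'x,y,z' CSV row per match,
    so no more than one input line is ever held in memory."""
    writer = csv.writer(dst)
    for line in src:
        for m in POINT_RE.finditer(line):
            writer.writerow(m.groups())

# Demo with in-memory files; use open() on the real ones.
src = io.StringIO('{"pt":[1.0,2.0,3.0]}\n{"pt":[4.5,5.5,6.5]}\n')
dst = io.StringIO()
json_to_csv(src, dst)
```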

2) In part 2, use pandas' read_csv() function to slurp the CSV files in directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
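Something like this, forcing the compact dtypes you want (in-memory CSVs stand in for your real files):

```python
import io

import pandas as pd

# Headerless CSVs as produced in part 1. dtype keeps every value at
# 4 bytes instead of a full Python object.
points_csv = io.StringIO("1.0,2.0,3.0\n4.5,5.5,6.5\n")
points = pd.read_csv(points_csv, header=None, dtype='float32')

edges_csv = io.StringIO("0,1\n1,2\n")
edges = pd.read_csv(edges_csv, header=None, dtype='uint32')

print(points.values.dtype)   # float32
print(edges.values.dtype)    # uint32
```

And if you'd rather not keep pandas around afterwards, `.values` hands you the underlying 2-D NumPy array.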


2 Comments

Actually, that's a great idea; it'll save a lot of time when processing needs to be re-run (which happens often during initial debugging). Thanks! (As long as the data structure pandas returns is memory-efficient. :) )
For the record, I'm skipping pandas - after messing around with it for a while I came to the conclusion that numpy's genfromtxt is much better.
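For anyone following that route, a minimal genfromtxt sketch for the same headerless CSV (again with an in-memory file standing in for the real one):

```python
import io

import numpy as np

# genfromtxt parses the headerless CSV straight into a compact 2-D array.
data = np.genfromtxt(io.StringIO("1.0,2.0,3.0\n4.5,5.5,6.5\n"),
                     delimiter=',', dtype=np.float32)
print(data.shape)    # (2, 3)
print(data.nbytes)   # 24: six float32 values at 4 bytes each
```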
