
I've read here that matplotlib is good at handling large data sets. I'm writing a data processing application, have embedded matplotlib plots into wx, and have found matplotlib to be TERRIBLE at handling large amounts of data, both in terms of speed and in terms of memory. Does anyone know a way to speed up matplotlib (or reduce its memory footprint) other than downsampling your inputs?

To illustrate how bad matplotlib is with memory, consider this code:

import numpy
import pylab

a = numpy.arange(int(1e7))  # only 10,000,000 integers: ~40 MB as 32-bit ints
                            # (most 64-bit platforms default to int64, ~80 MB)
# watch your system memory now...
pylab.plot(a)  # this uses over 230 ADDITIONAL MB of memory
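For reference, one way to reproduce that measurement yourself (a sketch, assuming the third-party memory_profiler package is installed; not part of the original question):

from memory_profiler import memory_usage
import numpy
import pylab

a = numpy.arange(int(1e7))
# sample the process's memory while pylab.plot runs
usage = memory_usage((pylab.plot, (a,)), interval=0.1)
print("additional memory: %.0f MiB" % (max(usage) - min(usage)))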
10 Comments
  • I've always downsampled. Why would you ever need to try to render 10M points on a graph? Commented Feb 12, 2011 at 4:34
  • matplotlib is slow. It is a known fact. For Qt I use the guiqwt package; maybe there is something like it for wx too. Commented Feb 12, 2011 at 15:59
  • @paul I just wanted to make it easy for my users to explore the data graphically, i.e. when they zoom, I didn't want to have to resample again depending on their zoom bounds; they would see the actual data no matter how they zoomed/panned (see the sketch after these comments). Commented Feb 12, 2011 at 18:53
  • If it's feasible, try not plotting things with lines connecting them... plt.plot(a, 'b.') will be much faster than the default plt.plot(a, 'b-'). Commented Feb 12, 2011 at 20:23
  • @Joe Kington My tests do not show dots to be faster or less memory intensive than lines. :( Commented Feb 13, 2011 at 0:34
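A minimal sketch of the zoom-aware resampling mentioned in the comment above, assuming the full-resolution data fits in a NumPy array; it uses matplotlib's xlim_changed axes callback, but the constants and structure are illustrative, not the asker's actual code:

import numpy as np
import matplotlib.pyplot as plt

MAX_POINTS = 10_000  # never hand the line artist more vertices than this

x = np.arange(int(1e7))
y = np.random.randn(int(1e7)).cumsum()

fig, ax = plt.subplots()
line, = ax.plot(x[::1000], y[::1000])  # coarse initial view

def on_xlim_changed(ax):
    # re-slice the full-resolution data to the visible range, then stride it
    # down so the artist always holds at most MAX_POINTS vertices
    lo, hi = ax.get_xlim()
    i0, i1 = np.searchsorted(x, [lo, hi])
    step = max((i1 - i0) // MAX_POINTS, 1)
    line.set_data(x[i0:i1:step], y[i0:i1:step])
    ax.figure.canvas.draw_idle()

ax.callbacks.connect('xlim_changed', on_xlim_changed)
plt.show()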

3 Answers


Downsampling is a good solution here: plotting 10M points consumes a lot of memory and time in matplotlib. If you know how much memory is acceptable, you can downsample based on that budget. For example, say 1M points takes 23 additional MB of memory and you find that acceptable in terms of space and time; then you should downsample so that the point count always stays below 1M:

import scipy.signal

MAX_POINTS = 1_000_000
if len(a) > MAX_POINTS:
    # decimate() low-pass filters before downsampling to limit aliasing
    a = scipy.signal.decimate(a, int(len(a) / MAX_POINTS) + 1)
pylab.plot(a)

Or something along those lines (the snippet above may downsample more aggressively than you'd like).
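If the anti-aliasing filter inside scipy.signal.decimate is more than you need, plain slicing gives the same point-count reduction with no SciPy dependency (a sketch, not part of the original answer; note that it simply drops the skipped samples, spikes included):

MAX_POINTS = 1_000_000
if len(a) > MAX_POINTS:
    step = len(a) // MAX_POINTS + 1
    a = a[::step]  # keeps every step-th sample; anything in between vanishes
pylab.plot(a)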


2 Comments

A simple decimation is inadequate, and is what matplotlib does internally as far as I can tell. The reason I don't simply want to decimate is that you lose the extreme values in each decimation interval. If the signal had a sharp spike within an interval, you wouldn't see it on the plot at all unless you were lucky with the interval boundaries. I wrote some code that does this more intelligently, taking the extreme values in each decimation interval instead of the value at the center (or edge) of the interval. I'm accepting your answer though, since this is in principle what I did.
David - if you solved this 'more intelligently', would you mind sharing? You can mark your own answers as 'solved' and may get a few upvotes...
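For reference, a minimal sketch of the spike-preserving decimation described in the comment above (the asker's actual code isn't shown in the thread, so the helper name and interface here are hypothetical):

import numpy as np

def minmax_decimate(y, max_points=1000):
    # keep the min and max of each interval so spikes survive decimation
    n = len(y)
    if n <= max_points:
        return np.arange(n), y
    bins = max_points // 2            # each interval contributes two points
    width = n // bins                 # trailing n % bins samples are dropped
    chunks = y[:bins * width].reshape(bins, width)
    offsets = np.arange(bins) * width
    idx = np.sort(np.concatenate([chunks.argmin(axis=1) + offsets,
                                  chunks.argmax(axis=1) + offsets]))
    return idx, y[idx]

# plot against the surviving indices so the x-axis stays meaningful
ix, yd = minmax_decimate(a, max_points=2000)
pylab.plot(ix, yd)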

I'm often interested in the extreme values too, so before plotting large chunks of data I proceed this way:

import numpy as np

s = np.random.normal(size=int(1e7))  # size must be an integer in current numpy
decimation_factor = 10
# group samples into rows of decimation_factor and keep the max of each group
s = np.max(s.reshape(-1, decimation_factor), axis=1)

# to check the final size
s.shape

Of course, np.max is just one example of an extreme-value function.

P.S. With numpy "stride tricks" it should be possible to avoid copying the data around during the reshape.
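A sketch of that stride-tricks idea using numpy.lib.stride_tricks.as_strided (worth noting: for a contiguous 1-D array the reshape above is already a view, so this mainly helps with non-contiguous inputs):

import numpy as np
from numpy.lib.stride_tricks import as_strided

s = np.random.normal(size=int(1e7))
d = 10  # decimation factor
# view s as (n_windows, d) rows without copying the underlying buffer
windows = as_strided(s, shape=(s.size // d, d),
                     strides=(s.strides[0] * d, s.strides[0]))
peaks = windows.max(axis=1)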



I was interested in preserving one side of a log-sampled plot, so I came up with this (downsample being my first attempt):

import numpy as np

def downsample(x, y, target_length=1000, preserve_ends=0):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    data = np.vstack((x, y))
    if preserve_ends > 0:
        # split off the first/last preserve_ends samples so they survive intact
        l, data, r = np.split(data, (preserve_ends, -preserve_ends), axis=1)
    interval = int(data.shape[1] / target_length) + 1
    data = data[:, ::interval]  # uniform striding of the middle section
    if preserve_ends > 0:
        data = np.concatenate([l, data, r], axis=1)
    return data[0, :], data[1, :]

def geom_ind(stop, num=50):
    # grow the requested count until geomspace yields ~num distinct integers
    geo_num = num
    ind = np.geomspace(1, stop, dtype=int, num=geo_num)
    while len(set(ind)) < num - 1:
        geo_num += 1
        ind = np.geomspace(1, stop, dtype=int, num=geo_num)
    return np.sort(list(set(ind) | {0}))

def log_downsample(x, y, target_length=1000, flip=False):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    data = np.vstack((x, y))
    if flip:
        data = np.fliplr(data)  # flip so the dense end of the geometric sampling swaps sides
    data = data[:, geom_ind(data.shape[1], num=target_length)]
    if flip:
        data = np.fliplr(data)
    return data[0, :], data[1, :]

which allowed me to better preserve one side of the plot:

import matplotlib.pyplot as plt

# x, y: the full-resolution data to plot
newx, newy = downsample(x, y, target_length=1000, preserve_ends=50)
newlogx, newlogy = log_downsample(x, y, target_length=1000)
f = plt.figure()
plt.gca().set_yscale("log")
plt.step(x, y, label="original")
plt.step(newx, newy, label="downsample")
plt.step(newlogx, newlogy, label="log_downsample")
plt.legend()

(plot comparing the original, downsample, and log_downsample traces)

