I'm doing data analysis that involves minimizing the least-square-error between a set of points and a corresponding set of orthogonal functions. In other words, I'm taking a set of y-values and a set of functions, and trying to zero in on the x-value that gets all of the functions closest to their corresponding y-value. Everything is being done in a 'data_set' class. The functions that I'm comparing to are all stored in one list, and I'm using a class method to calculate the total lsq-error for all of them:
self.fits = [np.poly1d(np.polyfit(self.x_data, self.y_data[n],10)) for n in range(self.num_points)]
def error(self, x, y_set):
arr = [(y_set[n] - self.fits[n](x))**2 for n in range(self.num_points)]
return np.sum(arr)
This was fine when I had significantly more time than data, but now I'm taking thousands of x-values, each with a thousand y-values, and that for loop is unacceptably slow. I've been trying to use np.vectorize:
#global scope
def func(f,x):
return f(x)
vfunc = np.vectorize(func, excluded=['x'])
…
…
#within data_set class
def error(self, x, y_set):
arr = (y_set - vfunc(self.fits, x))**2
return np.sum(arr)
func(self.fits[n], x) works fine as long as n is valid, and as far as I can tell from the docs, vfunc(self.fits, x) should be equivalent to
[self.fits[n](x) for n in range(self.num_points)]
but instead it throws:
ValueError: cannot copy sequence with size 10 to array axis with dimension 11
10 is the degree of the polynomial fit, and 11 is (by definition) the number of terms in it, but I have no idea why they're showing up here. If I change the fit order, the error message reflects the change. It seems like np.vectorize is taking each element of self.fits as a list rather than a np.poly1d function.
Anyway, if someone could either help me understand np.vectorize better, or suggest another way to eliminate that loop, that would be swell.
np.vectorizedoes not make things faster. If I recall an earlier SO, itsexcluded=['x']only works ifxis a keyword argument. So not only is it slower than plain iteration, it is harder to use correctly. It's useful only if you are broadcasting one input argument against another.