I have been a frequent lurker on Stack Overflow for some time and I tend to find very useful and clear information from here whenever I have coding questions. However, I can't really seem to find a thread that addresses my specific inquiry today.
Earlier today, I learned about vectorizing functions in Python in order to speed up computing time. I am currently trying to optimize a python program that I had written a little over a month ago. My program takes a text file containing data in the following format:
<magnitude> <dmagnitude> <exposure_number>
I then assign each column to lists mag, dmag, and expnum.
What I want to do is create a 2d array of the mag and dmag values that share the same expnum (having the same exposure number means that the mag and dmag point to the same data point).
I do this for all exposure numbers and, at the end, I take the median of the mag and dmag, and the standard deviation of the mag for each of the exposure number-based arrays and combine them all into one array that I can plot.
Currently, I have the following code:
from numpy import loadtxt,array,asarray,append,std,median,empty,take
data = loadtxt(infile,usecols=(0,1,2))
mag = data1[:,2].tolist()
dmag = data1[:,3].tolist()
expnum = data1[:,4].tolist()
#initialize variables
indexing = list()
master_mag = list()
master_dmag = list()
sub_mag = list()
sub_dmag = list()
mag_std = array([])
mag_stdmed = array([])
mag_med = array([])
while len(mag) > 0:
num=expnum[0]
for i in range(0,len(expnum)):
if expnum[i] == num:
sub_mag.append(mag[i])
sub_dmag.append(dmag[i])
indexing.append(i)
#add the sub lists to their master lists
master_mag.append(sub_mag)
master_dmag.append(sub_dmag)
sub_mag=list()
sub_dmag=list()
#remove from mag, dmag, and expnum the index referred to by indexing
while len(indexing) > 0:
mag.pop(indexing[-1])
dmag.pop(indexing[-1])
expnum.pop(indexing[-1])
indexing.pop()
#make the master mag and dmag lists into numpy arrays
master_mag=asarray(master_mag)
master_dmag=asarray(master_dmag)
#generate the mag and dmag median and mag std arrays
for i in range(0,len(master_mag)):
mag_std=append(mag_std,std(master_mag[i]))
mag_med=append(mag_med,median(master_mag[i]))
mag_stdmed=append(mag_stdmed,median(master_dmag[i]))
#create empty numpy arrays to be used for mag med vs. mag std
#and mag med vs. dmag med
med_std=empty([0,2])
med_dmed=empty([0,2])
#fill in those arrays
for i in range(0,len(mag_std)):
med_std=append(med_std,[[mag_med[i],mag_std[i]]],axis=0)
med_dmed=append(med_dmed,[[mag_med[i],mag_stdmed[i]]],axis=0)
#sort the median mag and dmag standard deviation arrays by median mag
order_med_std=med_std[:,0].argsort()
order_med_dmed=med_dmed[:,0].argsort()
sorted_med_std=take(med_std,order_med_std,0)
sorted_med_dmed=take(med_dmed,order_med_dmed,0)
And then I'm ready to plot sorted_med_dmed[:,0] vs. sorted_med_dmed[:,1] and sorted_med_std[:,0] vs. sorted_med_std[:,1]
This code works, it's just that I feel that it is too slow (especially when I get over 10,000 data points to work with). I want to try and vectorize this code to make it much quicker, but I have no idea where to begin.
I would like some help figuring out how to vectorize the matching-by-exposure-number component. I was thinking of creating a multi-dimensional array at the start that has the format: array([[[mag],[dmag]],...]) and a length equal to the number of different exposure numbers. Is there a way to generate and update an array like this in-line, without having to use a ton of loops?
Please let me know if you need any further clarity on what exactly this code is doing.
Thank you for your time.
multiprocessing. You can look at this example and figure out how you can re-write your code in this frame.