I need some help, suggestions, or guidance on how to optimize my code. The code works, but on the full data set it has been running for almost a day. My data has ~2 million rows; with sample data (a few thousand rows) it works. My sample data format is shown below:
index A B
0 0.163 0.181
1 0.895 0.093
2 0.947 0.545
3 0.435 0.307
4 0.021 0.152
5 0.486 0.977
6 0.291 0.244
7 0.128 0.946
8 0.366 0.521
9 0.385 0.137
10 0.950 0.164
11 0.073 0.541
12 0.917 0.711
13 0.504 0.754
14 0.623 0.235
15 0.845 0.150
16 0.847 0.336
17 0.009 0.940
18 0.328 0.302
What I want to do: Given the above data set, I want to bucket/bin each row into different buckets/bins based on the values of A and B. Each index can only lie in one bin. To do this I have discretized A and B from 0 to 1 (step size of 0.1). My bins for A look like this:
listA = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Similarly for B:
listB = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
So in total there are 10 * 10 = 100 bins: bin 1 = (A, B) = (0, 0), bin 2 = (0, 0.1), bin 3 = (0, 0.2), ..., bin 10 = (0, 1), bin 11 = (0.1, 0), ..., bin 20 = (0.1, 1), ..., bin 100 = (1, 1). Then, for each index, I check which bin it lies in by running the for loop shown below:
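import pandas as pd

df_output = pd.DataFrame()
for index in df.index:
    row = df.loc[[index]]  # one-row DataFrame for the current index
    sumlist = []
    # consecutive edges in listA / listB give the lower and upper bound of each bin
    for A_lo, A_hi in zip(listA[:-1], listA[1:]):
        for B_lo, B_hi in zip(listB[:-1], listB[1:]):
            filt_data = row[(row['A'] >= A_lo) & (row['A'] < A_hi) & (row['B'] >= B_lo) & (row['B'] < B_hi)]
            data_len = len(filt_data)  # 1 if the row lies in this bin, otherwise 0
            sumlist.append(data_len)  # append works in place and returns None, so no reassignment
    df_sumlist = pd.DataFrame([sumlist], index=[index])
    df_output = pd.concat([df_output, df_sumlist], axis=0)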
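For reference, the bin lookup that the loop performs can probably be done without any loop by computing each row's bin number arithmetically. This is only a rough sketch under assumptions that are not stated in the post: ten half-open intervals of width 0.1 per column, a value of exactly 1.0 placed in the last interval, and bins numbered row by row starting at 1 (so the numbers may not match the numbering used elsewhere in this question). The names `n_bins` and `interval_index` are made up for illustration.

```python
import numpy as np
import pandas as pd

# First five rows of the sample data from the question.
df = pd.DataFrame({
    "A": [0.163, 0.895, 0.947, 0.435, 0.021],
    "B": [0.181, 0.093, 0.545, 0.307, 0.152],
})

n_bins = 10  # assumption: ten intervals of width 0.1 per column


def interval_index(values, n_bins):
    """Return the 0-based interval of each value in [0, 1] (hypothetical helper)."""
    idx = np.floor(values.to_numpy() * n_bins).astype(int)
    # Clip so that a value of exactly 1.0 lands in the last interval.
    return np.clip(idx, 0, n_bins - 1)


a_idx = interval_index(df["A"], n_bins)
b_idx = interval_index(df["B"], n_bins)

# Assumed row-major numbering: bin 1 is A in [0, 0.1) and B in [0, 0.1).
df["bin"] = a_idx * n_bins + b_idx + 1
print(df)
```

Because everything is a whole-column operation, the cost grows with the number of rows only, instead of rows times 100 bins. Values that sit exactly on a bin edge may be affected by floating-point rounding, so those would need checking on the real data.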
I tried using the pandas cut function for binning, but it appears to work on only one column at a time.
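One way `pd.cut` might still be usable here is to call it once per column and then combine the two results into a single bin number. The sketch below assumes left-closed intervals (`right=False`) and a row-by-row numbering starting at 1. Both are assumptions for illustration, as are the names `edges`, `a_code` and `b_code`.

```python
import numpy as np
import pandas as pd

# First five rows of the sample data from the question.
df = pd.DataFrame({
    "A": [0.163, 0.895, 0.947, 0.435, 0.021],
    "B": [0.181, 0.093, 0.545, 0.307, 0.152],
})

edges = np.linspace(0, 1, 11)  # 0.0, 0.1, ..., 1.0

# labels=False returns the 0-based interval number instead of an Interval object.
# right=False makes the intervals left-closed: [0.0, 0.1), [0.1, 0.2), ...
# With these settings a value of exactly 1.0 is not assigned to any interval (NaN).
a_code = pd.cut(df["A"], bins=edges, labels=False, right=False)
b_code = pd.cut(df["B"], bins=edges, labels=False, right=False)

# Combine the two per-column codes into one bin number from 1 to 100.
df["bin"] = a_code * 10 + b_code + 1
print(df)
```

So `pd.cut` itself stays one-dimensional; the two-dimensional bin only appears when the two codes are combined.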
Expected output
index A B bin1 bin2 bin3 bin4 bin5 ... bin23 ... bin100
0 0.163 0.181 0 0 0 0 0 ... 1 ... 0
1 0.895 0.093
2 0.947 0.545
3 0.435 0.307
4 0.021 0.152
5 0.486 0.977
6 0.291 0.244
7 0.128 0.946
8 0.366 0.521
9 0.385 0.137
10 0.950 0.164
11 0.073 0.541
12 0.917 0.711
13 0.504 0.754
14 0.623 0.235
15 0.845 0.150
16 0.847 0.336
17 0.009 0.940
18 0.328 0.302
I do care about the other bins even if they are zero. For example, index 0 might lie in bin 23, so for index 0 I would have 1 in bin 23 and 0 in the other 99 bins. Similarly, index 1 might lie in bin 91, so it is expected to have 1 in bin 91 and 0 in all other bins.
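To make the wide layout concrete, here is a minimal sketch of how the 0/1 columns could be filled once each row's bin number is known. The `bin` values below are hypothetical placeholders, not computed results, and `n_total`, `onehot` and `bin_cols` are names chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.163, 0.895, 0.947],
    "B": [0.181, 0.093, 0.545],
    "bin": [23, 91, 96],  # hypothetical bin numbers in the range 1..100
})

n_total = 100  # all 100 bins get a column, even if no row falls in them

# Start from all zeros, then set exactly one 1 per row at that row's bin.
onehot = np.zeros((len(df), n_total), dtype=np.int8)
onehot[np.arange(len(df)), df["bin"].to_numpy() - 1] = 1

bin_cols = [f"bin{k}" for k in range(1, n_total + 1)]
df_output = pd.concat(
    [df[["A", "B"]], pd.DataFrame(onehot, columns=bin_cols, index=df.index)],
    axis=1,
)
print(df_output.iloc[:, :6])
```

With about 2 million rows, an `int8` array of this shape takes roughly 200 MB, which is why a small integer dtype is used in the sketch.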
Thanks for taking the time to read and help me with this, appreciate your help. Please let me know if I am missing anything or need to clarify things.