
I need some help, suggestions, or guidance on how to optimize my code. The code works, but on the full data set it has been running for almost a day. My data has ~2 million rows; with sample data (a few thousand rows) it works. My sample data format is shown below:

index   A    B
0   0.163   0.181
1   0.895   0.093
2   0.947   0.545
3   0.435   0.307
4   0.021   0.152
5   0.486   0.977
6   0.291   0.244
7   0.128   0.946
8   0.366   0.521
9   0.385   0.137
10  0.950   0.164
11  0.073   0.541
12  0.917   0.711
13  0.504   0.754
14  0.623   0.235
15  0.845   0.150
16  0.847   0.336
17  0.009   0.940
18  0.328   0.302

What I want to do: given the above data set, I want to bucket/bin each row into different buckets/bins based on the values of A and B. Each index can lie in only one bin. To do this I have discretized A and B from 0 to 1 (step size of 0.1). My bins for A look like this:

listA = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

Similarly for B:

listB = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

So in total there are 10 * 10 = 100 bins: bin 1 = (A, B) = (0, 0), bin 2 = (0, 0.1), bin 3 = (0, 0.2), ..., bin 10 = (0, 1), bin 11 = (0.1, 0), ..., bin 20 = (0.1, 1), ..., bin 100 = (1, 1). Then, for each index, I check which bin it lies in by running the for loop shown below:

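import pandas as pd

rows = []
for index in df.index:
    row = df.loc[[index]]  # one-row DataFrame for the current index
    sumlist = []
    # Walk consecutive bin edges: (low, high] for A, then for B
    for A_low, A_high in zip(listA[:-1], listA[1:]):
        for B_low, B_high in zip(listB[:-1], listB[1:]):
            filt_data = row[(row['A'] > A_low) & (row['A'] <= A_high) & (row['B'] > B_low) & (row['B'] <= B_high)]
            sumlist.append(len(filt_data))  # 1 if the row falls in this bin, else 0
    rows.append(sumlist)
df_output = pd.DataFrame(rows, index=df.index)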

I tried using the pandas cut function for binning, but it appears to work on only one column at a time.

Expected output

index   A         B    bin1   bin2 bin3 bin4 bin5 ...bin 23.. bin100    
    0   0.163   0.181   0      0     0   0    0           1     0
    1   0.895   0.093
    2   0.947   0.545
    3   0.435   0.307
    4   0.021   0.152
    5   0.486   0.977
    6   0.291   0.244
    7   0.128   0.946
    8   0.366   0.521
    9   0.385   0.137
    10  0.950   0.164
    11  0.073   0.541
    12  0.917   0.711
    13  0.504   0.754
    14  0.623   0.235
    15  0.845   0.150
    16  0.847   0.336
    17  0.009   0.940
    18  0.328   0.302

I do care about the other bins even if they are zero. For example, index 0 might lie in bin 23, so for index 0 I will have 1 in bin 23 and 0 in all other 99 bins. Similarly, index 1 might lie in bin 91, so I expect 1 in bin 91 and 0 in all other bins for that index.

Thanks for taking the time to read and help me with this, appreciate your help. Please let me know if I am missing anything or need to clarify things.

2 Answers

1

You were on the right track! pd.cut is the way to go. I'm using the Series categories to create your final bins:

import pandas as pd
import numpy as np

# Generate sample df
df = pd.DataFrame({'A': np.random.uniform(size=20), 'B': np.random.uniform(size=20)})

# Create bins for each column
df["bin_A"] = pd.cut(df["A"], bins=np.linspace(0, 1, 11))
df["bin_B"] = pd.cut(df["B"], bins=np.linspace(0, 1, 11))

# Create a combined bin using category codes for each binned column
df["combined_bin"] = df["bin_A"].cat.codes * 10 + df["bin_B"].cat.codes
df["combined_bin"] = pd.Categorical(df["combined_bin"], categories=range(100))

# Loop over categories to create new columns
for i in df["combined_bin"].cat.categories:
    df[f"bin_{i}"] = (df["combined_bin"] == i).astype(int)
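
One edge case may be worth checking in the snippet above: with default arguments, pd.cut builds right-closed intervals such as (0.0, 0.1], so a value of exactly 0.0 (or anything outside the edges) gets no bin and a category code of -1. The following is a minimal sketch, on made-up values, of how include_lowest=True plus an explicit validity mask could guard the combined index. The names edges, valid and combined are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

# Made-up rows that include the boundary values 0.0 and 1.0 and one out-of-range value
df = pd.DataFrame({"A": [0.0, 0.163, 1.0, 1.5], "B": [0.05, 0.181, 0.0, 0.5]})
edges = np.linspace(0, 1, 11)

# Default pd.cut: intervals are (low, high], so 0.0 is left unbinned (code -1)
default_codes = pd.cut(df["A"], bins=edges).cat.codes

# include_lowest=True closes the first interval on the left as well
codes_A = pd.cut(df["A"], bins=edges, include_lowest=True).cat.codes
codes_B = pd.cut(df["B"], bins=edges, include_lowest=True).cat.codes

# A code of -1 means "outside every bin"; keep such rows out of the combined index
valid = (codes_A >= 0) & (codes_B >= 0)
combined = (codes_A.astype("int64") * 10 + codes_B).where(valid, -1)

print(default_codes.tolist())
print(combined.tolist())
```

Without the mask, a row with code 1 in A and code -1 in B would be combined into 1 * 10 - 1 = 9, which is the index of a real bin, so unbinned rows could be miscounted rather than dropped.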

EDIT – Generalized solution: The important part here is defining all possible combinations of bins in both columns, using itertools.product:

import pandas as pd
import numpy as np
import itertools

df = pd.DataFrame({'A': np.random.uniform(size=20), 'B': np.random.uniform(size=20)})

# Change number of bins here or update the `bins` parameter
N_BINS_A = 10
N_BINS_B = 10
df["bin_A"] = pd.cut(df["A"], bins=np.linspace(0, 1, N_BINS_A + 1))
df["bin_B"] = pd.cut(df["B"], bins=np.linspace(0, 1, N_BINS_B + 1))

# Specify all possible bin combinations to use for columns
bin_A_bin_B_combinations = itertools.product(
    df['bin_A'].cat.categories, 
    df['bin_B'].cat.categories,
)

# Loop over possible combinations and mark matches
for i, (bin_A, bin_B) in enumerate(bin_A_bin_B_combinations):
    df[f"bin_{i}"] = (
        (df["bin_A"] == bin_A) & (df["bin_B"] == bin_B)
    ).astype(int)
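
As a possible alternative to looping over every combination, the combined bin can also be computed arithmetically from the category codes. This is a sketch under assumptions: the bin counts 4 and 5 are arbitrary illustrative values, the data are assumed to lie in [0, 1], and the names combined and one_hot are not from the answer. The multiplier is the number of B bins, which matters once the two columns have different bin counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame({"A": rng.uniform(size=20), "B": rng.uniform(size=20)})

N_BINS_A, N_BINS_B = 4, 5  # arbitrary illustrative values

codes_A = pd.cut(df["A"], bins=np.linspace(0, 1, N_BINS_A + 1), include_lowest=True).cat.codes
codes_B = pd.cut(df["B"], bins=np.linspace(0, 1, N_BINS_B + 1), include_lowest=True).cat.codes

# Row-major flattening: same ordering as itertools.product(bins_A, bins_B)
combined = codes_A.astype("int64") * N_BINS_B + codes_B

# A fixed category list keeps bins that received no rows as all-zero columns
combined_cat = pd.Series(
    pd.Categorical(combined, categories=range(N_BINS_A * N_BINS_B)),
    index=df.index,
)
one_hot = pd.get_dummies(combined_cat, prefix="bin", dtype="int8")

result = pd.concat([df, one_hot], axis=1)
```

With equal bin counts N in both columns this reduces to replacing 10 by N and 100 by N * N in the first snippet.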

3 Comments

Sure, the binning (with pd.cut) shouldn't cause any issues; just pay attention to the combined_bin column – if you have multiple bins in each column, you'll have to combine them in a different way
So if I replace 10 by N, 100 by N * N, and 11 by N + 1, will it be the correct implementation?
@SeasonedLeo Added a more general solution
0

You could probably use cut on each column and then combine the results to find the category of the row:

acat = pd.cut(df['A'], [.1*i for i in range(11)],
       labels = range(10), include_lowest=True)
bcat = pd.cut(df['B'], [.1*i for i in range(11)],
       labels = range(10), include_lowest=True)
cat = 1 + bcat.cat.codes + acat.cat.codes * 10

With your sample data, I get

0     12
1     81
2     96
3     44
4      2
5     50
6     23
7     20
8     36
9     32
10    92
11     6
12    98
13    58
14    63
15    82
16    84
17    10
18    34
dtype: int8

get_dummies and reindex will give the wide columns:

w = pd.get_dummies(cat).reindex(columns=list(range(1,101))).fillna(0).astype('int8')

We only have to concat it to the original dataframe:

pd.concat([df, w], axis=1)

to get as expected:

        index      A      B  1  2  3  4  5  6  ...  92  93  94  95  96  97  98  99  100
0       0  0.163  0.181  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
1       1  0.895  0.093  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
2       2  0.947  0.545  0  0  0  0  0  0  ...   0   0   0   0   1   0   0   0    0
3       3  0.435  0.307  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
4       4  0.021  0.152  0  1  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
5       5  0.486  0.977  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
6       6  0.291  0.244  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
7       7  0.128  0.946  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
8       8  0.366  0.521  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
9       9  0.385  0.137  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
10     10  0.950  0.164  0  0  0  0  0  0  ...   1   0   0   0   0   0   0   0    0
11     11  0.073  0.541  0  0  0  0  0  1  ...   0   0   0   0   0   0   0   0    0
12     12  0.917  0.711  0  0  0  0  0  0  ...   0   0   0   0   0   0   1   0    0
13     13  0.504  0.754  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
14     14  0.623  0.235  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
15     15  0.845  0.150  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
16     16  0.847  0.336  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
17     17  0.009  0.940  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
18     18  0.328  0.302  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0    0
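
For the ~2 million rows mentioned in the question, the same wide table could also be built directly in NumPy. The sketch below is not the method used above: it swaps pd.cut for np.digitize, whose bins are left-closed ([0, 0.1), [0.1, 0.2), ...), so values sitting exactly on an edge land in a different bin than with pd.cut. It also assumes all values lie in [0, 1]; the clip call would silently push anything outside that range into an edge bin. The row count of 1,000 is a stand-in, and a_idx, b_idx, bin_idx and wide are illustrative names.

```python
import numpy as np
import pandas as pd

n_rows = 1_000  # stand-in for the ~2 million rows in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.uniform(size=n_rows), "B": rng.uniform(size=n_rows)})

edges = np.linspace(0, 1, 11)

# np.digitize gives 1-based positions for in-range values; clip keeps A == 1.0 in the last bin
a_idx = np.clip(np.digitize(df["A"].to_numpy(), edges) - 1, 0, 9)
b_idx = np.clip(np.digitize(df["B"].to_numpy(), edges) - 1, 0, 9)
bin_idx = a_idx * 10 + b_idx  # 0-based bin number in 0..99

# Fill the one-hot matrix with a single fancy-indexed assignment
one_hot = np.zeros((n_rows, 100), dtype=np.int8)
one_hot[np.arange(n_rows), bin_idx] = 1

# Columns are labelled 1..100 to match the numbering used above
wide = pd.concat(
    [df, pd.DataFrame(one_hot, index=df.index, columns=range(1, 101))],
    axis=1,
)
```

At int8, a 2,000,000 x 100 matrix takes about 200 MB, so the dense form should fit in memory on most machines.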
