I have a dataframe:

import pandas as pd
df = pd.DataFrame({
    'PC1' : [0.035182, 0.001649, -0.080456, 0.056460, 0.017737, -0.005615, 0.033691, 0.547145, -0.022938, -0.059511], 
    'PC2': [0.034898, 0.001629, -0.083374, 0.053976, 0.017603,-0.005902, 0.006798, 0.250167, -0.137955, -0.313852], 
    'PC3': [0.032212, 0.001591, -0.067145, 0.047500, 0.015782, -0.003079, 0.012376, 0.302485, -0.063795, -0.124957], 
    'PC4' : [-0.000632,0.001268,0.063346,-0.026841,-0.009790,0.029897,-0.018870,-0.449655,0.081417,-0.327028], 
    'PC5' : [0.020340,0.001734,-0.050830,0.008507,0.007470,0.013534,0.100008,1.083280,0.298315,0.736401], 
    'PC6' : [0.027012,0.001507,-0.036496,0.032256,0.012207,0.005451,0.081582,0.959821,0.337683,0.758737], 
    'PC7' : [0.027903,0.001625,-0.041970,0.039854,0.014676,0.002364,0.045583,0.620938,0.116647,0.214294], 
    'PC8' : [0.013828,-0.015836,-0.117484,-0.208933,-0.162090,-0.190467,-0.075784,-0.481607,-0.213148,-0.401169], 
    'PC9' : [0.009378,0.002712,-0.148531,0.040901,0.011923,-0.000078,-0.055367,-0.661758,0.242363,-0.392438], 
    'PC10' : [-0.002740,-0.000234,0.060118,0.027855,0.016309,0.009850,-0.108481,-1.560047,0.198750,-0.793165], 
    'PC11' : [-2.876278,-0.437754,0.764775,-0.627843,0.391284,0.090675,-0.007820,0.342359,0.052004,-0.200808], 
    'PC12' : [-2.411929,-0.414697,0.415683,-0.426348,0.302643,-0.160550,-0.051552,1.086344,-0.275267,1.219304]
})

df.head()

I applied pd.qcut to each column of the dataframe. qcut is pandas' quantile-based discretization function; with 2 bins it splits each column at its median.

cuts = []

for col in df.columns:
    cuts.append(pd.qcut(df[col], 2, labels=None, retbins=False, precision=3, duplicates='raise'))

X = pd.concat(cuts, axis=1)
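For reference, here is a minimal sketch of what pd.qcut returns for a single column. The series s and its name PC_demo are made-up illustration values, not part of the data above. With q=2 the split point is the median, and retbins=True additionally returns the bin edges.

```python
import pandas as pd

# Hypothetical demo column (not taken from the question's dataframe).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name="PC_demo")

# retbins=True returns the binned values together with the bin edges.
binned, edges = pd.qcut(s, 2, retbins=True)

n_bins = binned.cat.categories.size      # number of interval categories
counts = binned.value_counts().tolist()  # rows that fell into each bin

print(binned.cat.categories)
print(edges)
```

Each category is a pd.Interval that is closed on the right, which is where the (a, b] form comes from.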

Then I want to collect the two unique intervals from each column (PC1, PC2, ..., PCn).

uniq = []
for i in X.columns:
    uniq.append(X[i].unique())

unique = pd.DataFrame(uniq)
unique

The result looks like this:

[Image: the contents of unique]

The unique variable holds two values per column, each an interval of the form (a, b].
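As a small sketch of how one such (a, b] value can be used: each entry is a pd.Interval whose left and right attributes hold a and b. The interval below is typed in by hand from the PC1 label used later in the post, and the three sample values are picked from the PC1 column; the names interval, values, flags and label are illustrative.

```python
import pandas as pd

# Interval copied by hand from the 'PC1:0.00969 - 0.547' label (an assumption
# that the rounded edges shown there are the ones stored in unique).
interval = pd.Interval(left=0.00969, right=0.547, closed="right")

# Three values taken from the PC1 column of the question's dataframe.
values = pd.Series([0.035182, 0.001649, 0.547145], name="PC1")

# Same test as the np.where(...) pattern: a < x <= b, expressed as 0/1.
flags = ((values > interval.left) & (values <= interval.right)).astype(int)

# Column label built from the column name and the interval edges.
label = f"{values.name}:{interval.left} - {interval.right}"

print(label, flags.tolist())
```

With these edges the largest PC1 value, 0.547145, lies just above the rounded right edge 0.547, so it is flagged 0 here. That is worth keeping in mind whenever the rounded edges are reused for comparisons.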

Next, I want to write a custom transformer class that creates new categorical dummy features.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# custom transformer class to create new categorical dummy features
class WoE_Binning(BaseEstimator, TransformerMixin):
    def __init__(self, X): # no *args or **kwargs
        self.X = X
    def fit(self, X, y = None):
        return self #nothing else to do
    def transform(self, X):
      
        X_new['PC1:0.00969 - 0.547'] = np.where((X['PC1'] > 0.00969) & (X['PC1'] <= 0.547), 1, 0)
        X_new['PC1:-0.0815 - 0.00969'] = np.where((X['PC1'] > -0.0815) & (X['PC1'] <= 0.00969), 1, 0)
        X_new['PC2:0.00421 - 0.25'] = np.where((X['PC2'] > 0.00421) & (X['PC2'] <= 0.25), 1, 0)
        X_new['PC2:-0.315 - 0.00421'] = np.where((X['PC2'] > -0.315) & (X['PC2'] <= 0.00421), 1, 0)
        X_new['PC3:0.00698 - 0.302'] = np.where((X['PC3'] > 0.00698) & (X['PC3'] <= 0.302), 1, 0)
        X_new['PC3:-0.126 - 0.00698'] = np.where((X['PC3'] > -0.126) & (X['PC3'] <= 0.00698), 1, 0)
        X_new['PC4:-0.00521 - 0.0814'] = np.where((X['PC4'] > -0.00521) & (X['PC4'] <= 0.0814), 1, 0)
        X_new['PC4:-0.451 - -0.00521'] = np.where((X['PC4'] > -0.451) & (X['PC4'] <= -0.00521), 1, 0)
        X_new['PC5:0.0169 - 1.083'] = np.where((X['PC5'] > 0.0169) & (X['PC5'] <= 1.083), 1, 0)
        X_new['PC5:-0.0518 - 0.0169'] = np.where((X['PC5'] > -0.0518) & (X['PC5'] <= 0.0169), 1, 0)
        X_new['PC6:-0.0375 - 0.0296'] = np.where((X['PC6'] > -0.0375) & (X['PC6'] <= 0.0296), 1, 0)
        X_new['PC6:0.0296 - 0.96'] = np.where((X['PC6'] > 0.0296) & (X['PC6'] <= 0.96), 1, 0)
        X_new['PC7:0.0339 - 0.621'] = np.where((X['PC7'] > 0.0339) & (X['PC7'] <= 0.621), 1, 0)
        X_new['PC7:-0.043 - 0.0339'] = np.where((X['PC7'] > -0.043) & (X['PC7'] <= 0.0339), 1, 0)
        X_new['PC8:-0.176 - 0.0138'] = np.where((X['PC8'] > -0.176) & (X['PC8'] <= 0.0138), 1, 0)
        X_new['PC8:-0.483 - -0.176'] = np.where((X['PC8'] > -0.483) & (X['PC8'] <= -0.176), 1, 0)
        X_new['PC9:0.00132 - 0.242'] = np.where((X['PC9'] > 0.00132) & (X['PC9'] <= 0.242), 1, 0)
        X_new['PC9:-0.663 - 0.00132'] = np.where((X['PC9'] > -0.663) & (X['PC9'] <= 0.00132), 1, 0)
        X_new['PC10:-1.561 - 0.00481'] = np.where((X['PC10'] > -1.561) & (X['PC10'] <= 0.00481), 1, 0)
        X_new['PC10:0.00481 - 0.199'] = np.where((X['PC10'] > 0.00481) & (X['PC10'] <= 0.199), 1, 0)
        X_new['PC11:-2.877 - 0.0221'] = np.where((X['PC11'] > -2.877) & (X['PC11'] <= 0.0221), 1, 0)
        X_new['PC11:0.0221 - 0.765'] = np.where((X['PC11'] > 0.0221) & (X['PC11'] <= 0.765), 1, 0)
        X_new['PC12:-2.413 - -0.106'] = np.where((X['PC12'] > -2.413) & (X['PC12'] <= -0.106), 1, 0)
        X_new['PC12:-0.106 - 1.219'] = np.where((X['PC12'] > -0.106) & (X['PC12'] <= 1.219), 1, 0)
        X_new.drop(columns = ref_categories, inplace = True)
        return X_new

Is there a faster, simpler way to take each (a, b] interval from the unique variable, together with the matching column name of X (PC1, PC2, ..., PCn), and plug them into a statement like this:

X_new['PC12:-0.106 - 1.219'] = np.where((X['PC12'] > a ) & (X['PC12'] <= b ), 1, 0) 
  • Thanks for making a new post! You do not need to pass X to transform as X should already exist as self.X, and if it is meant to be different, you should rename it to make that clear. Also, what is X_new meant to be? It would help if you explain the intended output. The values in X are (a,b], but you compare it to a single value? Do you mean to be making these columns in df and not X? Commented Nov 10, 2021 at 10:40

1 Answer

Given the dataframes df and unique (with one column of unique per column of df, in the same order), you could do:

X_new = pd.concat(
    (
        ((interval.left < df[col]) & (df[col] <= interval.right))
            .rename(f"{col}: {interval.left} - {interval.right}")
        for i, col in enumerate(df.columns) for interval in unique.iloc[:, i]
    ),
    axis=1
).astype(int)

or

X_new = pd.concat(
    (
        pd.cut(df[col], [interval.left, interval.right])
          .rename(f"{col}: {interval.left} - {interval.right}")
        for i, col in enumerate(df.columns) for interval in unique.iloc[:, i]
    ),
    axis=1
).notna().astype(int)

Result:

   PC1: 0.00969 - 0.547  ...  PC12: -0.106 - 1.219
0                     1  ...                     0
1                     0  ...                     0
2                     0  ...                     1
3                     1  ...                     0
4                     1  ...                     1
5                     0  ...                     0
6                     1  ...                     1
7                     0  ...                     1
8                     0  ...                     0
9                     0  ...                     0

[10 rows x 24 columns]
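If the interval edges do not have to appear in the column names in exactly this format, another option is to let pandas build the dummy columns straight from the binned frame. This is a sketch of a swapped-in technique (pd.get_dummies), not what the snippets above do, and the small df below is made-up demo data.

```python
import pandas as pd

# Made-up demo data with two columns.
df = pd.DataFrame({
    "PC1": [0.1, 0.2, 0.3, 0.4],
    "PC2": [-1.0, 0.0, 1.0, 2.0],
})

# Bin every column into two quantile-based intervals (categorical columns).
X = pd.DataFrame({col: pd.qcut(df[col], 2) for col in df.columns})

# One 0/1 column per (column, interval) pair; names look like 'PC1:(a, b]'.
dummies = pd.get_dummies(X, prefix_sep=":").astype(int)

print(dummies.columns.tolist())
print(dummies)
```

Because the dummies come from the bin assignment itself rather than from comparisons against rounded edges, every row gets exactly one 1 per original column.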

Alternatively, build unique so that its columns carry the column names, either this way

unique = pd.concat(
    (pd.DataFrame(X[col].unique(), columns=[col]) for col in X.columns),
    axis=1
)

or, if you don't need X, this way

unique = pd.DataFrame(
    {
        col: pd.qcut(
            df[col], 2, labels=None, retbins=False, precision=3, duplicates='raise'
        ).unique()
        for col in df.columns
    }
)

and then do

X_new = pd.concat(
    (
        ((interval.left < df[col]) & (df[col] <= interval.right))
            .rename(f"{col}: {interval.left} - {interval.right}")
        for col in unique.columns for interval in unique[col]
    ),
    axis=1
).astype(int)

etc.
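Since the question asks for a transformer class, the same idea could be wrapped up roughly as follows. This is only a sketch under assumptions: the class name QuantileDummies, the parameter q and the attribute edges_ are hypothetical, the bin edges are learned in fit with pd.qcut(..., retbins=True) instead of being typed in, and the column labels use the edges formatted to three significant digits.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileDummies(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: learn quantile edges in fit, emit 0/1 columns."""

    def __init__(self, q=2):
        self.q = q

    def fit(self, X, y=None):
        # Store the q+1 bin edges of every column.
        self.edges_ = {
            col: pd.qcut(X[col], self.q, retbins=True)[1] for col in X.columns
        }
        return self

    def transform(self, X):
        parts = []
        for col, edges in self.edges_.items():
            # Assign each value to a learned bin; include_lowest keeps the minimum.
            codes = pd.cut(X[col], edges, include_lowest=True).cat.codes
            for k in range(len(edges) - 1):
                name = f"{col}:{edges[k]:.3g} - {edges[k + 1]:.3g}"
                parts.append((codes == k).astype(int).rename(name))
        return pd.concat(parts, axis=1)


# Usage on made-up demo data.
demo = pd.DataFrame({"PC1": [1.0, 2.0, 3.0, 4.0], "PC2": [10.0, 20.0, 30.0, 40.0]})
out = QuantileDummies(q=2).fit_transform(demo)
print(out)
```

Values that fall outside the range seen in fit get no bin and therefore 0 in every dummy column of that feature; whether that is acceptable depends on the use case.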

1 Comment

Thanks, it really helps me to loop lots of values in my dataframe. I appreciate you doing this. I know Python is powerful in data processing.
