0

My question is how to simplify the following code. In my real data I would have to add 540 columns, but I guess there is a better way, especially for generating the columns. Maybe even separate dataframes?

Below you'll see the test-df I have with one column needed. 'bin_X_0' through 'bin_X_9' need to be generated, but then several, i.e. bin_Y_0, bin_Z_0 etc. through 9.

N = 10000
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h', 'd', 'a'] 
S = ['AR1','PO1','RU1']

np.random.seed(0)

df = pd.DataFrame(
    {'Y':np.random.uniform(1,10,N),
     'X':np.random.uniform(1,10,N),
     'Z':np.random.uniform(1,10,N),
     'J':np.random.choice(J,N),
     'S':np.random.choice(S,N),
     'R':np.random.choice(L,N)
    })

df['bins_X'] = df.groupby('S').X.apply(pd.qcut, q=10, labels=np.arange(10))

df['bin_X_0'] = np.where((df['bins_X'] ==0) & (df['R'] =='a'), (df['X']*2)-2, 
                        np.where((df['bins_X'] ==0) & (df['R'] !='a'), -2, 0))

df.head()

enter image description here

2 Answers 2

1

You can achieve this with a double for loop, by using current iterables and format function for the new column generation. The following code should do it:

for a in ["X", "Y", "Z"]:
    for num in list(range(0, 10)):
        df['bins_{}'.format(a)] = df.groupby('S')["{}".format(a)].apply(pd.qcut, q=10, labels=np.arange(10))

        df['bin_{}_{}'.format(a, num)] = np.where((df['bins_{}'.format(a)] == num) & (df['R'] == 'a'),
                                                  (df['{}'.format(a)] * 2) - 2,
                                                  np.where((df['bins_{}'.format(a)] == num) & (df['R'] != 'a'), -2, 0))

I hope it is clear, how current numbers and letters are dynamically picking up the appropriate data and generating two new columns inside every nested loop. The above code produces this dataframe, which I believe is what you are looking for:

 J  R    S         X         Y         Z bins_X   bin_X_0  bin_X_1  \
0  2014  d  PO1  7.734412  5.939322  4.529557      7  0.000000      0.0   
1  2014  h  AR1  2.621824  7.436704  1.370409      1  0.000000     -2.0   
2  2013  h  PO1  4.501208  6.424870  9.309705      4  0.000000      0.0   
3  2012  a  RU1  1.338402  5.903949  4.656115      0  0.676803      0.0   
4  2012  d  RU1  1.106090  4.812893  9.498540      0 -2.000000      0.0   

   bin_X_2   ...     bin_Z_0  bin_Z_1  bin_Z_2  bin_Z_3  bin_Z_4  bin_Z_5  \
0      0.0   ...         0.0      0.0      0.0     -2.0  0.00000      0.0   
1      0.0   ...        -2.0      0.0      0.0      0.0  0.00000      0.0   
2      0.0   ...         0.0      0.0      0.0      0.0  0.00000      0.0   
3      0.0   ...         0.0      0.0      0.0      0.0  7.31223      0.0   
4      0.0   ...         0.0      0.0      0.0      0.0  0.00000      0.0   

   bin_Z_6 bin_Z_7  bin_Z_8  bin_Z_9  
0      0.0     0.0      0.0      0.0  
1      0.0     0.0      0.0      0.0  
2      0.0     0.0      0.0     -2.0  
3      0.0     0.0      0.0      0.0  
4      0.0     0.0      0.0     -2.0 

I hope you get the idea.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks Marjan, looks good. I'll check. One thing though I forgot to mention. After.... format(a)] == it should be the corresponding bin. In the example it is 0, should be i.e bin_X_5.... 5
I replaced two zeros with num as you suggested. Otherwise, I haven't dug into the computations alone, so there might me still some number unchanged into variable. Feel free to edit my answer once you see the pattern.
0

the best way to add new columns if they are related to the data already in the DataFrame is to use pd.concat or pd.merge.

You can read more about it on http://pandas.pydata.org/pandas-docs/stable/merging.html

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.