
I'm trying to create binary data from an existing dataframe, but it's taking a very long time to complete. Is there any quicker way to accomplish this?

What I have now is a DataFrame with multiple rows, say df, e.g.:

Index   Actions Tries   Ratio
0       20      200     0.1
1       10      400     0.025
2       15      500     0.03
3       30      700     0.04

I now want to transform this data into binary data, say df_binary, e.g.:

Index_old   Index_new   Actions Tries   Ratio   Success
0           0           20      200     0.1     1
0           1           20      200     0.1     1
0           2           20      200     0.1     1
0           3           20      200     0.1     1
...
0           19          20      200     0.1     1  -> 20 times success (1)
0           20          20      200     0.1     0
0           21          20      200     0.1     0
0           22          20      200     0.1     0
...
0           199         20      200     0.1     0  -> 200 - 20 = 180 times fail (0)
1           200         10      400     0.025   1
1           201         10      400     0.025   1
1           202         10      400     0.025   1

As can be seen from the example above, Actions / Tries = Ratio. The number of times each row should be replicated is given by Tries, the number of rows with Success = 1 is given by Actions, and the number of rows with Success = 0 is given by Tries - Actions.
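
To make that rule concrete for the first row, the expansion looks like this (just a minimal numpy sketch to illustrate what I'm after, not the code I'm actually running):

import numpy as np

# first row of df: 200 tries, of which 20 were successful
actions, tries = 20, 200
success = np.repeat([1, 0], [actions, tries - actions])  # 20 ones followed by 180 zeros
print(success.sum(), len(success))  # -> 20 200

My current approach, which works but is far too slow, is: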

import pandas as pd

# create the new DataFrame
df_binary = pd.DataFrame()
# iterate over all rows in the original DataFrame (df)
for index, row in df.iterrows():
    # get the number of tries from the row in the df
    tries = row['Tries']
    # get the number of actions from the row in the df
    actions = row['Actions']
    # calculate the number of tries that did not result in an action
    noActions = tries - actions
    # create a temporary row used for appending
    tempDf = row.copy()

    # loop for the range given by tries (row['Tries']), e.g. loop 200 times
    for attempt in range(tries):
        if attempt < actions:
            # the first `actions` tries are successes, e.g. attempt 1 < 20 -> success, attempt 15 < 20 -> success
            tempDf['Success'] = 1
            # append new data to df_binary
            df_binary = df_binary.append(tempDf, ignore_index=True)
        else:
            # the remaining tries are failures, e.g. attempt 25 >= 20 -> failure, attempt 180 >= 20 -> failure
            tempDf['Success'] = 0
            # append new data to df_binary
            df_binary = df_binary.append(tempDf, ignore_index=True)

In this small example the runtime is fine, but my actual df_binary should contain about 15 million rows after completion, and the original data has many more columns, so it takes very long to complete.

Is there any way to do this faster?

Thanks!

1 Answer


Here is one potential way to achieve this, using numpy.concatenate, pandas.concat, Series.repeat, and DataFrame.assign with list comprehensions:

import numpy as np

# one Success flag per try: Actions ones followed by (Tries - Actions) zeros
successes = np.concatenate([[1] * a + [0] * (t - a) for a, t in zip(df['Actions'], df['Tries'])])

# repeat every column Tries times, then attach the success flags
df_binary = (pd.concat([df[s].repeat(df['Tries']) for s in df], axis=1)
             .assign(success=successes).reset_index())
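
As a quick sanity check on the small example frame from the question (reconstructed here just for illustration), the expanded frame has one row per try and the success flags per original row add up to Actions:

import numpy as np
import pandas as pd

# small frame mirroring the example data in the question
df = pd.DataFrame({'Actions': [20, 10, 15, 30],
                   'Tries': [200, 400, 500, 700],
                   'Ratio': [0.1, 0.025, 0.03, 0.04]})

successes = np.concatenate([[1] * a + [0] * (t - a) for a, t in zip(df['Actions'], df['Tries'])])
df_binary = (pd.concat([df[s].repeat(df['Tries']) for s in df], axis=1)
             .assign(success=successes).reset_index())

print(len(df_binary) == df['Tries'].sum())                    # True (1800 rows)
print(df_binary.groupby('index')['success'].sum().tolist())   # [20, 10, 15, 30]

Because everything is built with a single concat of already-repeated columns instead of appending row by row, this should scale to millions of rows far better than the loop in the question.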

1 Comment

This is great, thanks! My script had already been running for hours, and this piece of code did it in a few minutes.
