
I'm trying to create binary data from an existing dataframe, but it's taking a very long time to complete. Is there any quicker way to accomplish this?

What I have now is a DataFrame with multiple rows, say df, e.g.:

Index   Actions Tries   Ratio
0       20      200     0.1
1       10      400     0.025
2       15      500     0.03
3       30      700     0.04

I now want to transform this data into binary data, say df_binary, e.g.:

Index_old   Index_new   Actions Tries   Ratio   Success
0           0           20      200     0.1     1
0           1           20      200     0.1     1
0           2           20      200     0.1     1
0           3           20      200     0.1     1
...
0           19          20      200     0.1     1  -> 20 times success (1)
0           20          20      200     0.1     0
0           21          20      200     0.1     0
0           22          20      200     0.1     0
...
0           199         20      200     0.1     0  -> 200 - 20 = 180 times fail (0)
1           200         10      400     0.025   1
1           201         10      400     0.025   1
1           202         10      400     0.025   1

As can be seen from the example above, Actions / Tries = Ratio. The number of times each row should be replicated is given by Tries, the number of rows with Success = 1 is given by Actions, and the number of rows with Success = 0 is given by Tries - Actions.
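
To make that rule concrete for the first row, the expansion looks like this (just a minimal numpy sketch to illustrate what I'm after, not the code I'm actually running):

import numpy as np

# first row of df: 200 tries, of which 20 were successful
actions, tries = 20, 200
success = np.repeat([1, 0], [actions, tries - actions])  # 20 ones followed by 180 zeros
print(success.sum(), len(success))  # -> 20 200

My current approach, which works but is far too slow, is: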

import pandas as pd

# create the new DataFrame
df_binary = pd.DataFrame()
# iterate over all rows in the original DataFrame (df)
for index, row in df.iterrows():
    # get the number of tries from the row in the df
    tries = row['Tries']
    # get the number of actions from the row in the df
    actions = row['Actions']
    # calculate the number of tries that did not result in an action
    noActions = tries - actions
    # create a temporary row used for appending
    tempDf = row.copy()

    # loop for the range given by tries (row['Tries']), e.g. loop 200 times
    for attempt in range(tries):
        if attempt < actions:
            # the first `actions` tries are successes, e.g. attempt 1 < 20 -> success, attempt 15 < 20 -> success
            tempDf['Success'] = 1
            # append new data to df_binary
            df_binary = df_binary.append(tempDf, ignore_index=True)
        else:
            # the remaining tries are failures, e.g. attempt 25 >= 20 -> failure, attempt 180 >= 20 -> failure
            tempDf['Success'] = 0
            # append new data to df_binary
            df_binary = df_binary.append(tempDf, ignore_index=True)

In this small example the runtime is fine, but my actual df_binary should contain about 15 million rows after completion, and the original data has many more columns, so it takes very long to complete.

Is there any way to do this faster?

Thanks!

1 Answer


Here is one potential way to achieve this, using numpy.concatenate, pandas.concat, Series.repeat, and DataFrame.assign with list comprehensions:

import numpy as np

# one Success flag per try: Actions ones followed by (Tries - Actions) zeros
successes = np.concatenate([[1] * a + [0] * (t - a) for a, t in zip(df['Actions'], df['Tries'])])

# repeat every column Tries times, then attach the success flags
df_binary = (pd.concat([df[s].repeat(df['Tries']) for s in df], axis=1)
             .assign(success=successes).reset_index())
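
As a quick sanity check on the small example frame from the question (reconstructed here just for illustration), the expanded frame has one row per try and the success flags per original row add up to Actions:

import numpy as np
import pandas as pd

# small frame mirroring the example data in the question
df = pd.DataFrame({'Actions': [20, 10, 15, 30],
                   'Tries': [200, 400, 500, 700],
                   'Ratio': [0.1, 0.025, 0.03, 0.04]})

successes = np.concatenate([[1] * a + [0] * (t - a) for a, t in zip(df['Actions'], df['Tries'])])
df_binary = (pd.concat([df[s].repeat(df['Tries']) for s in df], axis=1)
             .assign(success=successes).reset_index())

print(len(df_binary) == df['Tries'].sum())                    # True (1800 rows)
print(df_binary.groupby('index')['success'].sum().tolist())   # [20, 10, 15, 30]

Because everything is built with a single concat of already-repeated columns instead of appending row by row, this should scale to millions of rows far better than the loop in the question.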

1 Comment

This is great, thanks! My script had already been running for hours, and this piece of code did it in a few minutes.
