
I have two dataframes. The first, df1, has 22 columns and holds the values for all variables. The second, good_models, holds the columns of interest that I want to extract from df1. For every row in good_models, I need to take the values of var1-var10 and target, keep only those columns from df1, rename them to var1-var10 and target, and append the result to a new dataframe called model_data_long.

I can perform this task with a for loop, but it is very slow, and I am hoping there is a more efficient way.

Potentially I could adapt a related Stack Overflow answer, but I am unsure how to apply it here.

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(100, 21)), columns=list(range(21)))
df1['target'] = np.random.randint(2, size=100)
### These need to be the columns in model_data_long
labels = ['var1','var2','var3','var4','var5','var6','var7','var8','var9','var10','target']
### Contains the columns I want to extract from df1 and append to model_data_long
good_models = pd.DataFrame.from_records([(0,1,2,3,4,5,6,7,8,9,'target'),
                                         (9,8,7,6,5,4,3,2,1,0,'target'),
                                         (20,19,18,17,16,15,14,13,12,11,'target')], columns=labels)
### works but is slow
model_data_long = pd.DataFrame()
for i in range(len(good_models)):
    ### Extract the values for one record of good_models
    t_list = good_models.loc[i].tolist()
    ### Keep only the columns from t_list in df1 (filter preserves the order of items)
    temp_data = df1.filter(items=t_list, axis=1)
    ### Rename the columns in temp_data
    temp_data.columns = labels
    ### It is imperative that I have an index variable in the model_data_long dataframe.
    ### Setting the model_index variable, critical.
    temp_data['model_index'] = i
    ### Finally, append to the long-running dataframe.
    model_data_long = model_data_long.append(temp_data, ignore_index=True)

2 Answers


You can use a very fast NumPy solution (thanks to Divakar for the idea):

#convert df1 to a numpy array
a = df1.values
#take the first 10 columns of good_models as a numpy array of column picks
b = good_models.iloc[:, :10].values
#fancy-index and reshape in numpy; use all column names except the last
df = pd.DataFrame(a[:, b].swapaxes(0,1).reshape(-1, b.shape[1]), columns=labels[:-1])

#add the remaining columns by repeating with tile and repeat
df['target'] = np.tile(df1['target'].values, len(good_models))
df['model_index'] = np.repeat(good_models.index, len(df1))
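The fancy-indexing step is the part that usually needs unpacking. A tiny standalone sketch (shapes shrunk for illustration) of what `a[:, b].swapaxes(0,1).reshape(...)` does:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)    # 4 rows, 3 columns, standing in for df1.values
b = np.array([[0, 2], [1, 0]])     # 2 "models", each picking 2 columns

# a[:, b] has shape (4, 2, 2): for every row of `a`, every column pick in `b`.
picked = a[:, b]
# swapaxes(0, 1) groups by model first; reshape then stacks the models
# vertically, giving (n_models * n_rows, n_cols_per_model) - the "long" layout.
stacked = picked.swapaxes(0, 1).reshape(-1, b.shape[1])
print(stacked.shape)  # (8, 2)
```

The first 4 rows of `stacked` are df1's columns 0 and 2 (the first model's picks), the next 4 rows are columns 1 and 0 (the second model's).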

Timings:

In [251]: %timeit (jez())
100 loops, best of 3: 2.47 ms per loop

In [252]: %timeit (orig())
1 loop, best of 3: 703 ms per loop

Setup:

import numpy as np
import pandas as pd
np.random.seed(452)
df1 = pd.DataFrame(np.random.randint(0, 100, size=(100, 21)), columns=list(range(21)))
df1['target'] = np.random.randint(2, size=100)
### These need to be the columns in model_data_long
labels = ['var1','var2','var3','var4','var5','var6','var7','var8','var9','var10','target']
### Contains the columns I want to extract from df1 and append to model_data_long
good_models = pd.DataFrame.from_records([(0,1,2,3,4,5,6,7,8,9,'target'),
                                         (9,8,7,6,5,4,3,2,1,0,'target'),
                                         (20,19,18,17,16,15,14,13,12,11,'target')], columns=labels)

#repeat the rows 100 times, giving 300 rows
good_models = pd.concat([good_models]*100).reset_index(drop=True)

Functions:

def orig():
    model_data_long = pd.DataFrame()
    for i in range(len(good_models)):
        ### Extract the values for one record of good_models
        t_list = good_models.loc[i].tolist()
        ### Keep only the columns from t_list in df1
        temp_data = df1.filter(items=t_list, axis=1)
        ### Rename the columns in temp_data
        temp_data.columns = labels
        ### It is imperative that I have an index variable in the model_data_long dataframe.
        ### Setting the model_index variable, critical.
        temp_data['model_index'] = i
        ### Finally, append to the long-running dataframe.
        model_data_long = model_data_long.append(temp_data, ignore_index=True)

    return model_data_long


def jez():
    #convert df1 to a numpy array
    a = df1.values
    #take the first 10 columns of good_models as a numpy array of column picks
    b = good_models.iloc[:, :10].values
    #fancy-index and reshape in numpy; use all column names except the last
    df = pd.DataFrame(a[:, b].swapaxes(0,1).reshape(-1, b.shape[1]), columns=labels[:-1])

    #add the remaining columns by repeating with tile and repeat
    df['target'] = np.tile(df1['target'].values, len(good_models))
    df['model_index'] = np.repeat(good_models.index, len(df1))
    return df

1 Comment

jezrael, first: this is lightning fast compared to what I was doing. I don't fully understand everything going on here, but I can tell it is accurate (with modifications to your original solution). With an additional function added to your jez() function, I group by model_index and apply the logit in one pass, instead of my usual for loop and subsetting. Results: 2 minutes for 10,000 models, versus around 10-15 minutes with my normal approach.
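The commenter's follow-up (fitting per model in one groupby pass) can be sketched roughly as below. This is a hypothetical illustration, not the commenter's actual code: `fit_one` is a stand-in for whatever logit fit you use (e.g. statsmodels or scikit-learn), applied here as a simple per-group statistic so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Toy long-format frame with two models, standing in for model_data_long.
rng = np.random.default_rng(0)
model_data_long = pd.DataFrame({
    'var1': rng.normal(size=6),
    'target': rng.integers(0, 2, size=6),
    'model_index': [0, 0, 0, 1, 1, 1],
})

def fit_one(g):
    # Placeholder for the real per-model fit: receives one model's rows
    # and returns whatever fit result/statistic you need.
    return g['var1'].mean()

# One pass over all models instead of a Python loop with subsetting.
results = model_data_long.groupby('model_index').apply(fit_one)
```

`results` is indexed by model_index, one entry per model.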

I figured out two ways of doing it without reaching for the multiprocessing package. I used a simple pandas trick; see if it works for you. I will share the other one later.

pd.concat([pd.DataFrame(df1[cols].values, columns=labels) for cols in good_models.values.tolist()], ignore_index=True)
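Note the one-liner drops the model_index column the question requires. A hedged variant (my addition, not part of the original answer): `pd.concat`'s `keys=` argument records which row of good_models each block came from, and `reset_index` turns it into a column.

```python
import numpy as np
import pandas as pd

# Small illustrative setup (labels shortened to 3 columns for brevity).
labels = ['var1', 'var2', 'target']
df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=[0, 1, 2, 'target'])
good_models = pd.DataFrame([(0, 1, 'target'), (2, 0, 'target')], columns=labels)

model_data_long = pd.concat(
    [pd.DataFrame(df1[cols].values, columns=labels) for cols in good_models.values.tolist()],
    keys=good_models.index,            # tag each block with its good_models row
    names=['model_index', None],       # name the new index level
).reset_index(level='model_index')     # turn the level into a regular column
```

This gives the same long layout plus a model_index column, without an explicit loop over append.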

