I have a large pandas DataFrame with the following structure:
import pandas as pd

data = {'id': [3, 5, 9, 12],
        'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
        'values': ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]
        }
df = pd.DataFrame(data)
df
Note that the names are not always in the same order, and not every name is present for each id. However, within each row the order of the values matches the order of the names.
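To make the per-row correspondence concrete, here is a minimal sketch using the last sample row (id 12). It assumes the strings contain no spaces around the commas, as in the sample data; the variable names are only illustrative.

```python
# Pair each name with the value at the same position in the same row (id 12).
names = "{name2,name1,name3}".strip("{}").split(",")
values = "{N,Y,Y}".strip("{}").split(",")

pairs = dict(zip(names, values))
print(pairs)
```

So for id 12, name2 maps to the first value, name1 to the second, and name3 to the third.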
I would like to turn this table into the following structure as efficiently as possible:
data = {'id': [3, 5, 9, 12],
        'name1': ["N", "N", "Y", "Y"],
        'name2': ["Y", " ", "N", "N"],
        'name3': ["N", "N", " ", "Y"],
        }
df = pd.DataFrame(data)
df
Currently I do this with the following subroutine, which goes through the df row by row, builds lists of the names and values, and writes each value into the matching new column. It gives the correct result but is very slow (estimated at ~14 hrs), because my df is large (~2e5 rows) and each row (id) can have up to 194 names, i.e. "{name1, name2, ..., name193, name194}".
def add_name_cols(df, title_col, value_col):
    nRows = len(df)
    for index, row in df.iterrows():  # parse rows and replace characters
        title_spl = row[title_col].replace('{', '').replace('}', '').split(',')
        value_spl = row[value_col].replace('{', '').replace('}', '').split(',')
        i = 0
        for t in title_spl:  # add value in correct column for this row
            print('Progress rows: {0:2.2f}%, Progress columns: {1:2.2f}%'.format(float(index)/float(nRows)*100, float(i)/float(194)*100), end='\r')
            df.loc[index, t] = value_spl[i]
            i += 1
    return df
df_new = add_name_cols(df, 'names', 'values')
df_new
Is there a way to accomplish this manipulation using more of Pandas' built-in methods that would expedite this process?
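For illustration, one possible direction is to split both string columns into lists, explode them together into a long table, and pivot back to wide. This is only a sketch under stated assumptions, not a benchmarked solution: it assumes pandas >= 1.3 (for multi-column `explode`), that each (id, name) pair occurs at most once, that names and values have equal length within each row, and that there are no spaces around the commas. The function name `widen_names` and its parameters are hypothetical.

```python
import pandas as pd

def widen_names(df, title_col="names", value_col="values", id_col="id", fill=" "):
    """Hypothetical helper: one column per name, built without a Python row loop."""
    long = df[[id_col]].copy()
    # Strip the surrounding braces and split each string into a list.
    long[title_col] = df[title_col].str.strip("{}").str.split(",")
    long[value_col] = df[value_col].str.strip("{}").str.split(",")
    # Explode both list columns in lockstep (assumes pandas >= 1.3).
    long = long.explode([title_col, value_col])
    # One row per id, one column per name; absent names become NaN.
    wide = long.pivot(index=id_col, columns=title_col, values=value_col)
    wide = wide.fillna(fill)
    wide.columns.name = None
    # pivot sorts by id, so restore the original row order.
    wide = wide.reindex(df[id_col].to_numpy()).rename_axis(id_col)
    return wide.reset_index()

sample = pd.DataFrame({
    "id": [3, 5, 9, 12],
    "names": ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
    "values": ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"],
})
out = widen_names(sample)
print(out)
```

Unlike `add_name_cols`, this sketch returns only the id column and the name columns; the original `names` and `values` columns could be joined back on id if they are still needed.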