I have a large pandas DataFrame with the following structure:

import pandas as pd

data = {'id': [3, 5, 9, 12], 
        'names': ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
        'values':["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"]
       }

df = pd.DataFrame(data)
df

Note that the names are not always in the same order, and not every name is included for each id; however, within each row the order of the values corresponds to the order of the names.

I would like to turn this table into the following structure as efficiently as possible:

data = {'id': [3, 5, 9, 12], 
        'name1': ["N", "N", "Y", "Y"],
        'name2': ["Y", " ", "N", "N"],
        'name3': ["N", "N", " ", "Y"],
       }

df = pd.DataFrame(data)
df

Currently I accomplish this with the following subroutine, which goes through the df row by row, builds lists of the names and values, and then writes those values into new columns. This works correctly but is very slow (estimated at ~14 hrs) because my df is large (~2e5 rows) and each row (id) can have up to 194 names, i.e. "{name1, name2, ..., name193, name194}".

def add_name_cols(df, title_col, value_col):
    nRows = len(df)
    for index,row in df.iterrows(): # parse rows and replace characters
        title_spl = [ i for i in row[title_col].replace('{','').replace('}','').split(',') ]
        value_spl = [ i for i in row[value_col].replace('{','').replace('}','').split(',') ]
        i = 0
        for t in title_spl: # add value in correct column for this row
            print('Progress rows: {0:2.2f}%, Progress columns: {1:2.2f}%'.format(float(index)/float(nRows)*100, float(i)/float(194)*100), end='\r')
            df.loc[index,t] = value_spl[i]
            i += 1
    return df

df_new = add_name_cols(df, 'names', 'values')
df_new

Is there a way to accomplish this manipulation using more of Pandas' built-in methods that would expedite this process?

1 Answer

Use string methods and the dict constructor inside a list comprehension.

(i) Convert df[['names','values']] to a list of lists.

(ii) Iterate over each pair (i.e. each row of df), use str.strip and str.split to turn the two strings into a pair of lists, then unpack them into zip and pass the result to the dict constructor.

(iii) Pass the resulting list of dictionaries to pd.DataFrame.
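To make step (ii) concrete, here is a minimal sketch that applies it to a single row, using the last row of the question's sample data. The variable names pair, names, values and row are only illustrative.

```python
# One [names, values] pair, as produced by step (i) for the row with id 12
pair = ["{name2,name1,name3}", "{N,Y,Y}"]

# Strip the surrounding braces and split on commas: one list per string
names, values = [x.strip('{}').split(',') for x in pair]

# zip pairs each name with the value at the same position; dict keeps that mapping
row = dict(zip(names, values))
print(row)  # {'name2': 'N', 'name1': 'Y', 'name3': 'Y'}
```

Because each row becomes a dict keyed by name, the order of the names within a row no longer matters once the dicts reach pd.DataFrame.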

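# Build one {name: value} dict per row, then let pandas align the dicts into columns
temp = pd.DataFrame(
    [dict(zip(*[x.strip('{}').split(',') for x in pair]))
     for pair in df[['names','values']].to_numpy().tolist()]
).fillna('')
# Column assignment aligns on the index, so this relies on df having the default RangeIndex
df[temp.columns] = temp
df = df.drop(['names','values'], axis=1)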

Output:

   id name1 name2 name3
0   3     N     Y     N
1   5     N           N
2   9     Y     N      
3  12     Y     N     Y
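
The same approach can be wrapped in a reusable function. The following is only a sketch under a few assumptions: expand_name_columns and its parameters are hypothetical names, none of the expanded names collides with an existing column such as id, and every row has as many values as names. It passes index=df.index so that the result also lines up when df does not have the default index.

```python
import pandas as pd

def expand_name_columns(df, names_col="names", values_col="values", fill=""):
    """Hypothetical helper: turn "{a,b}" / "{Y,N}" string pairs into one column per name."""
    # One {name: value} dict per row; zip relies on names and values sharing the same order
    records = [
        dict(zip(n.strip("{}").split(","), v.strip("{}").split(",")))
        for n, v in zip(df[names_col], df[values_col])
    ]
    # Reuse the original index so the join below aligns row by row
    wide = pd.DataFrame(records, index=df.index).fillna(fill)
    return df.drop(columns=[names_col, values_col]).join(wide)

# Sample data from the question, with a non-default index to exercise the alignment
df = pd.DataFrame(
    {
        "id": [3, 5, 9, 12],
        "names": ["{name1,name2,name3}", "{name1,name3}", "{name1,name2}", "{name2,name1,name3}"],
        "values": ["{N,Y,N}", "{N,N}", "{Y,N}", "{N,Y,Y}"],
    },
    index=[10, 20, 30, 40],
)
result = expand_name_columns(df)
print(result)
```

Returning a new DataFrame instead of assigning columns in place leaves the original df untouched, which makes it easier to compare the two while checking the result.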