I need to iterate over a dataframe and, for each row, create an ID based on two existing columns: name and sex. At the end I add the IDs to the df as a new column.
import pandas as pd

df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    # Print progress every 1000 rows
    if (index % 1000) == 0:
        print("Row node index: {}".format(str(index)))
    # Build the ID from the two columns of the current row
    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)
df['id'] = row_ids
Is there a way to make it much faster without going row by row?
Add more info based on suggested solutions:
Can you show get_id and a sample of the df?
It is unclear what the get_id call is supposed to do.
If id is just the concatenation of name and sex, you could do df['id'] = df['name'] + df['sex']. Instead of a function that does something to individual cells, see if you can do things with entire columns.
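To make the column-wise suggestion concrete, here is a minimal sketch. The toy data and the body of get_id are assumptions for illustration only, since the real function is not shown in the question. It covers two cases: the ID is a plain concatenation, or get_id has to stay an ordinary Python function.

```python
import pandas as pd

# Toy data standing in for the real file (hypothetical values).
df = pd.DataFrame({"name": ["alice", "bob"], "sex": ["F", "M"]})

# Case 1: the ID is a plain concatenation, so operate on whole columns.
df["id_concat"] = df["name"] + df["sex"]


# Case 2: get_id is arbitrary Python. This body is a hypothetical
# stand-in; the real get_id is not shown in the question.
def get_id(name, sex):
    return "{}_{}".format(name, sex)


# Zipping the two columns avoids the per-row Series that iterrows() builds.
df["id"] = [get_id(n, s) for n, s in zip(df["name"], df["sex"])]

print(df)
```

The second form still calls get_id once per row, so it only removes the iterrows() overhead. How much that helps depends on how expensive get_id itself is.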