How to optimize dataframe iteration in pandas?

Question

I need to iterate over a dataframe and, for each row, create an ID based on two existing columns: name and sex. I then add the IDs to the df as a new column.

import pandas as pd

df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    # Print progress every 1000 rows
    if (index % 1000) == 0:
        print("Row node index: {}".format(index))

    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)

df['id'] = row_ids

Is there a way to make it much faster without going row by row?

Add more info based on suggested solutions:

Could you include the function get_id and a sample of the df?
– yudhiesh, Commented Oct 4, 2021 at 1:57
It is a regular function that takes input and can return anything. It is only there as an example.
– marlon, Commented Oct 4, 2021 at 1:59
How is the id constructed? Include a small sample dataframe. I'm not sure what get_id(row['name', row['sex']]) is supposed to do.
– tdelaney, Commented Oct 4, 2021 at 1:59
@marlon pandas lets you perform operations in bulk. Suppose id is just the concatenation of name and sex. You could do df['id'] = df['name'] + df['sex']. Instead of a function that does something to individual cells, see if you can do things with entire columns.
– tdelaney, Commented Oct 4, 2021 at 2:03
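
To illustrate tdelaney's comment, here is a minimal sketch. It assumes the ID is simply the name and sex joined with an underscore; the real get_id logic is not shown in the question, so both that rule and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the question does not show the real file.
df = pd.DataFrame({"name": ["alice", "bob", "carol"], "sex": ["F", "M", "F"]})

# Column-wise string concatenation: one bulk operation, no Python-level row loop.
df["id"] = df["name"] + "_" + df["sex"]
print(df)
```

This only works when the ID rule can be written in terms of whole columns. If get_id does something that cannot be expressed that way, a per-row call is still needed.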

U13-Forward · Accepted Answer · 2021-10-04 02:23:31Z

2

Use apply instead:

def func(x):
    # x.name is the index label of the current row
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(x.name))

    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id

df['id'] = df.apply(func, axis=1)
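
A small runnable sketch of the same idea, with a hypothetical get_id (the asker's real function is not shown) and made-up sample data. Since apply with axis=1 still calls the Python function once per row, the sketch also shows a plain list comprehension over the two columns. That is a different technique from the one in this answer. It is often faster because no Series is built per row, but it is worth timing on your own data.

```python
import pandas as pd

def get_id(name, sex):
    # Hypothetical stand-in; the asker's real get_id is not shown.
    return f"{name}_{sex}"

# Hypothetical sample data.
df = pd.DataFrame({"name": ["alice", "bob", "carol"], "sex": ["F", "M", "F"]})

# Row-wise apply: pandas builds a Series for every row before calling the function.
ids_apply = df.apply(lambda row: get_id(row["name"], row["sex"]), axis=1)

# Plain comprehension over the two columns: same get_id calls, no per-row Series.
ids_zip = [get_id(name, sex) for name, sex in zip(df["name"], df["sex"])]

df["id"] = ids_zip
```

Both approaches produce the same IDs. They differ only in how much per-row overhead surrounds each get_id call.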

edited Oct 4, 2021 at 2:23

answered Oct 4, 2021 at 1:50


10 Comments

marlon Over a year ago

Does apply here return a list of new ids?

U13-Forward Over a year ago

@marlon Yes. Of course.

marlon Over a year ago

I will test the speed on a big csv file.

marlon Over a year ago

If my get_xxx() function returns multiple lists in the more general case, will it still work?

U13-Forward Over a year ago

@marlon Yes, apply returns a series of all the results.
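
For the multi-value case raised above, one possible sketch uses the result_type="expand" option of DataFrame.apply, which turns list-like results into columns. The function get_new_columns, its return values, and the column names are hypothetical.

```python
import pandas as pd

def get_new_columns(name, sex):
    # Hypothetical function returning two values per row.
    return [f"{name}_{sex}", len(name)]

# Hypothetical sample data.
df = pd.DataFrame({"name": ["alice", "bob"], "sex": ["F", "M"]})

# result_type="expand" spreads each returned list across columns 0, 1, ...
expanded = df.apply(
    lambda row: get_new_columns(row["name"], row["sex"]),
    axis=1,
    result_type="expand",
)
expanded.columns = ["id", "name_len"]

# Attach the new columns to the original frame (aligned on the index).
df = df.join(expanded)
```

Without result_type="expand", apply would return a single Series whose elements are the lists themselves.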


yudhiesh · Accepted Answer · 2021-10-04 05:10:07Z

0

If you are working with a large dataset, np.vectorize() can avoid some of the apply() overhead, so it may be a bit faster.

import numpy as np

# np.vectorize calls the lambda once per (name, sex) pair
v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])

Edit:

For a further speed-up, pass get_id directly instead of wrapping it in a lambda, and pass the underlying NumPy arrays (df['name'].values, df['sex'].values) instead of the Series themselves.

v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
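
A self-contained sketch of this variant, using a hypothetical get_id and made-up data. The NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so any gain comes from skipping pandas' per-row overhead, not from true vectorization. Passing otypes here is an assumption about what suits a string-returning function: it fixes the output dtype and avoids the extra trial call NumPy otherwise makes to infer it.

```python
import numpy as np
import pandas as pd

def get_id(name, sex):
    # Hypothetical stand-in; the asker's real get_id is not shown.
    return f"{name}_{sex}"

# Hypothetical sample data.
df = pd.DataFrame({"name": ["bob", "alice"], "sex": ["M", "F"]})

# otypes=[object] keeps the results as ordinary Python strings.
v = np.vectorize(get_id, otypes=[object])
df["id"] = v(df["name"].values, df["sex"].values)
```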

Instead of printing progress updates, try using tqdm to show a progress bar.

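import numpy as np
from tqdm import tqdm

# Note: without otypes, np.vectorize makes one extra call to infer the output type
@np.vectorize
def get_id(name, sex):
    # pbar is the progress bar created in the with block below
    global pbar
    ...
    pbar.update(1)
    ...
    return


with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)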

edited Oct 4, 2021 at 5:10

answered Oct 4, 2021 at 2:04


9 Comments

marlon Over a year ago

If get_id returns multiple lists, is v(df) a list of lists? get_id could be renamed to get_new_columns.

yudhiesh Over a year ago

@marlon Yes, it should; it basically works the same as apply(). Please try it out and let me know if it's faster for you.

marlon Over a year ago

I will test it for speed.

marlon Over a year ago

In your code, is 'df' missing in the 2nd line?

yudhiesh Over a year ago

@marlon No, the df doesn't need to be passed to v. I also added an improvement; could you check if it works?
