I need to iterate over a DataFrame and, for each row, create an ID based on two existing columns: name and sex. In the end I add the IDs to the df as a new column.

import pandas as pd

df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    # Print progress every 1000 rows
    if (index % 1000) == 0:
        print("Row node index: {}".format(index))

    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)

df['id'] = row_ids

Is there a way to make it much faster without going row by row?

Add more info based on suggested solutions:

  • Could you include the function get_id and a sample of the df? Commented Oct 4, 2021 at 1:57
  • It is a regular function that takes input and can return anything. It is just for example purposes. Commented Oct 4, 2021 at 1:59
  • How is the id constructed? Include a small sample dataframe. I'm not sure what get_id(row['name', row['sex']]) is supposed to do. Commented Oct 4, 2021 at 1:59
  • id=hash(name+sex) Commented Oct 4, 2021 at 2:01
  • @marlon - pandas lets you perform operations in bulk. Suppose id is just the concatenation of name and sex. You could do df['id'] = df['name'] + df['sex']. Instead of a function that does something to individual cells, see if you can do things with entire columns. Commented Oct 4, 2021 at 2:03
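
The column-wise idea from the comments could look roughly like the sketch below. The sample data and the "|" separator are made up for illustration, and since the real get_id is not shown, the three options are only possible substitutes for it, not the asker's actual logic.

```python
import pandas as pd

# Hypothetical sample data; the real columns come from the CSV in the question.
df = pd.DataFrame({"name": ["alice", "bob", "alice"], "sex": ["F", "M", "F"]})

# Column-wise concatenation: one bulk string operation instead of a row loop.
# The "|" separator is an assumption; it keeps ("ab", "c") distinct from ("a", "bc").
key = df["name"] + "|" + df["sex"]

# Option A: use the concatenated key itself as the ID.
df["id_key"] = key

# Option B: one unsigned 64-bit hash per row; equal (name, sex) pairs hash equally.
df["id_hash"] = pd.util.hash_pandas_object(df[["name", "sex"]], index=False)

# Option C: small integer codes, one per distinct key, in order of first appearance.
df["id_code"] = pd.factorize(key)[0]
```

Note that with missing values (the question reads the file with na_values=""), the concatenation in Option A yields NaN for that row, so those rows may need separate handling.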

2 Answers

Use apply instead:

def func(x):
    # x is one row (a Series); x.name is that row's index label
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(x.name))

    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id

df['id'] = df.apply(func, axis=1)

10 Comments

Does apply here return a list of new IDs?
@marlon Yes. Of course.
I will test the speed on a big csv file.
If my get_xxx() function returns multiple lists in more general cases, will it also work?
@marlon Yes, apply returns a Series of all the results.
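
For the multiple-return-values case raised in these comments, a minimal sketch might look like this. get_new_columns and its outputs are hypothetical stand-ins, since the real function is not shown.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["alice", "bob"], "sex": ["F", "M"]})

def get_new_columns(row):
    # Hypothetical stand-in for a function that returns several values per row.
    return row["name"].upper(), row["name"] + "_" + row["sex"]

# By default, apply collects the tuples into a single Series of tuples.
as_tuples = df.apply(get_new_columns, axis=1)

# With result_type="expand", each tuple element becomes its own column,
# which can then be assigned to several new columns at once.
df[["upper_name", "id"]] = df.apply(get_new_columns, axis=1, result_type="expand")
```

Whether to keep a Series of tuples or expand into columns depends on how the results are used afterwards; expanding is usually more convenient when each value is meant to be its own column.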

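If you are working with a large dataset, np.vectorize() can help bypass some of the apply() overhead, so it should be a bit faster. Note that np.vectorize is still essentially a Python-level loop, so the gain comes from skipping the per-row Series that apply builds, not from true vectorization.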

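import numpy as np

# Pass the two columns separately so the lambda receives one name and one sex per call
v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])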

Edit:

To get even more of a speedup, you could pass get_id directly instead of wrapping it in a lambda, and pass df.*.values instead of df.*.

v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
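
As a point of comparison, here is a small sketch that puts np.vectorize next to a plain list comprehension over zip. The get_id body and the sample data are hypothetical placeholders; which variant is faster on a real file would need to be measured.

```python
import numpy as np
import pandas as pd

def get_id(name, sex):
    # Hypothetical placeholder; the question does not show the real get_id.
    return f"{name}_{sex}"

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["alice", "bob"], "sex": ["F", "M"]})

# Plain Python loop over the two columns; no per-row Series is constructed.
df["id_zip"] = [get_id(n, s) for n, s in zip(df["name"], df["sex"])]

# np.vectorize over the underlying arrays; otypes=[object] keeps the results
# as Python objects, matching how pandas stores strings by default.
vectorized_get_id = np.vectorize(get_id, otypes=[object])
df["id_vec"] = vectorized_get_id(df["name"].to_numpy(), df["sex"].to_numpy())
```

Both variants call get_id once per row, so if get_id itself is slow, neither removes that cost.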

Instead of printing progress updates, try using tqdm to show a progress bar.

import numpy as np
from tqdm import tqdm

@np.vectorize
def get_id(name, sex):
    global pbar  # the progress bar created in the with-block below
    ...
    pbar.update(1)  # advance the bar once per call
    ...
    return


with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)
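
One detail that can affect a per-call progress counter: when otypes is not given, np.vectorize evaluates the function once on the first element to work out the output type, in addition to the regular pass. The sketch below counts calls with a plain dictionary instead of tqdm; get_id and the sample arrays are hypothetical.

```python
import numpy as np

calls = {"n": 0}

def get_id(name, sex):
    # Hypothetical placeholder that also counts how often it is called.
    calls["n"] += 1
    return name + "_" + sex

names = np.array(["alice", "bob", "carol"], dtype=object)
sexes = np.array(["F", "M", "F"], dtype=object)

# Without otypes: one probing call on the first element plus one call per element.
np.vectorize(get_id)(names, sexes)
calls_without_otypes = calls["n"]

# With otypes given, the probing call is skipped.
calls["n"] = 0
np.vectorize(get_id, otypes=[object])(names, sexes)
calls_with_otypes = calls["n"]
```

If the progress bar total is len(df), passing otypes (or accepting one extra tick) keeps the count consistent.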

9 Comments

If get_id returns multiple lists, is v(df) a list of lists? get_id could be renamed get_new_columns.
@marlon Yes, it should; it basically works the same as apply(). Please try it out and let me know if it's faster for you.
I will test it for speed.
In your code, is 'df' missing in the 2nd line?
@marlon No, the df doesn't need to be passed to v. Also, I added an improvement; could you check if it works?