Replace Slow Pandas Loop With Vectorized Function

Question

I have a loop in pandas that is really slow (ten-plus minutes). I am trying to replace it with a vectorized operation, but can't think of what to use. Multiple records have different household numbers but the same relationship group number. If a record's household number equals its relationship group number, I want to use that record's officer number and name for every record with that relationship group number (including records whose household number is different). See the code below:

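        import numpy as np  # pd.np was removed in pandas 2.0; use numpy directly

        # Start with empty result columns.
        rg['RG Officer Number'] = np.nan
        rg['RG Officer Name'] = np.nan
        for index, row in rg.iterrows():
            # A row whose household number equals its group number supplies the officer.
            if row['Relationship Group'] == row['Household Number']:
                # Rebuilding this mask for every matching row is what makes the loop slow.
                mask = rg['Relationship Group'] == row['Relationship Group']
                rg.loc[mask, 'RG Officer Number'] = row['Household Primary Officer Number']
                rg.loc[mask, 'RG Officer Name'] = row['Household Primary Officer Name']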

I tried the below, but I got an error (cannot use a single bool to index into setitem). I think I am completely off track. Maybe this is impossible with a vectorized function, but it seems it should not be.

        mask = row['Relationship Group'] == row['Household Number']
        rg.loc[mask, 'RG Officer Number'] = rg.loc['Household Primary Officer Number']

Any help you offer would be appreciated.

Could you provide us with a sample of data to work with? A few rows of your DataFrame should suffice.
– Ralubrusto, Commented Oct 9, 2020 at 21:51

cookesd · Accepted Answer · 2020-10-09 22:10:03Z

A filter and merge would work.

import numpy as np
import pandas as pd

# Sample data: ten households spread across five relationship groups.
df = pd.DataFrame({'Household Number':[str(i) for i in range(10)],
                   'Relationship Number':[str(i) for i in range(5)]*2,
                   'RG Officer Number':np.random.randint(1,100,10),
                   'RG Officer Name':['name'+str(i) for i in np.random.randint(1,100,10)]})

df
#  Household Number Relationship Number  RG Officer Number RG Officer Name
#0                0                   0                 28          name87
#1                1                   1                 18          name71
#2                2                   2                 69           name8
#3                3                   3                 83          name64
#4                4                   4                 88          name36
#5                5                   0                 25          name89
#6                6                   1                 51          name76
#7                7                   2                 29          name80
#8                8                   3                 61          name27
#9                9                   4                  2          name95


df_filtered = df.loc[df['Household Number'] == df['Relationship Number']]
df_filtered
#  Household Number Relationship Number  RG Officer Number RG Officer Name
#0                0                   0                 28          name87
#1                1                   1                 18          name71
#2                2                   2                 69           name8
#3                3                   3                 83          name64
#4                4                   4                 88          name36

df_merged = pd.merge(left=df,right=df_filtered[['Relationship Number','RG Officer Number','RG Officer Name']],
                     how='left',
                     on='Relationship Number',suffixes=('_old','_new'))

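In the merged data, the `_old` columns keep each row's own officer number and name, and the `_new` columns hold the officer number and name from the row in the same relationship group whose household number equals the relationship number.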
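For reference, here is a minimal sketch of the same filter-and-merge idea using the column names from the question rather than the sample frame above. The toy values, the `heads` variable name, and the `drop_duplicates` step are assumptions added for illustration. The deduplication only matters if a group can contain more than one row whose household number equals the group number; in that case the original loop would keep the last match.

```python
import pandas as pd

# Toy data using the column names from the question (values are made up).
rg = pd.DataFrame({
    'Household Number': ['100', '101', '102', '200', '201'],
    'Relationship Group': ['100', '100', '100', '200', '200'],
    'Household Primary Officer Number': [11, 12, 13, 21, 22],
    'Household Primary Officer Name': ['Ann', 'Bob', 'Cy', 'Dee', 'Eve'],
})

# Filter: keep only rows whose household number equals the group number.
heads = rg.loc[rg['Household Number'] == rg['Relationship Group'],
               ['Relationship Group',
                'Household Primary Officer Number',
                'Household Primary Officer Name']]

# Rename so the merged columns match the names the loop was filling in.
heads = heads.rename(columns={
    'Household Primary Officer Number': 'RG Officer Number',
    'Household Primary Officer Name': 'RG Officer Name',
})

# Assumption: keep one row per group so the merge cannot duplicate rows.
heads = heads.drop_duplicates(subset='Relationship Group', keep='last')

# Merge: copy each group's officer onto every row in that group.
result = rg.merge(heads, how='left', on='Relationship Group')
print(result)
```

Groups with no matching row end up with NaN in the two new columns, which matches what the loop leaves behind. Note that `merge` returns a frame with a fresh integer index, so a custom index on `rg` would need to be restored separately.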

Thanks, this does the trick and takes only a second to run. This is being scheduled to run daily along with some other scripts, so speed is very important.
