How to apply multiple functions to a pandas dataframe without multiple loops?

Question

I use pandas for an ETL process. I query a database I put the result in a dataframe; the dataframe is pretty big (1M rows * 50 columns). The dataframe is mainly composed of string and date.

I use the apply() function of pandas to make all my transformations. The thing is my transformation contains multiple branching on string.

df['merged_contract_status'] = df.apply(lib.merged_contract_status, axis=1) 
df['renewed'] = df.apply(lib.contract_renewed, axis=1)
df['renewal_dt'] = df.apply(lib.contract_renewed_date, axis=1)
....

I have a bunch of transformation like that. The functions I call:

def merged_contract_status(row):
    if row['tbm_contract_status'] is not np.nan:
        mergedContractStatus = row['tbm_contract_status']
    else:
        mergedContractStatus = row['ccn_status']

    return mergedContractStatus

def contract_renewed(row):
    ccn_activation_dt = row['ccn_activation_dt']
    ccn_status = row['ccn_status']
    f_init_val_dt = row['f_init_val_dt']
    f_nat_exp_dt = row['f_nat_exp_dt']
    spr_spo_code = row['spr_spo_code']  
    csp_status_sep_1 = row['csp_status_sep_1']
    csp_begin_dt_sep_1 = row['csp_begin_dt_sep_1']
    ctt_type_1 = row['ctt_type_1']  
    csp_status_sep_2 = row['csp_status_sep_2']
    csp_begin_dt_sep_2 = row['csp_begin_dt_sep_2']
    ctt_type_2 = row['ctt_type_2']  
    csp_status_sep_3 = row['csp_status_sep_3']
    csp_begin_dt_sep_3 = row['csp_begin_dt_sep_3']
    ctt_type_3 = row['ctt_type_3']
    csp_begin_dt_sep_father = row['csp_begin_dt_sep_father']
    csp_end_dt_sep_father = row['csp_end_dt_sep_father']
    todayDate = datetime.datetime.today()       

    if spr_spo_code == 'PCC':       
        if ctt_type_1 == 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_1 is not None and csp_begin_dt_sep_1>(ccn_activation_dt+timedelta(365)):
            return 'Y'
        elif ctt_type_2 == 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_2 is not None and csp_begin_dt_sep_2>(ccn_activation_dt+timedelta(365)):
            return 'Y'
        elif ctt_type_3== 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_3 is not None and csp_begin_dt_sep_3>(ccn_activation_dt+timedelta(365)):
            return 'Y'
        else:
            return 'N'
    else:
        if (f_init_val_dt is None and f_nat_exp_dt is None and 
            ccn_activation_dt is not None and 
            ccn_activation_dt < (todayDate- timedelta(365)) and 
            (csp_begin_dt_sep_father <= todayDate and csp_begin_dt_sep_father >= todayDate and ccn_status=='ACTIVATED')):
            return 'Y'
        elif f_init_val_dt is not None and f_nat_exp_dt is not None and f_init_val_dt <= todayDate and f_nat_exp_dt >= todayDate and ccn_status=='ACTIVATED' and ccn_activation_dt is not None and ccn_activation_dt < (todayDate- timedelta(365)):
            return 'Y'
        else:
            return 'N'

Each time I call apply on my df, pandas loops through the entire df and it takes around 10 min. I'm OK with that 10 min; I know I cannot improve performance. But is there a way to avoid multiple loop? Can pandas loop only once and do all the transformation I want?

EDIT: It's hard to give you data since the dataframe is big and it is built with sql query. The help I want is a way to loop only once through the dataframe, I do not want to improve each function (taht's kind of impossible with those branching on string)

Thanks

It would help if you showed what your function is doing, some sample data, code to reproduce your df and the desired df — EdChum
– EdChum, Commented Jan 7, 2016 at 15:47
Perhaps a .groupby().agg() method would help? For example: stackoverflow.com/questions/22128218/… — crow_t_robot
– crow_t_robot, Commented Jan 7, 2016 at 16:31

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

Python and Pandas can only do one thing at a time. Your functions does a lot of things though and you could build them together to one function. But that is not actually your problem here.

Applying functions with axis=1 is super heavy, because you are iterating over the whole dataframe.

You are using pandas, but you are not using pandas.

You should rewrite the functions you apply into vectorized operations.

It looks like the .loc method could help you out a lot

Index your dataframe on anything you are using in the if statements

df.set_index(['spr_spo_code', 'ctt_type_1', 'ccn_activation_dt'], inplace=True)

and so on.

Then you can use the .loc, but first you should create a column where you want the result.

df['Contract_renewed'] = 'N'
df.loc['PCC', 'NORMAL', None ..., 'Contract_renewed'] = 'Y'

And this will be crazy much faster than your current way of operation.

Tiny example

>>> df = pd.DataFrame({'foo':[1,2,3], 'bar':['baz','boo','bee'], 'baz':['N']*3})
>>> df.set_index(['bar', 'foo'], inplace=True)
>>> df.loc[('baz', 1), 'baz'] = 'Y'
>>> df
        baz
bar foo    
baz 1     Y
boo 2     N
bee 3     N

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Jan 7, 2016 at 16:06

firelynx

32.5k10 gold badges94 silver badges104 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Thomas Over a year ago

Yes I agree with that and I know that. But as you can guess it's impossible to rewrite my function into vectorized operation since I deal with string.

firelynx Over a year ago

it's not impossible. Pandas provides lots of nice vectorised operations for strings too. pandas.pydata.org/pandas-docs/version/0.15.0/…

Collectives™ on Stack Overflow

How to apply multiple functions to a pandas dataframe without multiple loops?

1 Answer 1

Tiny example

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Tiny example

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related