I use pandas for an ETL process. I query a database I put the result in a dataframe; the dataframe is pretty big (1M rows * 50 columns). The dataframe is mainly composed of string and date.
I use the apply() function of pandas to make all my transformations. The thing is my transformation contains multiple branching on string.
df['merged_contract_status'] = df.apply(lib.merged_contract_status, axis=1)
df['renewed'] = df.apply(lib.contract_renewed, axis=1)
df['renewal_dt'] = df.apply(lib.contract_renewed_date, axis=1)
....
I have a bunch of transformation like that. The functions I call:
def merged_contract_status(row):
if row['tbm_contract_status'] is not np.nan:
mergedContractStatus = row['tbm_contract_status']
else:
mergedContractStatus = row['ccn_status']
return mergedContractStatus
def contract_renewed(row):
ccn_activation_dt = row['ccn_activation_dt']
ccn_status = row['ccn_status']
f_init_val_dt = row['f_init_val_dt']
f_nat_exp_dt = row['f_nat_exp_dt']
spr_spo_code = row['spr_spo_code']
csp_status_sep_1 = row['csp_status_sep_1']
csp_begin_dt_sep_1 = row['csp_begin_dt_sep_1']
ctt_type_1 = row['ctt_type_1']
csp_status_sep_2 = row['csp_status_sep_2']
csp_begin_dt_sep_2 = row['csp_begin_dt_sep_2']
ctt_type_2 = row['ctt_type_2']
csp_status_sep_3 = row['csp_status_sep_3']
csp_begin_dt_sep_3 = row['csp_begin_dt_sep_3']
ctt_type_3 = row['ctt_type_3']
csp_begin_dt_sep_father = row['csp_begin_dt_sep_father']
csp_end_dt_sep_father = row['csp_end_dt_sep_father']
todayDate = datetime.datetime.today()
if spr_spo_code == 'PCC':
if ctt_type_1 == 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_1 is not None and csp_begin_dt_sep_1>(ccn_activation_dt+timedelta(365)):
return 'Y'
elif ctt_type_2 == 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_2 is not None and csp_begin_dt_sep_2>(ccn_activation_dt+timedelta(365)):
return 'Y'
elif ctt_type_3== 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_3 is not None and csp_begin_dt_sep_3>(ccn_activation_dt+timedelta(365)):
return 'Y'
else:
return 'N'
else:
if (f_init_val_dt is None and f_nat_exp_dt is None and
ccn_activation_dt is not None and
ccn_activation_dt < (todayDate- timedelta(365)) and
(csp_begin_dt_sep_father <= todayDate and csp_begin_dt_sep_father >= todayDate and ccn_status=='ACTIVATED')):
return 'Y'
elif f_init_val_dt is not None and f_nat_exp_dt is not None and f_init_val_dt <= todayDate and f_nat_exp_dt >= todayDate and ccn_status=='ACTIVATED' and ccn_activation_dt is not None and ccn_activation_dt < (todayDate- timedelta(365)):
return 'Y'
else:
return 'N'
Each time I call apply on my df, pandas loops through the entire df and it takes around 10 min. I'm OK with that 10 min; I know I cannot improve performance. But is there a way to avoid multiple loop? Can pandas loop only once and do all the transformation I want?
EDIT: It's hard to give you data since the dataframe is big and it is built with sql query. The help I want is a way to loop only once through the dataframe, I do not want to improve each function (taht's kind of impossible with those branching on string)
Thanks