I have a custom metric implemented as follows:
from datetime import timedelta

import pandas as pd


def metric(pred: pd.DataFrame, valid: pd.DataFrame) -> float:
    date_begin = valid.dt.min()
    date_end = valid.dt.max()
    # Date of the first positive label in valid.
    x = valid[valid.label == 1].dt.min()

    # p: positives within 30 days of the first positive ...
    p_n_tpp_df = valid[(valid.dt >= x) &
                       (valid.dt <= x + timedelta(days=30)) &
                       (valid.label == 1)]
    # ... over positives in the validation window shifted forward 30 days.
    p_n_pp_df = valid[(valid.dt >= date_begin + timedelta(days=30)) &
                      (valid.dt <= date_end + timedelta(days=30)) &
                      (valid.label == 1)]
    # Count rows of pred whose serial_number appears in each filtered frame.
    p_n_tpp = len([x for x in pred.serial_number.values
                   if x in p_n_tpp_df.serial_number.unique()])
    p_n_pp = len([x for x in pred.serial_number.values
                  if x in p_n_pp_df.serial_number.unique()])
    p = p_n_tpp / p_n_pp
    print('p: ', p)

    # r: positives in the window shifted back 30 days ...
    p_n_tpr_df = valid[(valid.dt >= date_begin - timedelta(days=30)) &
                       (valid.dt <= date_end - timedelta(days=30)) &
                       (valid.label == 1)]
    # ... over positives in the unshifted validation window.
    p_n_pr_df = valid[(valid.dt >= date_begin) &
                      (valid.dt <= date_end) &
                      (valid.label == 1)]
    p_n_tpr = len([x for x in pred.serial_number.values
                   if x in p_n_tpr_df.serial_number.unique()])
    p_n_pr = len([x for x in pred.serial_number.values
                  if x in p_n_pr_df.serial_number.unique()])
    r = p_n_tpr / p_n_pr
    print('r: ', r)

    # Harmonic mean of p and r.
    m = 2 * p * r / (p + r)
    return m
pred and valid are both DataFrames with the same columns, and their dt ranges do not overlap.
The serial_number values in valid are a subset of the serial_number values in pred.
The label column takes only two values: 0 or 1.
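For anyone who wants to reproduce this at a small scale, the constraints above could be mimicked with synthetic data along these lines. This is only a sketch: make_sample is a hypothetical helper, and the date ranges, row count, and number of serials are assumptions chosen to resemble the samples in this post, not the real data.

```python
import numpy as np
import pandas as pd


def make_sample(n_rows=1000, n_serials=200, seed=0):
    """Return (pred, valid) frames that satisfy the constraints described above."""
    rng = np.random.default_rng(seed)

    def frame(start, serial_pool):
        # Random day offsets in [0, 29] from the start date.
        offsets = pd.to_timedelta(rng.integers(0, 30, size=n_rows), unit="D")
        return pd.DataFrame({
            "serial_number": rng.choice(serial_pool, size=n_rows).astype("int32"),
            "dt": pd.Timestamp(start) + offsets,
            "label": rng.integers(0, 2, size=n_rows).astype("int8"),
        })

    # Assumed ranges: pred covers March 2011, valid covers April 2011,
    # so the two dt ranges cannot overlap.
    pred = frame("2011-03-01", np.arange(n_serials))
    # Drawing valid's serials from pred's keeps them a subset of pred's.
    valid = frame("2011-04-01", pred["serial_number"].unique())
    return pred, valid


pred, valid = make_sample()
print(pred.head(3))
print(valid.head(3))
```

Raising n_rows gradually should show how the running time of metric grows with the input size.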
Here is a sample of pred and valid:
print(pred.head(3))
   serial_number         dt  label
0            123 2011-03-21      1
1             52 2011-03-22      0
2             12 2011-03-01      1
..., ...
print(pred.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number    int32
dt               datetime64[ns]
label            int8
..., ...
print(valid.head(3))
   serial_number         dt  label
0            324 2011-04-22      1
1             52 2011-04-22      0
2             14 2011-04-01      1
..., ...
print(valid.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number    int32
dt               datetime64[ns]
label            int8
Each input DataFrame has about 10,000,000 rows and 3 columns.
Computing this metric is really slow: it takes more than 2 hours on an Intel 9600KF.
How can I optimize this code to reduce its running time?
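My guess is that the list comprehensions are the hotspot: each `x in ....unique()` test scans a NumPy array linearly, and `unique()` is recomputed for every element of pred. A minimal sketch of a vectorized replacement, assuming the intended count is the number of rows in pred whose serial_number appears in the filtered frame. count_members is a hypothetical helper name and the data below is made up purely for the comparison:

```python
import pandas as pd


def count_members(pred_serials: pd.Series, positives: pd.Series) -> int:
    """Count rows of pred_serials whose value occurs in positives.

    Gives the same count as
    len([s for s in pred_serials.values if s in positives.unique()])
    but relies on Series.isin, which uses hashing rather than a
    linear scan per element.
    """
    return int(pred_serials.isin(positives.unique()).sum())


# Tiny made-up data, only to check that both approaches agree.
pred_serials = pd.Series([123, 52, 12, 52, 7], dtype="int32")
positives = pd.Series([52, 12, 12, 99], dtype="int32")

slow = len([s for s in pred_serials.values if s in positives.unique()])
fast = count_members(pred_serials, positives)
print(slow, fast)
```

If this is right, each of the four counts in metric could become a call such as count_members(pred.serial_number, p_n_tpp_df.serial_number), but I have not timed it on the full data.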
Thanks in advance.