How to optimize such codes as follows in python?

Question

I have a user-own metric to implement as follows:

def metric(pred:pd.DataFrame(), valid:pd.DataFrame()):
    date_begin = valid.dt.min()
    date_end = valid.dt.max()
    x = valid[valid.label == 1].dt.min()

    # p
    p_n_tpp_df = valid[(valid.dt >= x) &\
                       (valid.dt <= x + timedelta(days=30)) &\
                       (p_n_tpp_df.label == 1)]
    p_n_pp_df =  valid[(valid.dt >= date_begin + timedelta(days=30)) &\ 
                       (valid.dt <= date_end + timedelta(days=30)) &\
                       (p_n_tpp_df.label == 1)]


    p_n_tpp = len([x for x in pred.serial_number.values\ 
                     if x in p_n_tpp_df.serial_number.unique()])
    p_n_pp = len([x for x in pred.serial_number.values\ 
                    if x in p_n_pp_df.serial_number.unique()])

    p = p_n_tpp / p_n_pp
    print('p: ', p)

    # r
    p_n_tpr_df = valid[(valid.dt >= date_begin - timedelta(days=30)) &\ 
                      (valid.dt <= date_end - timedelta(days=30)) &\
                      (p_n_tpr_df.label == 1)]
    p_n_pr_df = valid[(valid.dt >= date_begin) &\ 
                      (valid.dt <= date_end) &\ 
                      (p_n_pr_df.label == 1)]


    p_n_tpr = len([x for x in pred.serial_number.values\
                     if x in p_n_tpr_df.serial_number.unique()])
    p_n_pr = len([x for x in pred.serial_number.values\
                    if x in p_n_pr_df.serial_number.unique()])

    r = p_n_tpr / p_n_pr
    print('p: ', r)

    m = 2 * p * r / (p + r)

    return m

The pd.DataFrame() of pred and valid have the same columns and dt has no intersections.
And the all the values of serial_number in valid is a subset of all the values of serial_number in pred.
The label column only has 2 values: 0 or 1.
Here is the sample of pred and valid is as follows:


print(pred.head(3))
    serial_number  dt          label  
0   123            2011-03-21  1
1   52             2011-03-22  0
2   12             2011-03-01  1
..., ...


print(pred.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetimes64[ns]
label          int8
..., ...

print(valid.head(3))
    serial_number  dt          label  
0   324            2011-04-22  1
1   52             2011-04-22  0
2   14             2011-04-01  1
..., ...


print(valid.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetimes64[ns]
label          int8

And the size of input pd.DataFrame is about 10, 000, 000 samples and 3 features.
When I try to use it to calculate this metric, it is really slow and time spending is more than 2 hours on Intel 9600KF.
So I am wondering how to optimize such code on time cost.
Thanks in advance.

@ItamarMushkin The sample of dataset is just like the print(pred.head(3)) output. And number is same as the serial_number, I just correct it. — Bowen Peng
– Bowen Peng, Commented Mar 8, 2020 at 13:18
To help you out, we need a sample of both pred and valid — Itamar Mushkin
– Itamar Mushkin, Commented Mar 8, 2020 at 15:09
@ItamarMushkin I have updated the details of it. Thanks for your tips. — Bowen Peng
– Bowen Peng, Commented Mar 8, 2020 at 17:26

hume · Accepted Answer · 2020-03-12 16:26:18Z

6

+50

Here is the biggest performance win in the code that you have:

Numpy set logic

len([x for x in pred.serial_number.values\
                     if x in p_n_tpr_df.serial_number.unique()])

Any line that looks like this is getting the size of the set intersection of pred.serial_number and p_n_tpr_df.serial_number. Using numpy rather than the list comprehension and the unique call will save substantial compute time:

intersect_size = np.intersect1d(pred.serial_number.values,
                                p_n_tpr_df.serial_number.values).shape[0]

answered Mar 12, 2020 at 16:26

hume

2,61325 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

hume Over a year ago

@bowen-peng did this work for you and if so can you accept this answer?

Collectives™ on Stack Overflow

How to optimize such codes as follows in python?

1 Answer 1

Numpy set logic

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Numpy set logic

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related