I have a custom metric, implemented as follows:

import pandas as pd
from datetime import timedelta

def metric(pred: pd.DataFrame, valid: pd.DataFrame):
    date_begin = valid.dt.min()
    date_end = valid.dt.max()
    # earliest date on which a positive label appears in valid
    x = valid[valid.label == 1].dt.min()

    # p: positives in the 30 days after x, over positives in the shifted window
    p_n_tpp_df = valid[(valid.dt >= x) &\
                       (valid.dt <= x + timedelta(days=30)) &\
                       (valid.label == 1)]
    p_n_pp_df =  valid[(valid.dt >= date_begin + timedelta(days=30)) &\
                       (valid.dt <= date_end + timedelta(days=30)) &\
                       (valid.label == 1)]

    # count the rows of pred whose serial_number appears in each subset
    p_n_tpp = len([x for x in pred.serial_number.values\
                     if x in p_n_tpp_df.serial_number.unique()])
    p_n_pp = len([x for x in pred.serial_number.values\
                    if x in p_n_pp_df.serial_number.unique()])

    p = p_n_tpp / p_n_pp
    print('p: ', p)

    # r: same counts, with the window shifted back by 30 days
    p_n_tpr_df = valid[(valid.dt >= date_begin - timedelta(days=30)) &\
                      (valid.dt <= date_end - timedelta(days=30)) &\
                      (valid.label == 1)]
    p_n_pr_df = valid[(valid.dt >= date_begin) &\
                      (valid.dt <= date_end) &\
                      (valid.label == 1)]

    p_n_tpr = len([x for x in pred.serial_number.values\
                     if x in p_n_tpr_df.serial_number.unique()])
    p_n_pr = len([x for x in pred.serial_number.values\
                    if x in p_n_pr_df.serial_number.unique()])

    r = p_n_tpr / p_n_pr
    print('r: ', r)

    # harmonic mean of p and r
    m = 2 * p * r / (p + r)

    return m

pred and valid are both pd.DataFrames with the same columns, and their dt values do not overlap.
All serial_number values in valid are a subset of the serial_number values in pred.
The label column only has two values: 0 or 1.
Here is a sample of pred and valid:


print(pred.head(3))
    serial_number  dt          label  
0   123            2011-03-21  1
1   52             2011-03-22  0
2   12             2011-03-01  1
..., ...


print(pred.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetime64[ns]
label          int8
..., ...

print(valid.head(3))
    serial_number  dt          label  
0   324            2011-04-22  1
1   52             2011-04-22  0
2   14             2011-04-01  1
..., ...


print(valid.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetime64[ns]
label          int8

Each input pd.DataFrame has about 10,000,000 rows and 3 columns.
Calculating this metric is really slow: it takes more than 2 hours on an Intel 9600KF.
How can I optimize this code to reduce the run time?
Thanks in advance.

  • Can you provide an example dataset? Commented Mar 8, 2020 at 10:44
  • should number be serial_number? Commented Mar 8, 2020 at 10:46
  • @ItamarMushkin The sample dataset looks just like the print(pred.head(3)) output. And number is the same as serial_number; I have just corrected it. Commented Mar 8, 2020 at 13:18
  • To help you out, we need a sample of both pred and valid Commented Mar 8, 2020 at 15:09
  • @ItamarMushkin I have updated the details of it. Thanks for your tips. Commented Mar 8, 2020 at 17:26

1 Answer


Here is the biggest performance win in the code that you have:

Numpy set logic

len([x for x in pred.serial_number.values\
                     if x in p_n_tpr_df.serial_number.unique()])

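Any line that looks like this is getting the size of the set intersection of pred.serial_number and p_n_tpr_df.serial_number. The list comprehension tests membership one element at a time in Python and re-evaluates .unique() for every element of pred, which is what makes it so slow on 10,000,000 rows. Using numpy rather than the list comprehension and the unique call will save substantial compute time. Note that np.intersect1d counts each shared serial number once, so it matches the original count only when the serial_number values in pred are unique: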

intersect_size = np.intersect1d(pred.serial_number.values,
                                p_n_tpr_df.serial_number.values).shape[0]
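
If serial_number can repeat in pred, the list comprehension counts rows while np.intersect1d counts distinct values, so the two can disagree. The following is a minimal sketch on made-up toy data (the frames and the name subset are hypothetical stand-ins for pred and a filtered frame such as p_n_tpr_df) comparing the original expression, a vectorized row count with np.isin, and the set-intersection size:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; serial numbers repeat in pred on purpose.
pred = pd.DataFrame({"serial_number": [1, 2, 2, 3, 5]})
subset = pd.DataFrame({"serial_number": [2, 2, 3, 4]})

# Original approach: one Python-level membership test per row of pred,
# with unique() recomputed on every iteration.
slow_count = len([x for x in pred.serial_number.values
                  if x in subset.serial_number.unique()])

# Vectorized row count: same result as the list comprehension,
# duplicates in pred included.
row_count = int(np.isin(pred.serial_number.values,
                        subset.serial_number.values).sum())

# Set-intersection size: each shared serial number is counted once.
unique_count = np.intersect1d(pred.serial_number.values,
                              subset.serial_number.values).shape[0]

print(slow_count, row_count, unique_count)  # prints: 3 3 2
```

Which count is the right one depends on how the metric is defined. If each serial number should contribute once, np.intersect1d is the natural fit; if every row of pred should count, np.isin(...).sum() reproduces the original behaviour.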

1 Comment

@bowen-peng did this work for you and if so can you accept this answer?
