1

I have this dataset,

PRODUCT_ID   SALE_DATE   SALE_PRICE PROVIDER
1            01/02/16    25         1
1            02/10/16    60         1
1            01/11/16    63         2
1            09/10/16    65         3
2            11/11/15    54         1
2            13/01/16    34         2
3            19/05/14    45         1
3            15/10/15    38         1
3            16/06/14    53         2
3            18/10/15    58         2

This is a combined dataset, data provided by different providers, there is no common identifier for each sale. The issue here is each data provider will have slightly different date and price for each sale. So I am trying to group them together as single sale with a group id. So the business logic here is data provider 1 is the first to get sale data, so for a product id, if the sale date from provider 2 or 3 is within 1 month time and the price is within 10$ difference (more or less), we consider them as same sale, else it will be considered as different sale. So, the output should look like,

PRODUCT_ID   SALE_DATE   SALE_PRICE PROVIDER  SALE_GROUP_ID
1            01/02/16    25         1          1
1            02/10/16    60         1          2
1            01/11/16    63         2          2
1            09/10/16    65         3          2
2            11/11/15    54         1          3 
2            13/01/16    34         2          4
3            19/05/14    45         1          5
3            15/10/15    38         1          6
3            16/06/14    53         2          5
3            18/10/15    58         2          7

How do I achieve this in pandas, can someone help please? Thanks.

4
  • 18/10/15 and 11/11/15 are within one month and have $10 in price difference. Are they same sale? Commented Dec 18, 2017 at 3:20
  • Nope. Because, they have different product_id. PRODUCT_ID should be the same to be considered as same sale.. Commented Dec 18, 2017 at 3:21
  • 19/05/14 and 16/06/14 have same PRODUCT_ID, are within one month, have $10 in price difference, and are same sale. Do their sale IDs need to be in sequence? Commented Dec 18, 2017 at 3:44
  • Yes, that is why they are considered as same sale. Product_ID is same, within one month and have 10$ difference. So considered as same sale and sale_group_id is same. Nope, they don't need to be in sequence. Commented Dec 18, 2017 at 3:46

1 Answer 1

1

We can using groupby after sort_values

df.SALE_DATE=pd.to_datetime(df.SALE_DATE,format='%d/%m/%y')
df=df.sort_values(['PRODUCT_ID','SALE_DATE','SALE_PRICE'])
df1=df.copy()
df1=df.copy().sort_index()


df1['SALE_GROUP_ID']=df.groupby([df['PRODUCT_ID'],(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum(),df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()]).ngroup().add(1).sort_index().values
df1
Out[294]: 
   PRODUCT_ID  SALE_DATE  SALE_PRICE  PROVIDER  SALE_GROUP_ID
0           1 2016-02-01          25         1              1
1           1 2016-10-02          60         1              2
2           1 2016-11-01          63         2              2
3           1 2016-10-09          65         3              2
4           2 2015-11-11          54         1              3
5           2 2016-01-13          34         2              4
6           3 2014-05-19          45         1              5
7           3 2015-10-15          38         1              6
8           3 2014-06-16          53         2              5
9           3 2015-10-18          58         2              7

More info: you require 3 condition when get the group ,

 1. PRODUCT_ID
 2. Date different within in 1 month
 3. SALE_PRICE different less than 10

(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum(), is to getthing the group of date within one month

(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum()
Out[274]: 
0    1
1    2
3    2
2    2
4    3
5    4
6    5
8    5
7    6
9    6
Name: SALE_DATE, dtype: int32

df['SALE_PRICE'].diff().lt(10).ne(1).cumsum() is to get the value within 10

df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()
Out[275]: 
0    1
1    2
3    2
2    2
4    2
5    2
6    3
8    3
7    3
9    4
Name: SALE_PRICE, dtype: int32

After the groupby we just combine (getting the intersection of each group condition), then we using ngroup to get the group Id

Sign up to request clarification or add additional context in comments.

4 Comments

@ds_user sorry I can not using the chat ,, you can change to df['SALE_DATE'].diff()/np.timedelta64(1, 'D') this part lt(50) is to handle the bin size
Sorry, I still don't understand this. So to mention the day difference, I have to modify 'lt' alone or this also np.timedelta64(1, 'D') to np.timedelta64(50, 'D') . Is this correct ? df['sale_date'].diff().abs()/np.timedelta64(1, 'D')).lt(50).ne(1).cumsum(),df['net_sale_price'].pct_change().lt(0.1).ne(1).cumsum() ?
@ds_user you just need to np.timedelta64(1, 'D'), this is to convert the days object to int

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.