pandas grouping data frame - python

Question

I have this dataset,

PRODUCT_ID   SALE_DATE   SALE_PRICE PROVIDER
1            01/02/16    25         1
1            02/10/16    60         1
1            01/11/16    63         2
1            09/10/16    65         3
2            11/11/15    54         1
2            13/01/16    34         2
3            19/05/14    45         1
3            15/10/15    38         1
3            16/06/14    53         2
3            18/10/15    58         2

This is a combined dataset, data provided by different providers, there is no common identifier for each sale. The issue here is each data provider will have slightly different date and price for each sale. So I am trying to group them together as single sale with a group id. So the business logic here is data provider 1 is the first to get sale data, so for a product id, if the sale date from provider 2 or 3 is within 1 month time and the price is within 10$ difference (more or less), we consider them as same sale, else it will be considered as different sale. So, the output should look like,

PRODUCT_ID   SALE_DATE   SALE_PRICE PROVIDER  SALE_GROUP_ID
1            01/02/16    25         1          1
1            02/10/16    60         1          2
1            01/11/16    63         2          2
1            09/10/16    65         3          2
2            11/11/15    54         1          3 
2            13/01/16    34         2          4
3            19/05/14    45         1          5
3            15/10/15    38         1          6
3            16/06/14    53         2          5
3            18/10/15    58         2          7

How do I achieve this in pandas, can someone help please? Thanks.

18/10/15 and 11/11/15 are within one month and have $10 in price difference. Are they same sale? — SKN
– SKN, Commented Dec 18, 2017 at 3:20
Nope. Because, they have different product_id. PRODUCT_ID should be the same to be considered as same sale.. — ds_user
– ds_user, Commented Dec 18, 2017 at 3:21
19/05/14 and 16/06/14 have same PRODUCT_ID, are within one month, have $10 in price difference, and are same sale. Do their sale IDs need to be in sequence? — SKN
– SKN, Commented Dec 18, 2017 at 3:44
Yes, that is why they are considered as same sale. Product_ID is same, within one month and have 10$ difference. So considered as same sale and sale_group_id is same. Nope, they don't need to be in sequence. — ds_user
– ds_user, Commented Dec 18, 2017 at 3:46

BENY · Accepted Answer · 2017-12-18 04:18:05Z

1

We can using groupby after sort_values

df.SALE_DATE=pd.to_datetime(df.SALE_DATE,format='%d/%m/%y')
df=df.sort_values(['PRODUCT_ID','SALE_DATE','SALE_PRICE'])
df1=df.copy()
df1=df.copy().sort_index()


df1['SALE_GROUP_ID']=df.groupby([df['PRODUCT_ID'],(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum(),df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()]).ngroup().add(1).sort_index().values
df1
Out[294]: 
   PRODUCT_ID  SALE_DATE  SALE_PRICE  PROVIDER  SALE_GROUP_ID
0           1 2016-02-01          25         1              1
1           1 2016-10-02          60         1              2
2           1 2016-11-01          63         2              2
3           1 2016-10-09          65         3              2
4           2 2015-11-11          54         1              3
5           2 2016-01-13          34         2              4
6           3 2014-05-19          45         1              5
7           3 2015-10-15          38         1              6
8           3 2014-06-16          53         2              5
9           3 2015-10-18          58         2              7

More info: you require 3 condition when get the group ,

 1. PRODUCT_ID
 2. Date different within in 1 month
 3. SALE_PRICE different less than 10

(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum(), is to getthing the group of date within one month

(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum()
Out[274]: 
0    1
1    2
3    2
2    2
4    3
5    4
6    5
8    5
7    6
9    6
Name: SALE_DATE, dtype: int32

df['SALE_PRICE'].diff().lt(10).ne(1).cumsum() is to get the value within 10

df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()
Out[275]: 
0    1
1    2
3    2
2    2
4    2
5    2
6    3
8    3
7    3
9    4
Name: SALE_PRICE, dtype: int32

After the groupby we just combine (getting the intersection of each group condition), then we using ngroup to get the group Id

edited Dec 18, 2017 at 4:18

answered Dec 18, 2017 at 3:39

BENY

324k22 gold badges176 silver badges250 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

ds_user Over a year ago

Let us continue this discussion in chat.

BENY Over a year ago

@ds_user sorry I can not using the chat ,, you can change to df['SALE_DATE'].diff()/np.timedelta64(1, 'D') this part lt(50) is to handle the bin size

ds_user Over a year ago

Sorry, I still don't understand this. So to mention the day difference, I have to modify 'lt' alone or this also np.timedelta64(1, 'D') to np.timedelta64(50, 'D') . Is this correct ?

df['sale_date'].diff().abs()/np.timedelta64(1, 'D')).lt(50).ne(1).cumsum(),df['net_sale_price'].pct_change().lt(0.1).ne(1).cumsum()

?

BENY Over a year ago

@ds_user you just need to np.timedelta64(1, 'D'), this is to convert the days object to int

Collectives™ on Stack Overflow

pandas grouping data frame - python

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related