
I would like to get the count of distinct values per group in a pandas DataFrame and write the result to a new column. This is what I have so far:

import pandas as pd

df = pd.DataFrame({
    'OrderNo': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'Barcode': [1234, 2345, 3456, 3456, 1234, 1234, 2345, 1234, 3456]
})

df['barcodeCountPerOrderNo'] = df.groupby(['OrderNo', 'Barcode'])['Barcode'].transform('count')

df['distinctBarcodesPerOrder'] = '?'

print(df)

This gives:

   Barcode  OrderNo  barcodeCountPerOrderNo distinctBarcodesPerOrder
0     1234        1                       1                       ?
1     2345        1                       1                       ?
2     3456        1                       2                       ?
3     3456        1                       2                       ?
4     1234        2                       2                       ?
5     1234        2                       2                       ?
6     2345        2                       1                       ?
7     1234        3                       1                       ?
8     3456        3                       1                       ?

But how can I get distinctBarcodesPerOrder? The desired output is:

   Barcode  OrderNo  barcodeCountPerOrderNo distinctBarcodesPerOrder
0     1234        1                       1                       3
1     2345        1                       1                       3
2     3456        1                       2                       3
3     3456        1                       2                       3
4     1234        2                       2                       2
5     1234        2                       2                       2
6     2345        2                       1                       2
7     1234        3                       1                       2
8     3456        3                       1                       2
  • You can use the drop_duplicates method. See the following documentation for the details: pandas.pydata.org/pandas-docs/stable/generated/… Commented May 8, 2017 at 12:22
  • I am sorry, but it is not clear how you expect to obtain the distinctBarcodesPerOrder column. Could you clarify? Perhaps df.distinctBarcodesPerOrder.unique() can do the trick? Commented May 8, 2017 at 12:36
  • That won't work, since I would like to have the count of distinct barcodes per order (df.distinctBarcodesPerOrder.unique() gives the count over the entire dataframe). Commented May 8, 2017 at 12:40

3 Answers


You can use nunique to calculate the number of unique barcodes per order:

Barcode_distinct = df.groupby('OrderNo')['Barcode'].nunique()

The result is a pandas Series indexed by OrderNo:

> OrderNo
> 1    3
> 2    2
> 3    2
> Name: Barcode, dtype: int64

Then you merge this with the original DataFrame:

df.merge(Barcode_distinct.to_frame(), left_on='OrderNo', right_index=True, suffixes=('', '_unique_per_OrderNo'))

The result is:

>    Barcode  OrderNo  Barcode_unique_per_OrderNo
> 0     1234        1                           3
> 1     2345        1                           3
> 2     3456        1                           3
> 3     3456        1                           3
> 4     1234        2                           2
> 5     1234        2                           2
> 6     2345        2                           2
> 7     1234        3                           2
> 8     3456        3                           2
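
As a self-contained sketch of the two steps above (the variable names barcode_distinct and merged are only illustrative): merge returns a new DataFrame rather than modifying df in place, so the result has to be kept in a variable.

```python
import pandas as pd

df = pd.DataFrame({
    'OrderNo': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'Barcode': [1234, 2345, 3456, 3456, 1234, 1234, 2345, 1234, 3456],
})

# Step 1: one distinct-barcode count per order, indexed by OrderNo.
barcode_distinct = df.groupby('OrderNo')['Barcode'].nunique()

# Step 2: join the counts back onto every row of the original frame.
# Both frames have a 'Barcode' column, so the right-hand one gets the suffix.
merged = df.merge(
    barcode_distinct.to_frame(),
    left_on='OrderNo',
    right_index=True,
    suffixes=('', '_unique_per_OrderNo'),
)

print(merged)
```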

1 Comment

Thanks for being the first with a working solution. I accepted Fabio Lamanna's solution because it was slightly shorter.

I would use map to look up the number of unique barcodes for each row's order and assign it directly to the original dataframe:

df['distinctBarcodesPerOrder'] = df['OrderNo'].map(df.groupby('OrderNo')['Barcode'].nunique())

which returns:

   Barcode  OrderNo  barcodeCountPerOrderNo  distinctBarcodesPerOrder
0     1234        1                       1                         3
1     2345        1                       1                         3
2     3456        1                       2                         3
3     3456        1                       2                         3
4     1234        2                       2                         2
5     1234        2                       2                         2
6     2345        2                       1                         2
7     1234        3                       1                         2
8     3456        3                       1                         2
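
To make the one-liner above easier to follow, here is a sketch that splits it into two steps. The intermediate name per_order is only illustrative. The grouped result is a Series indexed by OrderNo, and map looks up each row's OrderNo in that index.

```python
import pandas as pd

df = pd.DataFrame({
    'OrderNo': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'Barcode': [1234, 2345, 3456, 3456, 1234, 1234, 2345, 1234, 3456],
})

# Series indexed by OrderNo: order 1 has 3 distinct barcodes, orders 2 and 3 have 2 each.
per_order = df.groupby('OrderNo')['Barcode'].nunique()

# For every row, replace its OrderNo with the matching count from per_order.
df['distinctBarcodesPerOrder'] = df['OrderNo'].map(per_order)

print(df)
```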

1 Comment

For the sake of elegance and simplicity, this is the working and accepted answer. Thanks!
If you want a one-liner, you can use apply to get distinctBarcodesPerOrder for each row, although this method might be slow on a large dataset.

df['distinctBarcodesPerOrder'] = df.apply(lambda x: df.loc[df.OrderNo==x.OrderNo,'Barcode'].nunique(), axis=1)

df
Out[237]: 
   Barcode  OrderNo  barcodeCountPerOrderNo  distinctBarcodesPerOrder
0     1234        1                       1                         3
1     2345        1                       1                         3
2     3456        1                       2                         3
3     3456        1                       2                         3
4     1234        2                       2                         2
5     1234        2                       2                         2
6     2345        2                       1                         2
7     1234        3                       1                         2
8     3456        3                       1                         2
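
The row-wise apply above filters the whole frame once per row. If that becomes too slow, one possible alternative (not part of the answer above, just a sketch) is groupby with transform('nunique'), which computes the count once per group and broadcasts it back to every row. The column names viaApply and viaTransform are hypothetical and used only for the comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'OrderNo': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'Barcode': [1234, 2345, 3456, 3456, 1234, 1234, 2345, 1234, 3456],
})

# Row-wise version from the answer: one boolean filter over df per row.
df['viaApply'] = df.apply(
    lambda row: df.loc[df.OrderNo == row.OrderNo, 'Barcode'].nunique(),
    axis=1,
)

# Group-wise version: nunique is computed once per OrderNo and
# the result is aligned back to the original rows.
df['viaTransform'] = df.groupby('OrderNo')['Barcode'].transform('nunique')

print(df)
```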
