224
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

What is the best way to return the unique values of 'Col1' and 'Col2'?

The desired output is

'Bob', 'Joe', 'Bill', 'Mary', 'Steve'

13 Answers

303

pd.unique returns the unique values from an input array, or DataFrame column or index.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

Note that ravel() is an array method that returns a flattened view of a multidimensional array when possible (otherwise a copy). The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.
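A quick way to see what the order argument does, using a small Fortran-ordered object array like the layout described above (a sketch, not the DataFrame from the question):

import numpy as np

# An illustrative 2x2 object array stored in Fortran (column-major) order
arr = np.asfortranarray(np.array([['Bob', 'Joe'],
                                  ['Joe', 'Steve']], dtype=object))

np.shares_memory(arr, arr.ravel('K'))  # True: 'K' follows memory order, so a view is returned
np.shares_memory(arr, arr.ravel('C'))  # False: row-major order forces a copy here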


An alternative way is to select the columns and pass them to np.unique:

>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

There is no need to use ravel() here as the function handles multidimensional arrays. Even so, this is likely to be slower than pd.unique, as it uses a sort-based algorithm rather than a hash table to identify unique values (which is also why its result is sorted rather than in order of appearance).

The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):

>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop

5 Comments

How do you get a dataframe back instead of an array?
@Lisle: both methods return a NumPy array, so you'll have to construct it manually, e.g., pd.DataFrame(unique_values). There's no good way to get back a DataFrame directly.
@Lisle since he has used pd.unique it returns a numpy.ndarray as a final output. Is this what you were asking?
@Lisle, maybe this one df = df.drop_duplicates(subset=['C1','C2','C3'])?
To get only the columns you need into a dataframe, you could do df.groupby(['C1', 'C2', 'C3']).size().reset_index().drop(columns=0). The groupby picks the unique combinations by default and counts the items per group, reset_index turns the multi-index back into a flat two-dimensional frame, and the final drop removes the item-count column.
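A minimal sketch of the wrap-it-in-a-DataFrame suggestion from the comments, using the df from the question (the column name unique_values is just illustrative):

unique_vals = pd.unique(df[['Col1', 'Col2']].values.ravel('K'))

# Wrap the 1-D result in a single-column DataFrame
unique_df = pd.DataFrame(unique_vals, columns=['unique_values'])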
19

I have set up a DataFrame with a few simple strings in its columns:

>>> df
   a  b
0  a  g
1  b  h
2  d  a
3  e  e

You can concatenate the columns you are interested in and call the unique function:

>>> pandas.concat([df['a'], df['b']]).unique()
array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)
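Applied to the columns from the question, the same pattern would be (a sketch, assuming import pandas as pd):

>>> pd.concat([df['Col1'], df['Col2']]).unique()
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)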

1 Comment

This doesn't work when you have something like this: this_is_uniuqe = { 'col1': ["Hippo", "H"], "col2": ["potamus", "ippopotamus"], }
12
In [5]: set(df.Col1).union(set(df.Col2))
Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}

Or:

set(df.Col1) | set(df.Col2)
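The same idea extends to an arbitrary list of columns, since set.union accepts any iterables; a sketch:

cols = ['Col1', 'Col2']

# Union the values of every listed column in one call
set().union(*(df[c] for c in cols))
# {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}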

Comments

10

An updated solution: NumPy v1.13+ requires specifying the axis in np.unique when using multiple columns; otherwise the array is implicitly flattened.

import numpy as np

np.unique(df[['col1', 'col2']], axis=0)

This change was introduced in Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be
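A small numeric illustration of the two behaviours (note that the axis argument does not support object-dtype arrays, so the question's string columns would first need a non-object dtype):

import numpy as np

arr = np.array([[1, 2],
                [1, 2],
                [3, 4]])

np.unique(arr, axis=0)  # unique rows: [[1, 2], [3, 4]]
np.unique(arr)          # no axis: implicitly flattened -> [1, 2, 3, 4]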

Comments

4

For those of us who love all things pandas, apply, and of course lambda functions:

df['Col3'] = df[['Col1', 'Col2']].apply(lambda x: ''.join(x), axis=1)

Comments

3

Here's another way:


import numpy as np
set(np.concatenate(df.values))
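Note that df.values also pulls in Col3; restricting the concatenation to the two columns of interest would look like this (a sketch):

import numpy as np
set(np.concatenate(df[['Col1', 'Col2']].values))
# {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}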

Comments

3

You can use stack to combine multiple columns and drop_duplicates to find unique values:

df[['Col1', 'Col2']].stack().drop_duplicates().tolist()

Output:

['Bob', 'Joe', 'Steve', 'Bill', 'Mary']

Comments

2

Non-pandas solution: using set().

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

print(df)

print(set(df.Col1.append(df.Col2).values))

Output:

   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027
{'Steve', 'Bob', 'Bill', 'Joe', 'Mary'}
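Series.append was removed in pandas 2.0, so on newer versions the same idea can be written with pd.concat; a sketch:

print(set(pd.concat([df.Col1, df.Col2]).values))
# same five unique names as above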

Comments

2

A variation on @Mykola Zotko's answer using stack and unique is the most intuitive to me:

df[['Col1', 'Col2']].stack().unique()

1 Comment

I hope OP comes back and marks this as the preferred answer. Getting unique values from a DataFrame should not be a programming challenge judging by some answers concocted here.
1
df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

If your question is how to get the unique values of each column individually:

Put the column labels in a list

column_labels = ['Col1', 'Col2']

Create an empty dict

unique_dict = {}

Iterate over selected columns to get their unique values

for column_label in column_labels: 
    unique_values = df[column_label].unique()
    unique_dict.update({column_label: unique_values})
unique_ser = pd.Series(unique_dict)
print(unique_ser)
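The loop can also be written as a dict comprehension, which produces the same Series a bit more compactly (a sketch):

unique_ser = pd.Series({col: df[col].unique() for col in column_labels})
print(unique_ser)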

Comments

0
list(set(df[['Col1', 'Col2']].as_matrix().reshape((1,-1)).tolist()[0]))

The output will be ['Mary', 'Joe', 'Steve', 'Bob', 'Bill']

2 Comments

DataFrame object has no attribute as_matrix.
It depends on which version you are using. Please see pandas.pydata.org/pandas-docs/version/0.25.1/reference/api/…
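On pandas versions where as_matrix has been removed (1.0 and later), .to_numpy() is the documented replacement; an equivalent sketch:

list(set(df[['Col1', 'Col2']].to_numpy().reshape((1, -1)).tolist()[0]))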
0

Get a list of unique values given a list of column names:

cols = ['col1','col2','col3','col4']
unique_l = pd.concat([df[col] for col in cols]).unique()

Comments

-1
import pandas as pd
df = pd.DataFrame({'col1': ["a", "a", "b", "c", "c", "d"],
                   'col2': ["x", "x", "y", "y", "z", "w"],
                   'col3': [1, 2, 2, 3, 4, 2]})
df

The output is:

  col1 col2 col3
0   a   x   1
1   a   x   2
2   b   y   2
3   c   y   3
4   c   z   4
5   d   w   2

To get the unique values from all the columns:

a = {}
for i in range(df.shape[1]):
    j = df.columns[i]
    a[j] = df.iloc[:, i].unique()

for p, q in a.items():
    print(f"unique values in {p} are {list(q)}")

The output is:

unique values in col1 are ['a', 'b', 'c', 'd']
unique values in col2 are ['x', 'y', 'z', 'w']
unique values in col3 are [1, 2, 3, 4]

Comments
