I would like to use numpy.unique to obtain the inverse indices (the array returned with return_inverse=True) for two columns of a pandas.DataFrame taken together.

I know how to use it on one column:

u, rev = numpy.unique(df[col], return_inverse=True)

But I want to use it on multiple columns. For example, if the df looks like:

    0   1   
0   1   1
1   1   2
2   2   1
3   2   1
4   3   1

then I would like to get the inverse indices, with identical rows sharing the same index:

[0,1,2,2,3]

2 Answers

Approach #1

Here's one NumPy approach that converts each row to a scalar by treating the row as an indexing tuple on an n-dimensional grid (two-dimensional for 2 columns of data) and computing its linear index -

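import numpy as np

def unique_return_inverse_2D(a): # a is a 2D array of integers
    # Weight each column by the grid size of the columns to its right, so every distinct row gets a distinct scalar
    a1D = a.dot(np.append((a.max(0)+1)[:0:-1].cumprod()[::-1],1))
    return np.unique(a1D, return_inverse=True)[1]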

If the data contains negative numbers, the column minima must be taken into account as well to get unique scalars. In that case, use a.max(0) - a.min(0) + 1 in place of a.max(0) + 1.
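The same idea can be sketched with np.ravel_multi_index, which computes the linear index on the grid directly. This is an illustrative variant rather than the code above: the function name unique_return_inverse_2D_offset is made up here, and it assumes integer data whose per-column ranges are small enough that the grid size does not overflow.

```python
import numpy as np

def unique_return_inverse_2D_offset(a):
    """Inverse indices of the unique rows of a 2D integer array (hypothetical helper)."""
    a = np.asarray(a)
    # Shift every column so its minimum is 0; this handles negative values.
    shifted = a - a.min(axis=0)
    # Grid extent per column: number of possible values after shifting.
    dims = shifted.max(axis=0) + 1
    # Map each row (one index tuple per row) to its linear index on the grid.
    flat = np.ravel_multi_index(tuple(shifted.T), dims)
    return np.unique(flat, return_inverse=True)[1]

a = np.array([[1, 1], [1, 2], [2, 1], [2, 1], [3, 1]])
print(unique_return_inverse_2D_offset(a))  # [0 1 2 2 3]
```

Subtracting the column minima first keeps every index non-negative, which np.ravel_multi_index requires.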

Approach #2

Here's another solution based on NumPy views, with a focus on performance, inspired by this smart solution by @Eric -

def unique_return_inverse_2D_viewbased(a): # a is a 2D array
    a = np.ascontiguousarray(a)
    # View each row as one opaque void scalar spanning the bytes of the whole row
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    return np.unique(a.view(void_dt).ravel(), return_inverse=True)[1]

Sample runs -

In [209]: df
Out[209]: 
    0   1   2   3
0  21   7  31  69
1  62  75  22  62  # ----|
2  16  46   9  31  #     |==> Identical rows, so must have same IDs
3  62  75  22  62  # ----|
4  24  12  88  15

In [210]: unique_return_inverse_2D(df.values)
Out[210]: array([1, 3, 0, 3, 2])

In [211]: unique_return_inverse_2D_viewbased(df.values)
Out[211]: array([1, 3, 0, 3, 2])
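
As a point of comparison, and not one of the two approaches above: NumPy 1.13 and later let np.unique work on whole rows through its axis argument. The sketch below assumes such a NumPy version; the name unique_rows_inverse is made up for illustration, and the ravel() call is only a precaution because the shape of the returned inverse array has differed between NumPy releases.

```python
import numpy as np

def unique_rows_inverse(a):
    """Inverse indices of the unique rows of a 2D array (hypothetical helper)."""
    a = np.asarray(a)
    # axis=0 makes np.unique treat each row as a single element.
    _, inv = np.unique(a, axis=0, return_inverse=True)
    # Flatten defensively so the result is always one-dimensional.
    return np.asarray(inv).ravel()

# Same values as the sample frame above (df.values).
a = np.array([[21,  7, 31, 69],
              [62, 75, 22, 62],
              [16, 46,  9, 31],
              [62, 75, 22, 62],
              [24, 12, 88, 15]])
print(unique_rows_inverse(a))  # [1 3 0 3 2]
```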

I think you can convert the columns to strings and then concatenate them row-wise with sum:

u, rev = np.unique(df.astype(str).values.sum(axis=1), return_inverse=True)
print(rev)
[0 1 2 2 3]

As DSM pointed out (thank you), this is dangerous: concatenated strings cannot distinguish rows such as 11,2 and 1,12.

Another solution is to convert the rows to tuples:

u, rev = np.unique(df.apply(tuple, axis=1), return_inverse=True)
print(rev)
[0 1 2 2 3]

1 Comment

Dangerous. This would be unable to distinguish the row 11,2 from 1,12.
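
The collision can be reproduced with a small sketch. It uses a plain NumPy array instead of a DataFrame to stay self-contained, and the "|" separator is an assumption added here for illustration: it only helps when the separator can never appear in the values, as is the case for integers.

```python
import numpy as np

rows = np.array([[11, 2], [1, 12]])

# Plain concatenation: both rows become the string "112".
joined = np.array(["".join(map(str, r)) for r in rows])
_, rev_joined = np.unique(joined, return_inverse=True)
print(rev_joined)  # [0 0]  (two different rows collapse into one ID)

# With a separator the two rows stay distinct ("11|2" vs "1|12").
separated = np.array(["|".join(map(str, r)) for r in rows])
_, rev_separated = np.unique(separated, return_inverse=True)
```

Note that IDs obtained from strings follow string ordering rather than numeric ordering, so they may be numbered differently from the tuple-based result even when they group the same rows.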
