1

New to pandas python.

I have a dataframe (df) with two columns of cusips. I want to turn those columns into a list of the unique entries of the two columns.

My first attempt was to do the following:

cusips = pd.concat(df['long'], df['short']).

This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

I have read a few postings, but I am still having trouble with why this comes up. What am I missing here?

Also, what's the most efficient way to select the unique entries in a column or a dataframe? Can I call it in one function? Does the function differ if I want to create a list or a new, one-coulmn dataframe?

Thank you.

2 Answers 2

1

Adding to Hayden's answer, you could also use the set() method for the same result. The performance is slightly better if that's a consideration:

In [28]: %timeit set(np.append(df[0],df[1]))
100000 loops, best of 3: 19.6 us per loop

In [29]: %timeit np.append(df[0].unique(), df[1].unique())
10000 loops, best of 3: 55 us per loop
Sign up to request clarification or add additional context in comments.

3 Comments

set is definitely the way to go here, I am upset it is faster than numpy.unique (which sorts)!
Quick follow-up. Does Set take all or just the unique of df[0] and df[1]. I am assuming the unique only. Any thoughts on this: This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). What should I make of that?
Try enclosing your Series in []: cusips = pd.concat([df['long'], df['short']])
1

To obtain the unique values in a column you can use the unique Series method, which will return a numpy array of the unique values (and it is fast!).

df.long.unique()
# returns numpy array of unique values

You could then use numpy.append:

np.append(df.long.unique(), df.short.unique())

Note: This just appends the two unique results together and so itself is not unique!

.

Here's a (trivial) example:

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [1, 4]], columns=['long','short'])

In [4]: df
Out[4]: 
   long  short
0     1      2
1     1      4

In [5]: df.long.unique()
Out[5]: array([1])

In [6]: df.short.unique()
Out[6]: array([2, 4])

And then appending the resulting two arrays:

In [7]: np.append(df.long.unique(), df.short.unique())
Out[7]: array([1, 2, 4])

Using @Zalazny7's set is significantly faster (since it runs over the array only once) and somewhat upsettingly it's even faster than np.unique (which sorts the resulting array!).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.