Two Columns of a pandas dataframe - Concat in Python

Question

New to pandas python.

I have a dataframe (df) with two columns of cusips. I want to turn those columns into a list of the unique entries of the two columns.

My first attempt was to do the following:

cusips = pd.concat(df['long'], df['short']).

This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

I have read a few postings, but I am still having trouble with why this comes up. What am I missing here?

Also, what's the most efficient way to select the unique entries in a column or a dataframe? Can I call it in one function? Does the function differ if I want to create a list or a new, one-coulmn dataframe?

Thank you.

Zelazny7 · Accepted Answer · 2013-01-02 16:01:16Z

1

Adding to Hayden's answer, you could also use the set() method for the same result. The performance is slightly better if that's a consideration:

In [28]: %timeit set(np.append(df[0],df[1]))
100000 loops, best of 3: 19.6 us per loop

In [29]: %timeit np.append(df[0].unique(), df[1].unique())
10000 loops, best of 3: 55 us per loop

answered Jan 2, 2013 at 16:01

Zelazny7

40.7k18 gold badges72 silver badges86 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Andy Hayden Over a year ago

set is definitely the way to go here, I am upset it is faster than numpy.unique (which sorts)!

user1911092 Over a year ago

Quick follow-up. Does Set take all or just the unique of df[0] and df[1]. I am assuming the unique only. Any thoughts on this: This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). What should I make of that?

Zelazny7 Over a year ago

Try enclosing your Series in []: cusips = pd.concat([df['long'], df['short']])

Community · Accepted Answer · 2017-05-23 12:14:16Z

To obtain the unique values in a column you can use the unique Series method, which will return a numpy array of the unique values (and it is fast!).

df.long.unique()
# returns numpy array of unique values

You could then use numpy.append:

np.append(df.long.unique(), df.short.unique())

Note: This just appends the two unique results together and so itself is not unique!

.

Here's a (trivial) example:

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [1, 4]], columns=['long','short'])

In [4]: df
Out[4]: 
   long  short
0     1      2
1     1      4

In [5]: df.long.unique()
Out[5]: array([1])

In [6]: df.short.unique()
Out[6]: array([2, 4])

And then appending the resulting two arrays:

In [7]: np.append(df.long.unique(), df.short.unique())
Out[7]: array([1, 2, 4])

Using @Zalazny7's set is significantly faster (since it runs over the array only once) and somewhat upsettingly it's even faster than np.unique (which sorts the resulting array!).

Collectives™ on Stack Overflow

Two Columns of a pandas dataframe - Concat in Python

2 Answers 2

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related