14

I am struggling with the basic task of constructing a DataFrame of counts by value from a tuple produced by np.unique(arr, return_counts=True), such as:

import numpy as np
import pandas as pd

np.random.seed(123)  
birds=np.random.choice(['African Swallow','Dead Parrot','Exploding Penguin'], size=int(5e4))
someTuple=np.unique(birds, return_counts = True)
someTuple
#(array(['African Swallow', 'Dead Parrot', 'Exploding Penguin'], 
#       dtype='<U17'), array([16510, 16570, 16920], dtype=int64))

First I tried

pd.DataFrame(list(someTuple))
# Returns this:
#                  0            1                  2
# 0  African Swallow  Dead Parrot  Exploding Penguin
# 1            16510        16570              16920

I also tried pd.DataFrame.from_records(someTuple), which returns the same thing.

But what I'm looking for is this:

#              birdType      birdCount
# 0     African Swallow          16510  
# 1         Dead Parrot          16570  
# 2   Exploding Penguin          16920

What's the right syntax?

1
  • Your second method would have been close; it only needs a transpose (.T): pd.DataFrame.from_records(someTuple).T Commented Aug 23, 2016 at 19:18

4 Answers

7

Here's one NumPy based solution with np.column_stack -

pd.DataFrame(np.column_stack(someTuple),columns=['birdType','birdCount'])

Or with np.vstack -

pd.DataFrame(np.vstack(someTuple).T,columns=['birdType','birdCount'])
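
One thing to watch with the stacking approaches: NumPy arrays hold a single dtype, so stacking a string array with an integer array should give a string-typed 2D array, and the counts arrive in the DataFrame as text. A minimal sketch of converting them back, assuming a small hand-made sample (the data and the names some_tuple, stacked and df are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Small hand-made sample (hypothetical data, not the 50,000-row array above)
birds = np.array(['Dead Parrot', 'African Swallow', 'Dead Parrot',
                  'Exploding Penguin', 'Dead Parrot', 'African Swallow'])
some_tuple = np.unique(birds, return_counts=True)

# Stacking strings with integers produces one string-typed 2D array
stacked = np.column_stack(some_tuple)
df = pd.DataFrame(stacked, columns=['birdType', 'birdCount'])

# Convert the counts back to integers before doing arithmetic on them
df['birdCount'] = df['birdCount'].astype(int)
```

If the counts are only displayed and never summed or compared, the conversion step can be skipped.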

Benchmarking np.transpose, np.column_stack and np.vstack for stacking 1D arrays into columns to form a 2D array -

In [54]: tup1 = (np.random.rand(1000),np.random.rand(1000))

In [55]: %timeit np.transpose(tup1)
100000 loops, best of 3: 15.9 µs per loop

In [56]: %timeit np.column_stack(tup1)
100000 loops, best of 3: 11 µs per loop

In [57]: %timeit np.vstack(tup1).T
100000 loops, best of 3: 14.1 µs per loop

2 Comments

These are both very fast numpy solutions, just what I was looking for. An equally fast answer was pd.DataFrame(np.transpose(someTuple), columns=['birdType', 'birdCount']) which another user gave but then deleted (not sure why).
@C8H10N4O2 Added some timings on those three, all look equally fast it seems.
5

Create a dictionary that maps each column name to its array:

pd.DataFrame(dict(birdType=someTuple[0], birdCount=someTuple[1]))
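
A self-contained sketch of the same idea, assuming a small hand-made sample (the data and the names values, counts and df are illustrative). Because each array becomes its own column, the counts should keep their integer dtype, and on recent Python and pandas versions the columns follow the dictionary's insertion order:

```python
import numpy as np
import pandas as pd

# Small hand-made sample (hypothetical data, not the 50,000-row array above)
birds = np.array(['Dead Parrot', 'African Swallow', 'Dead Parrot',
                  'Exploding Penguin', 'Dead Parrot', 'African Swallow'])
values, counts = np.unique(birds, return_counts=True)

# Each key becomes a column; each array keeps its own dtype
df = pd.DataFrame({'birdType': values, 'birdCount': counts})
```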


2 Comments

Nice. I need to start using the plain dictionary constructor with keyword arguments more often. It really is very convenient.
Pining for the fjords!
4

Using your tuple, you can do the following:

In [4]: pd.DataFrame(list(zip(*someTuple)), columns = ['Bird', 'BirdCount'])
Out[4]: 
                Bird  BirdCount
0    African Swallow      16510
1        Dead Parrot      16570
2  Exploding Penguin      16920


2

You could use Counter.

from collections import Counter

c = Counter(birds)

>>> pd.Series(c)
African Swallow      16510
Dead Parrot          16570
Exploding Penguin    16920
dtype: int64

You could also use value_counts on the series.

>>> pd.Series(birds).value_counts()
Exploding Penguin    16920
Dead Parrot          16570
African Swallow      16510
dtype: int64
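
Both results above are Series rather than the two-column DataFrame the question asks for. One possible way to get there from value_counts, sketched on a small hand-made sample (the data and the name counts_df are illustrative), is to name the index and then reset it:

```python
import numpy as np
import pandas as pd

# Small hand-made sample (hypothetical data, not the 50,000-row array above)
birds = np.array(['Dead Parrot', 'African Swallow', 'Dead Parrot',
                  'Exploding Penguin', 'Dead Parrot', 'African Swallow'])

counts_df = (
    pd.Series(birds)
    .value_counts()                  # counts per value, largest first
    .sort_index()                    # alphabetical, matching np.unique's ordering
    .rename_axis('birdType')         # name the index so it becomes a named column
    .reset_index(name='birdCount')   # move the index into a column, name the counts
)
```

Dropping the sort_index() call keeps the rows ordered by count instead.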

