Motivation
I often answer questions by advocating that people convert dataframe values to the underlying numpy array for quicker calculations. However, there are some caveats to doing this, and some ways of doing it are better than others.
I'll be providing my own answer in an effort to give back to the community. I hope you find it useful.
Problem
Consider the dataframe df
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))
print(df)
A B C D
0 1 x 9 4
1 2 y 8 5
2 3 z 7 6
with dtypes
print(df.dtypes)
A int64
B object
C int64
D int64
dtype: object
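For context on the caveat coming up: df.values must return a single array with one dtype that fits every column, so the object column B forces the whole array to object. A quick check using the frame above:

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))

# mixed dtypes (int64 and object) upcast to a single object array
print(df.values.dtype)  # object
```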
I want to create a numpy array a that consists of the values from columns A and C. Assume that there could be many columns and that I'm targeting two specific columns, A and C.
What I've tried
I could do:
df[['A', 'C']].values
array([[1, 9],
[2, 8],
[3, 7]])
This is accurate!
However, I can do it more quickly with numpy:
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
array([[1, 9],
[2, 8],
[3, 7]], dtype=object)
This is quicker, but the dtype is wrong. Notice the dtype=object; I need integers!
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
array([[1, 9],
[2, 8],
[3, 7]])
This is now correct, but I may not have known ahead of time that all the values were integers.
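One way to generalize the cast (a sketch, not necessarily the ideal answer): derive the target dtype from the selected columns themselves with numpy's result_type, so the astype adapts to whatever those columns actually hold.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))

cols = ['A', 'C']
p = [df.columns.get_loc(i) for i in cols]

# np.result_type computes the common dtype of the chosen columns,
# so this works whether they are int64, float64, or a mix
target = np.result_type(*df.dtypes[cols])
a = df.values[:, p].astype(target)

print(a.dtype)  # int64 for this frame
```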
Timing
# Clear and accurate, but slower
%%timeit
df[['A', 'C']].values
1000 loops, best of 3: 347 µs per loop
# Wrong dtype, but close and fast
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
10000 loops, best of 3: 59.2 µs per loop
# Accurate for this test case and fast, but needs to be generalized
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
10000 loops, best of 3: 59.3 µs per loop
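The absolute numbers above will vary with machine and pandas version. A script-form reproduction with the stdlib timeit module (a sketch; function names are my own) also double-checks that the fast path returns the same values as the clear one:

```python
import timeit
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))

def clear_but_slow():
    return df[['A', 'C']].values

def fast_with_cast():
    p = [df.columns.get_loc(i) for i in ['A', 'C']]
    return df.values[:, p].astype(int)

# sanity check: both paths agree on the values
assert (clear_but_slow() == fast_with_cast()).all()

# absolute timings vary; the relative gap is what matters
print(timeit.timeit(clear_but_slow, number=1000))
print(timeit.timeit(fast_with_cast, number=1000))
```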