Motivation
I often answer questions by advocating that people convert dataframe values to the underlying numpy array for quicker calculations. However, there are some caveats to doing this, and some ways of doing it are better than others.
I'll be providing my own answer in an effort to give back to the community. I hope you find it useful.
Problem
Consider the dataframe df
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))
print(df)
A B C D
0 1 x 9 4
1 2 y 8 5
2 3 z 7 6
with dtypes
print(df.dtypes)
A int64
B object
C int64
D int64
dtype: object
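For context on the caveat coming up: df.values must return a single array with one dtype that fits every column, so the object column B forces the whole array to object. A quick check using the frame above:

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))

# mixed dtypes (int64 and object) upcast to a single object array
print(df.values.dtype)  # object
```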
I want to create a numpy array a that consists of the values from columns A and C. Assume that there could be many columns and that I'm targeting two specific columns, A and C.
What I've tried
I could do:
df[['A', 'C']].values
array([[1, 9],
[2, 8],
[3, 7]])
This is accurate!
However, I can do it more quickly with numpy:
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
array([[1, 9],
[2, 8],
[3, 7]], dtype=object)
This is quicker, but the dtype is wrong. Notice the dtype=object; I need integers!
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
array([[1, 9],
[2, 8],
[3, 7]])
This is now correct, but I may not have known ahead of time that all the values were integers.
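One way to generalize the cast (a sketch, not necessarily the ideal answer): derive the target dtype from the selected columns themselves with numpy's result_type, so the astype adapts to whatever those columns actually hold.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))

cols = ['A', 'C']
p = [df.columns.get_loc(i) for i in cols]

# np.result_type computes the common dtype of the chosen columns,
# so this works whether they are int64, float64, or a mix
target = np.result_type(*df.dtypes[cols])
a = df.values[:, p].astype(target)

print(a.dtype)  # int64 for this frame
```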
Timing
# Clear and accurate, but slower
%%timeit
df[['A', 'C']].values
1000 loops, best of 3: 347 µs per loop
# Wrong dtype, but close and fast
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
10000 loops, best of 3: 59.2 µs per loop
# Accurate for this test case and fast, but needs to be generalized
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
10000 loops, best of 3: 59.3 µs per loop
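The absolute numbers above will vary with machine and pandas version. A script-form reproduction with the stdlib timeit module (a sketch; function names are my own) also double-checks that the fast path returns the same values as the clear one:

```python
import timeit
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))

def clear_but_slow():
    return df[['A', 'C']].values

def fast_with_cast():
    p = [df.columns.get_loc(i) for i in ['A', 'C']]
    return df.values[:, p].astype(int)

# sanity check: both paths agree on the values
assert (clear_but_slow() == fast_with_cast()).all()

# absolute timings vary; the relative gap is what matters
print(timeit.timeit(clear_but_slow, number=1000))
print(timeit.timeit(fast_with_cast, number=1000))
```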