Faster Way to Lookup Values in Numpy Structured Array

Question

While learning to create a hashmap using Numpy and Python 3, I came up with the following code which uses a Numpy structured array data.

However, the time it takes to select a value from a key is quite slow, as shown in the timeit runs comparing 13.3 secs for Numpy structured array data with 0.008 secs for Python dictionary d.

val = data[data['keys'] == key]['values'][0]

Is there a faster way to get the item for a particular key?

import numpy as np
import timeit

N = 1000*1000
keyArr = np.random.randint(0, 1000*1000*1000*4, N)
valArr = np.random.rand(N)
key = keyArr[0]                                     # Select an existing key value

# Numpy structured array
data = np.empty(keyArr.shape[0], dtype=[('keys', keyArr.dtype), ('values', valArr.dtype)])
data['keys'] = keyArr
data['values'] = valArr

val = data[data['keys'] == key]['values'][0]
print(key, '=>', val)                               # 558520981 => 0.17948995177905835
print( timeit.Timer("data[data['keys'] == key]['values'][0]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 13.256318201000001 secs

# Python built-in dictionary
d = {}
for k, v in zip(keyArr, valArr):
    d[k] = v

print(key, '=>', d[key])                            # 558520981 => 0.17948995177905835
print( timeit.Timer("d[key]",       
    globals=globals()).timeit(10*1000) , 'secs' )   # 0.0008061910000000116 secs

Numpy Lookup Method 1: 13.3 secs

val = data[data['keys'] == key]['values'][0]

Numpy Lookup Method 2: 13.4 secs

val = data['values'][np.where(data['keys'] == key)][0]

pandas.Series: 6.8 secs

import pandas as pd

# Pandas Series
s = pd.Series(valArr, index=keyArr, dtype=valArr.dtype)
val = s[key]
print(key, '=>', val)
print( timeit.Timer("s[key]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 6.782590246000002 secs

@RichieV No, I assumed pandas will be slower than numpy since it is built on top of numpy and with additional features. — Athena Wisdom
– Athena Wisdom, Commented Sep 4, 2020 at 18:43
It is definitely slower on normal numpy operations, but it is optimized for this kind of methods, like filtering with row/column labels — RichieV
– RichieV, Commented Sep 4, 2020 at 18:45
@RichieV Good point, let me give pandas a try and update the question with my findings. I'm guessing pandas.Series will be suitable. — Athena Wisdom
– Athena Wisdom, Commented Sep 4, 2020 at 18:48
@RichieV Updated the question with the results when using pandas. The pandas.Series has a 2X faster lookup than the numpy structured array, but it is still about 1000X slower than Python's dictionary — Athena Wisdom
– Athena Wisdom, Commented Sep 4, 2020 at 18:55

David · Accepted Answer · 2020-09-04 19:03:22Z

1

The main source of the problem is that lookup operations like these of numpy and pandas need to check every element in the list, as they are intended to perform multiple selection and more complex lookup operations, too. However, python dictionary can only perform single match lookups, but it's an optimal implementation with binary trees.

So, if your intention is to stick to key access, I don't think you'll find anything faster than a dictionary. Otherwise, I'd put my bet on pandas for the fastest access times.

answered Sep 4, 2020 at 19:03

David

4631 gold badge4 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Faster Way to Lookup Values in Numpy Structured Array

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related