1

While learning to create a hashmap using Numpy and Python 3, I came up with the following code which uses a Numpy structured array data.

However, the time it takes to select a value from a key is quite slow, as shown in the timeit runs comparing 13.3 secs for Numpy structured array data with 0.008 secs for Python dictionary d.

val = data[data['keys'] == key]['values'][0]

Is there a faster way to get the item for a particular key?

import numpy as np
import timeit

N = 1000*1000
keyArr = np.random.randint(0, 1000*1000*1000*4, N)
valArr = np.random.rand(N)
key = keyArr[0]                                     # Select an existing key value

# Numpy structured array
data = np.empty(keyArr.shape[0], dtype=[('keys', keyArr.dtype), ('values', valArr.dtype)])
data['keys'] = keyArr
data['values'] = valArr

val = data[data['keys'] == key]['values'][0]
print(key, '=>', val)                               # 558520981 => 0.17948995177905835
print( timeit.Timer("data[data['keys'] == key]['values'][0]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 13.256318201000001 secs

# Python built-in dictionary
d = {}
for k, v in zip(keyArr, valArr):
    d[k] = v

print(key, '=>', d[key])                            # 558520981 => 0.17948995177905835
print( timeit.Timer("d[key]",       
    globals=globals()).timeit(10*1000) , 'secs' )   # 0.0008061910000000116 secs

Numpy Lookup Method 1: 13.3 secs

val = data[data['keys'] == key]['values'][0]

Numpy Lookup Method 2: 13.4 secs

val = data['values'][np.where(data['keys'] == key)][0]

pandas.Series: 6.8 secs

import pandas as pd

# Pandas Series
s = pd.Series(valArr, index=keyArr, dtype=valArr.dtype)
val = s[key]
print(key, '=>', val)
print( timeit.Timer("s[key]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 6.782590246000002 secs
7
  • have you looked into pandas? Commented Sep 4, 2020 at 18:28
  • @RichieV No, I assumed pandas will be slower than numpy since it is built on top of numpy and with additional features. Commented Sep 4, 2020 at 18:43
  • 1
    It is definitely slower on normal numpy operations, but it is optimized for this kind of methods, like filtering with row/column labels Commented Sep 4, 2020 at 18:45
  • @RichieV Good point, let me give pandas a try and update the question with my findings. I'm guessing pandas.Series will be suitable. Commented Sep 4, 2020 at 18:48
  • @RichieV Updated the question with the results when using pandas. The pandas.Series has a 2X faster lookup than the numpy structured array, but it is still about 1000X slower than Python's dictionary Commented Sep 4, 2020 at 18:55

1 Answer 1

1

The main source of the problem is that lookup operations like these of numpy and pandas need to check every element in the list, as they are intended to perform multiple selection and more complex lookup operations, too. However, python dictionary can only perform single match lookups, but it's an optimal implementation with binary trees.

So, if your intention is to stick to key access, I don't think you'll find anything faster than a dictionary. Otherwise, I'd put my bet on pandas for the fastest access times.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.