The following code constructs a NumPy array with a dtype object:

import numpy as np

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32)
])

arr = np.array([
    [0, 20, 3],
    [1, 21, 2],
    [2, 23, 8],
    [3, 26, 5],
    [4, 31, 9]
]).astype(dt)

The expected result for arr would be:

>>> arr
array([[ 0, 20,  3.],
       [ 1, 21,  2.],
       [ 2, 23,  8.],
       [ 3, 26,  5.],
       [ 4, 31,  9.]])

>>> arr[0]
array([ 0, 20,  3.])

But what the code above actually creates is this:

>>> arr
array([[( 0,  0,  0.), (20, 20, 20.), ( 3,  3,  3.)],
       [( 1,  1,  1.), (21, 21, 21.), ( 2,  2,  2.)],
       [( 2,  2,  2.), (23, 23, 23.), ( 8,  8,  8.)],
       [( 3,  3,  3.), (26, 26, 26.), ( 5,  5,  5.)],
       [( 4,  4,  4.), (31, 31, 31.), ( 9,  9,  9.)]],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])

>>> arr[0]
array([( 0,  0,  0.), (20, 20, 20.), ( 3,  3,  3.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])

Why is NumPy creating a copy of every value for every field instead of mapping each column to its own field (and only that one)? I'm guessing I did something wrong there. Is there a way to get the result I was expecting?

1 Answer

The issue here is that structured array creation expects the input data as a list of tuples. This is mentioned in Structured Datatype Creation, which states that, among other less common methods of array creation, the input data must be a list of tuples, one tuple per row (with one value per field).
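As a quick illustration (a minimal sketch reusing the dt defined in the question, with a hypothetical variable name rows), passing the rows as tuples directly to np.array already gives the intended structured array:

rows = [(0, 20, 3), (1, 21, 2), (2, 23, 8), (3, 26, 5), (4, 31, 9)]
np.array(rows, dtype=dt)
array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.), (4, 31, 9.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])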

So what you can do is turn the rows of your array into a list of tuples (zip is convenient here) and build the structured array from it using np.fromiter, specifying dt as the dtype. Note that arr in the rest of this answer refers to the plain (5, 3) array, i.e. before the .astype(dt) call:

np.fromiter(zip(*arr.T), dtype=dt)
array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.), (4, 31, 9.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
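
Once you have the structured array, columns are accessed by field name rather than by positional index (a short usage sketch, with the result above bound to a hypothetical name structured):

structured = np.fromiter(zip(*arr.T), dtype=dt)
structured['volume']
array([3., 2., 8., 5., 9.], dtype=float32)
structured['index']
array([0, 1, 2, 3, 4], dtype=int32)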

Another (lesser-known) approach, as mentioned by @hpaulj in the comments, is np.lib.recfunctions.unstructured_to_structured, which can be used to construct the structured array directly from arr and the dtype object:

np.lib.recfunctions.unstructured_to_structured(arr, dt)
array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.), (4, 31, 9.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
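
np.lib.recfunctions also provides the inverse, structured_to_unstructured, so going back to a plain 2-D array is straightforward (a quick round-trip sketch; the result dtype comes from promoting the int32 and float32 fields to a common type, which I'd expect to be float64):

from numpy.lib.recfunctions import structured_to_unstructured, unstructured_to_structured

structured = unstructured_to_structured(arr, dt)
back = structured_to_unstructured(structured)   # shape (5, 3) plain ndarray again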

Or, based on this other post, there's also the possibility of creating a record array, an ndarray subclass very similar to a structured array in terms of usage, which comes with several associated helper functions, such as np.core.records.fromarrays, that make creating the array straightforward:

np.core.records.fromarrays(arr.T, 
                           names='index, timestamp, volume', 
                           formats = '<i4, <i4, <f4')
rec.array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.),
           (4, 31, 9.)],
          dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
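
One convenience of record arrays over plain structured arrays is attribute-style field access, in addition to the usual key lookup (a small usage sketch, with the result bound to a hypothetical name rec):

rec = np.core.records.fromarrays(arr.T,
                                 names='index, timestamp, volume',
                                 formats='<i4, <i4, <f4')
rec.volume
array([3., 2., 8., 5., 9.], dtype=float32)
rec['volume']
array([3., 2., 8., 5., 9.], dtype=float32)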

Or to create it from the np.dtype object:

names, dtypes = list(zip(*dt.descr))
np.core.records.fromarrays(arr.transpose(), 
                           names= ', '.join(names), 
                           formats = ', '.join(dtypes))
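
Alternatively, since np.core.records.fromarrays also accepts a dtype argument (an assumption worth double-checking against your NumPy version), the names/formats strings can be skipped and dt passed in directly:

np.core.records.fromarrays(arr.T, dtype=dt)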

Timings comparing the mentioned methods, and some other possible approaches:

a = np.concatenate([arr]*1000, axis=0)

%%timeit 
np.core.records.fromarrays(a.T, 
                           names='index, timestamp, volume', 
                           formats = '<i4, <i4, <f4')
# 57.9 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.lib.recfunctions.unstructured_to_structured(a, dt)
# 79.6 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.fromiter(zip(*a.T), dtype=dt)
# 2.1 ms ± 69.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.fromiter(map(tuple, a), dtype=dt)
# 6.34 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.array(list(zip(*a.T)), dtype=dt)
# 2.17 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
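
As a quick sanity check (a small sketch, not part of the timings above), the different constructors should agree field by field:

s1 = np.lib.recfunctions.unstructured_to_structured(a, dt)
s2 = np.fromiter(zip(*a.T), dtype=dt)
assert all((s1[name] == s2[name]).all() for name in dt.names)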

8 Comments

Ah! But then you'd lose the ability to do things like isolating columns quickly with arr[:, 0], for instance.
I mean, you can use arr["index"], but I'm wondering whether, when using tuples like this, performance would be equivalent to the "pure array" form.
You can index on the names, such as a['index'] @jivan
Well, I'm unsure tbh to what extent working with structured arrays is optimized in numpy as opposed to regular ndarrays, but I'd guess that performance does worsen @jivan
Yes, that's what I'm thinking as well. Gonna stick to regular ndarrays for now, even if that means columns which should be np.int8 get cast into np.float64...
