Input

I have many numpy structured arrays in a list like this example:

import numpy

a1 = numpy.array([(1, 2), (3, 4), (5, 6)], dtype=[('x', int), ('y', int)])

a2 = numpy.array([(7,10), (8,11), (9,12)], dtype=[('z', int), ('w', float)])

arrays = [a1, a2]

Desired Output

What is the correct way to join them all together to create a unified structured array like the following?

desired_result = numpy.array([(1, 2, 7, 10), (3, 4, 8, 11), (5, 6, 9, 12)],
                             dtype=[('x', int), ('y', int), ('z', int), ('w', float)])

Current Approach

This is what I'm currently using, but it is very slow, so I suspect there must be a more efficient way.

from numpy.lib.recfunctions import append_fields

def join_struct_arrays(arrays):
    for array in arrays:
        try:
            result = append_fields(result, array.dtype.names, [array[name] for name in array.dtype.names], usemask=False)
        except NameError:
            result = array

    return result

4 Answers

You can also use the function merge_arrays of numpy.lib.recfunctions:

import numpy.lib.recfunctions as rfn
rfn.merge_arrays(arrays, flatten = True, usemask = False)

Out[52]: 
array([(1, 2, 7, 10.0), (3, 4, 8, 11.0), (5, 6, 9, 12.0)], 
     dtype=[('x', '<i4'), ('y', '<i4'), ('z', '<i4'), ('w', '<f8')])
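
For anyone who wants to try this without the rest of the page, here is a self-contained sketch of the same call using the arrays from the question. The variable name merged is only chosen for illustration, and the printed integer width ('<i4' vs '<i8') depends on the platform.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Arrays from the question.
a1 = np.array([(1, 2), (3, 4), (5, 6)], dtype=[('x', int), ('y', int)])
a2 = np.array([(7, 10), (8, 11), (9, 12)], dtype=[('z', int), ('w', float)])

# flatten=True merges the fields into one flat dtype instead of nesting them;
# usemask=False returns a plain ndarray rather than a masked array.
merged = rfn.merge_arrays([a1, a2], flatten=True, usemask=False)

print(merged.dtype.names)  # ('x', 'y', 'z', 'w')
print(merged['w'])         # the float column taken from a2
```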

1 Comment

This is more readable and 1.32 times faster than my original solution. Thanks!

Here is an implementation that should be faster. It converts everything to arrays of numpy.uint8 and does not use any temporaries.

def join_struct_arrays(arrays):
    sizes = numpy.array([a.itemsize for a in arrays])
    offsets = numpy.r_[0, sizes.cumsum()]
    n = len(arrays[0])
    joint = numpy.empty((n, offsets[-1]), dtype=numpy.uint8)
    for a, size, offset in zip(arrays, sizes, offsets):
        joint[:,offset:offset+size] = a.view(numpy.uint8).reshape(n,size)
    dtype = sum((a.dtype.descr for a in arrays), [])
    return joint.ravel().view(dtype)

Edit: Simplified the code and avoided the unnecessary as_strided().
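
To make the byte-level trick above a bit more concrete, here is a small sketch of just the view/reshape step on a single array. The field types (int32 and float64) and the names raw and back are assumptions picked for the illustration, so that the item size is predictable.

```python
import numpy as np

# A packed structured array: 4 bytes (int32) + 8 bytes (float64) per record.
a = np.array([(1, 2.0), (3, 4.0)],
             dtype=[('x', np.int32), ('y', np.float64)])

# Reinterpret the records as raw bytes, one row per record.
raw = a.view(np.uint8).reshape(len(a), a.itemsize)
print(raw.shape)  # (2, 12)

# Viewing the flattened bytes with the original dtype recovers the records.
back = raw.ravel().view(a.dtype)
print(back)
```

Note that this only works on contiguous arrays, because view() reinterprets the underlying buffer in place.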

7 Comments

This is 166 times faster than my original solution. I would have never come up with that on my own. Thanks!
@Jon-Eric: I simplified the code a bit (and threw out as_strided()). I hope this did not affect the performance. Also be sure to have a look at joris' second answer.
@Jon-Eric: Did you say it was 166 times faster, not 1.66 times faster? Just want to confirm.
@Hans Yes, 166 times faster. (When I originally measured.)
@Hans This is a rather low-level approach which basically just copies the old arrays as blocks of memory, completely ignoring their structure. It's clear why this can be done rather quickly, but to understand why the original solution is so slow you need to profile the code. It could be for many reasons, including the one you mentioned.

Here is yet another way, a little more readable and, I think, also a lot faster than the original:

def join_struct_arrays(arrays):
    newdtype = []
    for a in arrays:
        descr = []
        for field in a.dtype.names:
            (typ, _) = a.dtype.fields[field]  # fields maps name -> (dtype, offset)
            descr.append((field, typ))
        newdtype.extend(descr)
    # Allocate the result once, then copy each field (column) across.
    newrecarray = numpy.zeros(len(arrays[0]), dtype=newdtype)
    for a in arrays:
        for name in a.dtype.names:
            newrecarray[name] = a[name]
    return newrecarray

EDIT: With Sven's suggestions it becomes the following (a little bit slower, but actually pretty readable):

def join_struct_arrays2(arrays):
    # Concatenate the field descriptions of all input arrays.
    newdtype = sum((a.dtype.descr for a in arrays), [])
    newrecarray = numpy.empty(len(arrays[0]), dtype=newdtype)
    for a in arrays:
        for name in a.dtype.names:
            newrecarray[name] = a[name]
    return newrecarray
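
Since the relative speed of these variants depends on array size, here is a rough timing sketch on larger inputs. Everything in it is an assumption for illustration: the names join_by_fields, join_by_bytes, b1, b2 and big are made up, the row count is arbitrary, and the dtype is built from dtype.names instead of dtype.descr. It makes no claim about which variant wins; the numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

def combined_dtype(arrays):
    # Field list built from names, assuming packed (unpadded) input dtypes.
    return [(name, a.dtype[name]) for a in arrays for name in a.dtype.names]

def join_by_fields(arrays):
    # Copy column by column into a preallocated result.
    out = np.empty(len(arrays[0]), dtype=combined_dtype(arrays))
    for a in arrays:
        for name in a.dtype.names:
            out[name] = a[name]
    return out

def join_by_bytes(arrays):
    # Copy each array as a block of raw bytes, ignoring its field structure.
    sizes = np.array([a.itemsize for a in arrays])
    offsets = np.r_[0, sizes.cumsum()]
    n = len(arrays[0])
    joint = np.empty((n, offsets[-1]), dtype=np.uint8)
    for a, size, offset in zip(arrays, sizes, offsets):
        joint[:, offset:offset + size] = a.view(np.uint8).reshape(n, size)
    return joint.ravel().view(combined_dtype(arrays))

n = 100_000  # arbitrary size, large enough to dominate call overhead
b1 = np.zeros(n, dtype=[('x', int), ('y', int)])
b2 = np.zeros(n, dtype=[('z', int), ('w', float)])
b1['x'] = np.arange(n)
b2['w'] = np.arange(n) / 2.0
big = [b1, b2]

for func in (join_by_fields, join_by_bytes):
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(f"{func.__name__}: {seconds / 5:.6f} s per call")
```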

4 Comments

Nice, +1! Two suggestions: 1. Use numpy.empty() instead of numpy.zeros() -- it's not necessary to initialise the data. 2. Substitute the first seven lines by the last but one line of my code.
Thanks! That really simplified the code. On the other hand, I tested it with %timeit in IPython, and substituting those 7 lines with your last but one line made it two times slower. I also compared it with your solution, which appeared to be around 5 times slower than mine. But I guess that when the number of elements in the list of arrays increases, your solution will become better?
To get meaningful timings, you'd need to use big arrays. And I would expect your solution to be at least on par with mine as far as performance is concerned. Note that using empty() instead of zeros() should speed things up a bit.
I think it's wrong to use dtype.descr. According to the documentation: "Warning: This attribute exists specifically for PEP3118 compliance, and is not a datatype description compatible with np.dtype."
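
One way to avoid dtype.descr, sketched here as an assumption rather than as the documented replacement, is to build the field list from dtype.names and index the dtype by name. The helper name field_spec and the aligned example dtype are made up for the illustration.

```python
import numpy as np

def field_spec(arr):
    """Return [(name, dtype), ...] for the named fields of a structured array."""
    return [(name, arr.dtype[name]) for name in arr.dtype.names]

# An aligned dtype has padding between 'a' (1 byte) and 'b' (4 bytes).
aligned = np.zeros(2, dtype=np.dtype([('a', 'u1'), ('b', 'i4')], align=True))
aligned['a'] = [1, 2]
aligned['b'] = [10, 20]

# The spec only lists the named fields, so the new dtype is packed.
spec = field_spec(aligned)
packed = np.empty(aligned.shape, dtype=spec)
for name in aligned.dtype.names:
    packed[name] = aligned[name]

print(aligned.dtype.itemsize, packed.dtype.itemsize)  # 8 5
```

Because the resulting dtype is packed, this fits the field-by-field copy shown in the answers; the raw byte-copy approach would additionally need the inputs to be unpadded.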
def join_struct_arrays(*arrs):
    # Iterate over dtype.names so the field order is the declared order.
    dtype = [(name, arr.dtype[name]) for arr in arrs for name in arr.dtype.names]
    r = numpy.empty(arrs[0].shape, dtype=dtype)
    for a in arrs:
        for name in a.dtype.names:
            r[name] = a[name]
    return r

1 Comment

Maybe the for loop can be improved, but it's the fastest at the moment.
