Combine two NumPy arrays into one structured array for appending to a PyTables table

Question

I have two unstructured NumPy arrays a and b with shapes (N,) and (N, 256, 2) respectively and dtype np.float64. I wish to combine these into a single structured array with shape (N,) and dtype [('field1', np.float64), ('field2', np.float64, (256, 2))].

The documentation on this is surprisingly lacking. I've found methods like np.lib.recfunctions.merge_arrays but have not been able to find the precise combination of features required to do this.

For the sake of avoiding the XY problem, I'll state my wider aims.

I have a PyTables table with layout {"field1": tables.FloatCol(), "field2": tables.FloatCol(shape = (256, 2))}. The two NumPy arrays represent N new rows to be appended to each of these fields. N is large, so I wish to do this with a single efficient table.append(rows) call, rather than the slow process of looping through table.row['field'] = ....

The table.append documentation says

The rows argument may be any object which can be converted to a structured array compliant with the table structure (otherwise, a ValueError is raised). This includes NumPy structured arrays, lists of tuples or array records, and a string or Python buffer.

Converting my arrays to an appropriate structured array seems to be what I should be doing here. I'm looking for speed, and I anticipate the other options being slower.

I'm not sure whether this question is best phrased in terms of the NumPy problem or the PyTables problem. I've opted for the NumPy problem as it seems more generally applicable and requires less specialist knowledge to answer. A person may be able to answer the NumPy question without knowing PyTables, but not the other way around. I'm open to editing the question to change this emphasis if people think I've made the wrong call.
– Sam, Commented May 30, 2020 at 14:49
Make a np.zeros structured array with the right shape and dtype, and assign the fields individually, by name.
– hpaulj, Commented May 30, 2020 at 15:10
Ah! That's so obvious when you say it @hpaulj! I read the documentation on structured arrays a bit too fast and got it into my head that it wasn't possible to slice them by field at all, but this works. I think I read something somewhere that was out of date. We can use np.empty instead of np.zeros of course. If you would like to post this as an answer with a minimal code sample I'll be happy to accept it.
– Sam, Commented May 30, 2020 at 17:25
Recent versions have made a change in multi-field access; otherwise, creating a structured array remains the same.
– hpaulj, Commented May 30, 2020 at 17:45

hpaulj · Accepted Answer · 2020-05-30 18:25:28Z

Define the dtype, and create an empty/zeros array:

In [163]: dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
In [164]: arr = np.zeros(3, dt)     # zeros display more tidily than np.empty
In [165]: arr                                                                            
Out[165]: 
array([(0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
       (0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
       (0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]])],
      dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])

Assign values field by field:

In [166]: arr['field1'] = np.arange(3)                                                   
In [167]: arr['field2'].shape                                                            
Out[167]: (3, 4, 2)
In [168]: arr['field2'] = np.arange(24).reshape(3,4,2)                                   
In [169]: arr                                                                            
Out[169]: 
array([(0., [[ 0.,  1.], [ 2.,  3.], [ 4.,  5.], [ 6.,  7.]]),
       (1., [[ 8.,  9.], [10., 11.], [12., 13.], [14., 15.]]),
       (2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
      dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])
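
Applied to the shapes in the question, the same two steps might look like the sketch below. The helper name combine_arrays and the random test data are assumptions made for illustration, not part of the original session, and np.empty is used as suggested in the comments since every element is overwritten.

```python
import numpy as np


def combine_arrays(a, b):
    """Pack a (N,) and b (N, 256, 2) into one structured array of shape (N,).

    Hypothetical helper, for illustration only.
    """
    dt = np.dtype([('field1', np.float64),
                   ('field2', np.float64, b.shape[1:])])
    out = np.empty(a.shape[0], dtype=dt)  # uninitialised; filled below
    out['field1'] = a                     # by-field assignment, as above
    out['field2'] = b
    return out


N = 1000
rng = np.random.default_rng(0)            # made-up test data
a = rng.random(N)
b = rng.random((N, 256, 2))
rows = combine_arrays(a, b)
```

The resulting rows array has shape (N,) and could then be passed to table.append(rows) in a single call.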

np.rec does have a function that works similarly:

In [174]: np.rec.fromarrays([np.arange(3.), np.arange(24).reshape(3,4,2)], dtype=dt)     
Out[174]: 
rec.array([(0., [[ 0.,  1.], [ 2.,  3.], [ 4.,  5.], [ 6.,  7.]]),
           (1., [[ 8.,  9.], [10., 11.], [12., 13.], [14., 15.]]),
           (2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
          dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])

The result holds the same data, but as a record array, so fields can also be accessed as attributes. Under the covers it performs the same by-field assignment.
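
A small sketch of the two access styles, reusing the toy (4, 2) dtype from above. The variable names are illustrative assumptions.

```python
import numpy as np

dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
rec = np.rec.fromarrays(
    [np.arange(3.), np.arange(24.).reshape(3, 4, 2)], dtype=dt)

by_attribute = rec.field1    # attribute access, available on record arrays
by_name = rec['field1']      # name-based indexing, as with structured arrays
sub_shape = rec.field2.shape
```

Both forms refer to the same field, so which one to use is mostly a matter of taste.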

numpy.lib.recfunctions is another collection of structured array functions. These too mostly follow the by-field assignment approach.

Simple, and very obvious as soon as you pointed out that I could index structured arrays by field like this. A misreading of the docs led me down a far more complicated path than necessary! I'll be kicking myself over this one. Perfect answer, thanks!

Valdi_Bo · Answer · 2020-05-30 18:36:52Z

In order to have test printouts of decent size, my solution assumes:

N = 5,
the second dimension - only 4 (instead of your 256).

To generate the result, proceed as follows:

Start with import numpy as np and import numpy.lib.recfunctions as rfn (both are needed below).

Create source arrays:

a = np.array([10, 20, 30, 40, 50])
b = np.arange(1, 41).reshape(5, 4, 2)

Create the result:

result = rfn.unstructured_to_structured(
    np.hstack((a[:,np.newaxis], b.reshape(-1,8))),
    np.dtype([('field1', 'f4'), ('field2', 'f4', (4,2))]))

The generated array contains:

array([(10., [[ 1.,  2.], [ 3.,  4.], [ 5.,  6.], [ 7.,  8.]]),
       (20., [[ 9., 10.], [11., 12.], [13., 14.], [15., 16.]]),
       (30., [[17., 18.], [19., 20.], [21., 22.], [23., 24.]]),
       (40., [[25., 26.], [27., 28.], [29., 30.], [31., 32.]]),
       (50., [[33., 34.], [35., 36.], [37., 38.], [39., 40.]])],
      dtype=[('field1', '<f4'), ('field2', '<f4', (4, 2))])

Note that the source array to unstructured_to_structured is created the following way:

Column 0 - from a (converted to a column),
Remaining columns - from b, reshaped so that all elements of each 4 * 2 slice are flattened into a single row of 8 values. The function converts the data from these columns back to the 4 * 2 shape.
Both the above components are assembled with hstack.

In the experiments above I used a type of f4; you may want to change it to f8 to match your data (your decision).

In the target version of the code:

change 4 in the first dimension of field2 to 256,
change 8 in b.reshape to 512 (= 2 * 256).
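
With those two changes applied, the target version might look like the sketch below. The value of N, the f8 type, and the generated test data are assumptions chosen for illustration. Note that hstack builds an intermediate (N, 513) array before the conversion.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

N = 5  # small N for illustration; the question's N is large
a = np.arange(N, dtype=np.float64)
b = np.arange(N * 256 * 2, dtype=np.float64).reshape(N, 256, 2)

dt = np.dtype([('field1', 'f8'), ('field2', 'f8', (256, 2))])

# Column 0 comes from a; the remaining 512 columns are each flattened
# 256 * 2 slice of b.
flat = np.hstack((a[:, np.newaxis], b.reshape(N, 512)))
result = rfn.unstructured_to_structured(flat, dt)
```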

kcw78 · Answer · 2020-05-31 21:58:55Z

This answer builds on @hpaulj's answer. His first method creates the obj argument as a structured array and his second creates a record array. (This array would be the rows argument when you append.) I like both of these methods for creating or appending to tables when my data is already in a structured (or record) array. However, you don't have to do this if your data is in separate arrays (as stated under "avoiding the XY problem"). As noted in the PyTables doc for table.append():

The rows argument may be any object which can be converted to a structured array compliant with the table structure.... This includes NumPy structured arrays, lists of tuples or array records...

In other words, you can append with a list referencing your arrays, so long as they match the table structure created with description=dt in the example. (I think you are limited to structured arrays at creation.) This might simplify your code.

I wrote an example that builds on @hpaulj's code. It creates 2 identical HDF5 files with different methods.

For the first file (_1.h5) I create the table using the structured array method. I then add 3 rows of data to the table using table.append([list of arrays]).
For the second file (_2.h5) I create the table referencing the structured array dtype using description=dt, but do not add data with obj=arr. I then add the first 3 rows of data to the table using table.append([list of arrays]) and repeat to add 3 more rows.

Example below:

import numpy as np
import tables as tb

dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
arr = np.zeros(3, dt)
arr['field1'] = np.arange(3)
arr['field2'] = np.arange(24).reshape(3, 4, 2)

# File 1: create the table from the structured array (rows 0-2)
with tb.open_file('SO_62104084_1.h5', 'w') as h5f1:
    test_tb = h5f1.create_table('/', 'test', obj=arr)
    # add data rows 3-5 referencing a list of arrays
    arr1 = np.arange(13., 16., 1.)
    arr2 = np.arange(124., 148., 1.).reshape(3, 4, 2)
    test_tb.append([arr1, arr2])

# File 2: create an empty table from the dtype only
with tb.open_file('SO_62104084_2.h5', 'w') as h5f2:
    test_tb = h5f2.create_table('/', 'test', description=dt)
    # add data rows 0-2
    arr1 = np.arange(3)
    arr2 = np.arange(24).reshape(3, 4, 2)
    test_tb.append([arr1, arr2])
    # add data rows 3-5
    arr1 = np.arange(13., 16., 1.)
    arr2 = np.arange(124., 148., 1.).reshape(3, 4, 2)
    test_tb.append([arr1, arr2])
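
For comparison, here is a sketch of the structured-array route with the question's (256, 2) layout, reading the rows back to check the result. The file name, the temporary directory, the value of N, and the generated data are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import tables as tb

N = 100
dt = np.dtype([('field1', np.float64), ('field2', np.float64, (256, 2))])

# Build the N new rows by field, as in the accepted answer.
rows = np.empty(N, dtype=dt)
rows['field1'] = np.arange(N)
rows['field2'] = np.arange(N * 512).reshape(N, 256, 2)

path = os.path.join(tempfile.mkdtemp(), 'append_demo.h5')  # throwaway file
with tb.open_file(path, 'w') as h5f:
    table = h5f.create_table('/', 'test', description=dt)
    table.append(rows)       # all N rows in a single call
    table.flush()
    nrows = table.nrows
    stored = table.read()    # structured array read back from the file
```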
