I use an external library function which returns a numpy structured array.

cities_array
>>> array([ (1, [-122.46818353792992, 48.74387985436505], u'05280', u'Bellingham', u'53', u'Washington', u'5305280', u'city', u'N', -99, 52179),
       (2, [-109.67985528815007, 48.54381826401885], u'35050', u'Havre', u'30', u'Montana', u'3035050', u'city', u'N', 2494, 10201),
       (3, [-122.63068540357023, 48.49221584868184], u'01990', u'Anacortes', u'53', u'Washington', u'5301990', u'city', u'N', -99, 11451),
       ...,
       (3147, [-156.45657614262274, 20.870633142444376], u'22700', u'Kahului', u'15', u'Hawaii', u'1522700', u'census designated place', u'N', 7, 16889),
       (3148, [-156.45038252004554, 20.76059218396], u'36500', u'Kihei', u'15', u'Hawaii', u'1536500', u'census designated place', u'N', -99, 11107),
       (3149, [-155.08472452266503, 19.693112205773275], u'14650', u'Hilo', u'15', u'Hawaii', u'1514650', u'census designated place', u'N', 38, 37808)], 
      dtype=[('ID', '<i4'), ('Shape', '<f8', (2,)), ('CITY_FIPS', '<U5'), ('CITY_NAME', '<U40'), ('STATE_FIPS', '<U2'), ('STATE_NAME', '<U25'), ('STATE_CITY', '<U7'), ('TYPE', '<U25'), ('CAPITAL', '<U1'), ('ELEVATION', '<i4'), ('POP1990', '<i4')])

The cities_array is of type <type 'numpy.ndarray'>.

I am able to access individual columns of the array:

cities_array[['ID','CITY_NAME']]
>>> array([(1, u'Bellingham'), (2, u'Havre'), (3, u'Anacortes'), ...,
       (3147, u'Kahului'), (3148, u'Kihei'), (3149, u'Hilo')], 
      dtype=[('ID', '<i4'), ('CITY_NAME', '<U40')])

Now I want to delete the first column, ID. The help and other SO questions say numpy.delete should do this.

When running numpy.delete(cities_array, cities_array['ID'], 1) I get the error message:

...in delete
    N = arr.shape[axis]
IndexError: tuple index out of range

What am I doing wrong? Should I post-process the cities_array to be able to work with the array?

I am on Python 2.7.10 and numpy 1.11.0

  • As shown in the answer you can view a subset of the dtype names. It's not a true delete. There is also a library of recfuncs that might implement a deleting copy. Commented Apr 25, 2016 at 13:06
  • numpy.lib.recfunctions.drop_fields Commented Apr 25, 2016 at 13:17
  • @hpaulj, thanks for the comment, good to know there is an external library for that. But isn't it strange that such a basic operation fails? Even a simple array x = numpy.zeros(3, dtype={'names':['col1', 'col2'], 'formats':['i4','f4']}) fails to delete a column with numpy.delete(x,0,1). What is the root cause of this issue, any ideas? Commented Apr 25, 2016 at 14:04

2 Answers

I think that this should work:

def delete_column(array, *args):
    # keep every field whose name is not listed in args
    filtered = [x for x in array.dtype.names if x not in args]
    return array[filtered]

Example with array:

a
Out[9]: 
array([(1, [-122.46818353792992, 48.74387985436505])], 
      dtype=[('ID', '<i4'), ('Shape', '<f8', (2,))])

delete_column(a, 'ID')
Out[11]: 
array([([-122.46818353792992, 48.74387985436505],)], 
      dtype=[('Shape', '<f8', (2,))])
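A caveat for readers on a newer NumPy: since 1.16, multi-field indexing like array[filtered] returns a view rather than the copy it produced on the 1.11 in the question. If you need an independent, packed copy, numpy.lib.recfunctions.repack_fields should do it. A minimal sketch (the array a here mirrors the example above):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.array([(1, [-122.468, 48.744])],
             dtype=[('ID', '<i4'), ('Shape', '<f8', (2,))])

sub = a[['Shape']]               # NumPy >= 1.16: a view, dtype keeps an offset
packed = rfn.repack_fields(sub)  # independent array with a packed dtype
print(packed.dtype.names)        # ('Shape',)
```

On NumPy 1.11 the multi-field selection is already a copy, so this extra step is not needed there.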



You comment:

But isn't it strange that such a basic operation fails? Even a simple array x = numpy.zeros(3, dtype={'names':['col1', 'col2'], 'formats':['i4','f4']}) fails to delete a column with numpy.delete(x,0,1). What is the root cause of this issue, any ideas?

np.delete isn't a basic operation. Look at its code: it's about five screens long (in IPython). A lot of that handles the different ways you can specify the elements to delete.

For np.delete(x, 0, axis=1)

it uses a special case

    # optimization for a single value
    ...
    newshape[axis] -= 1
    new = empty(newshape, arr.dtype, arrorder)
    slobj[axis] = slice(None, obj)
    new[slobj] = arr[slobj]
    slobj[axis] = slice(obj, None)
    slobj2 = [slice(None)]*ndim
    slobj2[axis] = slice(obj+1, None)
    new[slobj] = arr[slobj2]

For a 2d array, and axis=1 it does:

new = np.zeros((x.shape[0], x.shape[1]-1), dtype=x.dtype)
new[:, :obj] = x[:, :obj]
new[:, obj:] = x[:, obj+1:]

In other words, it allocates a new array with 1 less column, and then copies two slices from the original to it.
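That slice-and-copy logic is easy to verify directly. This sketch (my own spelling-out, not the library source) applies the single-index optimization to a plain 2-d array and compares the result with np.delete:

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)
obj = 1  # column index to delete

# allocate an array with one fewer column, then copy the two slices
new = np.empty((x.shape[0], x.shape[1] - 1), dtype=x.dtype)
new[:, :obj] = x[:, :obj]
new[:, obj:] = x[:, obj + 1:]

print(np.array_equal(new, np.delete(x, obj, axis=1)))  # True
```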

With multiple columns to delete, or a boolean obj, it takes other routes.

Notice that fundamental to that action is the ability to index the 2 dimensions.

But you can't index your x that way. x[0,1] gives a too many indices error. You have to use x[0]['col1']. Indexing the fields of a dtype is fundamentally different from indexing the columns of a 2d array.
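A quick demonstration of that difference, using the x from the comment above — the structured array is one-dimensional however many fields it has, so 2-d style indexing raises the same kind of IndexError:

```python
import numpy as np

x = np.zeros(3, dtype={'names': ['col1', 'col2'], 'formats': ['i4', 'f4']})
print(x.shape)          # (3,): one dimension, whatever the field count

raised = False
try:
    x[0, 1]             # 2-d style indexing does not apply here
except IndexError:
    raised = True
print(raised)           # True

print(x[0]['col1'])     # fields are indexed by name instead
```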

The recfunctions manipulate the dtype fields in ways that regular numpy functions don't. Based on previous study, I'm guessing that drop_fields does something like this:

In [57]: x    # your x with some values
Out[57]: 
array([(1, 3.0), (2, 2.0), (3, 1.0)], 
      dtype=[('col1', '<i4'), ('col2', '<f4')])

Target array, with different dtype (missing one field)

In [58]: y=np.zeros(x.shape, dtype=x.dtype.descr[1:])

copy values, field by field:

In [60]: for name in y.dtype.names:
    ...:     y[name]=x[name]
In [61]: y
Out[61]: 
array([(3.0,), (2.0,), (1.0,)], 
      dtype=[('col2', '<f4')])

Regular n-d indexing is built around the shape and strides attributes. With these (and the element byte size) it can quickly identify the location in the data buffer of a desired element.

With a compound dtype, shape and strides work the same way, but the byte layout differs. Your x has nbytes 24: 12 for each of the i4 and f4 fields. Each record is 8 bytes (the itemsize), so regular indexing steps from one 8-byte record to the next. To select the 'col2' field, it has to take the further step of selecting the second set of 4 bytes within each record.

Where possible I think it translates field selection into regular indexing. __array_interface__ is a nice dictionary of the essential attributes of an array.

In [70]: x.__array_interface__
Out[70]: 
{'data': (68826112, False),
 'descr': [('col1', '<i4'), ('col2', '<f4')],
 'shape': (3,),
 'strides': None,
 'typestr': '|V8',
 'version': 3}

In [71]: x['col2'].__array_interface__
Out[71]: 
{'data': (68826116, False),
 'descr': [('', '<f4')],
 'shape': (3,),
 'strides': (8,),
 'typestr': '<f4',
 'version': 3}

The second array points to the same data buffer, but 4 bytes further along (the first col2 value). In effect it is a view.
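The view behaviour is easy to check: the strides match, and writing through the single-field selection changes the original array. A small demonstration with the same x:

```python
import numpy as np

x = np.array([(1, 3.0), (2, 2.0), (3, 1.0)],
             dtype=[('col1', '<i4'), ('col2', '<f4')])

print(x.strides)        # (8,): one 8-byte record per step
col2 = x['col2']        # single-field selection: a view into the same buffer
print(col2.strides)     # (8,): same stride, data pointer offset by 4 bytes

col2[0] = 99.0          # writing through the view ...
print(x[0])             # ... changes the original record
```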

(np.transpose is another function that does not operate across the dtype boundary.)

===================

Here's the code for drop_fields (summarized):

In [74]: from numpy.lib import recfunctions  # separate import statement
In [75]: recfunctions.drop_fields??

def drop_fields(base, drop_names, usemask=True, asrecarray=False):
    ....  # internal _drop_descr helper defined here
    newdtype = _drop_descr(base.dtype, drop_names)
    output = np.empty(base.shape, dtype=newdtype)
    output = recursive_fill_fields(base, output)
    return output

recursive_fill_fields does a name by name field copy, and is able to handle dtypes that define fields within fields (the recursive part).
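A small sketch of that recursive case (the dtype names here are made up for illustration): drop_fields copes with a dtype that has fields nested inside a field, which a plain name-by-name loop over the top level would also handle, but recursive_fill_fields fills correctly at every depth.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# hypothetical dtype with fields nested inside a field
nested = np.zeros(2, dtype=[('a', '<i4'),
                            ('pos', [('x', '<f8'), ('y', '<f8')])])

out = rfn.drop_fields(nested, 'a')
print(out.dtype.names)         # ('pos',)
print(out['pos'].dtype.names)  # ('x', 'y')
```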

In [81]: recfunctions.drop_fields(x, 'col1')
Out[81]: 
array([(3.0,), (2.0,), (1.0,)], 
      dtype=[('col2', '<f4')])

In [82]: x[['col2']]  # multifield selection that David suggests
Out[82]: 
array([(3.0,), (2.0,), (1.0,)], 
      dtype=[('col2', '<f4')])

In [83]: x['col2']     # single field view
Out[83]: array([ 3.,  2.,  1.], dtype=float32)

drop_fields produces a similar result to the multi-field indexing that @David suggests. However, that multi-field indexing is poorly developed, as you will see if you try some sort of assignment.

