Arrays of strings into numpy.amax

Question

With Python's built-in max function I can compare strings, and I can also pass in a key parameter:

import numpy as np

s = np.array(['one','two','three'])
max(s) # 'two' (lexicographically last)
max(s, key=len) # 'three' (longest string)

With a larger (multi-dimensional) array I can no longer use max, so I tried numpy.amax; however, I can't seem to get amax to work with strings...

t = np.array([['one','two','three'],['four','five','six']])
t.dtype # dtype('|S5')
np.amax(t, axis=1) # Error! Hoping for: ['two', 'six']

Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 1833, in amax
        return amax(axis, out)
TypeError: cannot perform reduce with flexible type

Is it possible to use amax here (am I using it incorrectly?), or is there some other numpy tool to do this?

Peter Sobot · Accepted Answer · 2012-09-29 15:58:58Z

6

Instead of storing your strings as fixed-width string data in the numpy array (the '|S5' dtype), you could try storing them as Python objects instead. Numpy will then hold references to the original Python string objects, and reductions such as min and max fall back to ordinary Python comparison, so they behave as you might expect:

t = np.array([['one','two','three'],['four','five','six']], dtype=object)
np.min(t)
# gives 'five'
np.max(t)
# gives 'two'
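
The same reductions also accept an axis, which is what the question was after. A minimal sketch, assuming the object-dtype array above (the names row_max and col_max are only illustrative):

```python
import numpy as np

# Object dtype: each cell holds a reference to a Python str, so the
# reduction falls back to ordinary Python string comparison.
t = np.array([['one', 'two', 'three'], ['four', 'five', 'six']], dtype=object)

row_max = np.max(t, axis=1)  # lexicographic maximum of each row
col_max = np.max(t, axis=0)  # lexicographic maximum of each column

print(row_max)
print(col_max)
```

With this input, the row-wise maximum holds 'two' and 'six', while the column-wise maximum holds 'one', 'two' and 'three'.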

Keep in mind that here the np.min and np.max calls order the strings lexicographically, so "two" does indeed come after "five". To compare by length instead, you could build a second array of the same shape that holds each string's length rather than the string itself. You could then call argmin on that array (which returns the index of the minimum) and use that index to look up the string in the original array.

Example code:

# np.vectorize wraps a Python function so that it is applied to
# every element of an array (a convenience, not a speed-up)
np_len = np.vectorize(len)

np_len(t)
# gives array([[3, 3, 5], [4, 4, 3]])

idx = np_len(t).argmin(0) # row index of the shortest string in each column
# gives array([0, 0, 1])

# for each column, pick the entry from the row that idx selected
result = t[idx, np.arange(t.shape[1])]
print(result)
# gives ['one' 'two' 'six'], the shortest string in each column
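
The question's key=len case (the longest string) follows the same pattern with argmax. This is only a sketch: longest_along is a hypothetical helper name, not a numpy function, and np.take_along_axis needs numpy 1.15 or newer.

```python
import numpy as np

t = np.array([['one', 'two', 'three'], ['four', 'five', 'six']], dtype=object)

# Array of the same shape holding each string's length.
lengths = np.vectorize(len)(t)

def longest_along(arr, lens, axis):
    """Hypothetical helper: pick the longest string along `axis`."""
    # argmax returns the first position on ties; keep the reduced axis
    # so the index array lines up with arr for take_along_axis.
    idx = np.expand_dims(lens.argmax(axis=axis), axis=axis)
    return np.take_along_axis(arr, idx, axis=axis).squeeze(axis=axis)

per_row = longest_along(t, lengths, axis=1)
per_col = longest_along(t, lengths, axis=0)

# Longest string in the whole array: map the flat argmax back to 2-D.
overall = t[np.unravel_index(lengths.argmax(), t.shape)]
```

Ties go to the first occurrence, so the second row yields 'four' rather than 'five'. The length array is still built by a Python-level loop, so this is no faster than iterating by hand.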

edited Sep 29, 2012 at 15:58

answered Sep 29, 2012 at 15:48

4 Comments

Andy Hayden Over a year ago

Is there a reason why dtype='|S5' is the default rather than 'object'? (I had thought that was the issue :) ). Building a copy of t for each key is going to create a lot of extra arrays; when the arrays are huge, this seems like an indirect solution...

Peter Sobot Over a year ago

When you create a numpy array of strings, each is taken as a literal numpy string - just a number of consecutive bytes. As numpy arrays have to (or strive to) have the same size for each object, the default in this case is '|S5' - a string of length 5 - which is the longest string in your input.
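
The fixed-width behaviour described above can be checked directly. A small sketch, using byte strings to mirror the Python 2 session in the question (on Python 3, plain str input would produce a fixed-width unicode dtype instead):

```python
import numpy as np

# Byte strings, as in the Python 2 session from the question.
a = np.array([b'one', b'two', b'three'])
print(a.dtype)     # fixed-width bytes, sized to the longest input
print(a.itemsize)  # 5 bytes per element

# A slot cannot grow: longer values are silently truncated on assignment.
b = np.array([b'one', b'two'])  # 3 bytes per element
b[0] = b'three'
print(b[0])        # b'thr'

# Object dtype stores references instead, so lengths are unrestricted.
c = np.array([b'one', b'two'], dtype=object)
c[0] = b'three'
print(c[0])        # b'three'
```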

Peter Sobot Over a year ago

If the input arrays are huge, well... yes, it is an indirect solution. Keep in mind that running np_len(t).argmin(0) doesn't save the intermediate array, although it still requires Python to iterate over each element in the interpreter.

Andy Hayden Over a year ago

Also (for the key), if axis is set then this is going to have to be slightly different (surely there is a built-in way...). I think I will ask this part as a separate question; it was actually my original question, had this toy example behaved!
