With Python's built-in max function, I can also pass in a key parameter:

import numpy as np

s = np.array(['one','two','three'])
max(s) # 'two' (lexicographically last)
max(s, key=len) # 'three' (longest string)

With a larger (multi-dimensional) array, I can no longer use max, so I tried numpy.amax. However, I can't seem to use amax with strings:

t = np.array([['one','two','three'],['four','five','six']])
t.dtype # dtype('|S5')
np.amax(t, axis=1) # Error! Hoping for: ['two', 'six']

Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 1833, in amax
        return amax(axis, out)
TypeError: cannot perform reduce with flexible type

Is it possible to use amax here (am I using it incorrectly?), or is there some other numpy tool for this?

1 Answer

Instead of storing your strings as fixed-width string data in the numpy array, you could try storing them as Python objects. Numpy will then hold references to the original Python string objects, and the reductions behave the way you might expect:

t = np.array([['one','two','three'],['four','five','six']], dtype=object)
np.min(t)
# gives 'five'
np.max(t)
# gives 'two'

Keep in mind that the np.min and np.max calls here order the strings lexicographically, so "two" does indeed come after "five". To compare by the length of each string instead, you could create a new numpy array of the same shape that contains each string's length rather than the string itself. You could then call argmin on that array (which returns the index of the minimum) and look up the corresponding string in the original array.


Example code:

# np.vectorize takes a Python function and wraps it so that it is
# applied element by element to a Numpy array
np_len = np.vectorize(len)

np_len(t)
# gives array([[3, 3, 5], [4, 4, 3]])

idx = np_len(t).argmin(0) # row index of the shortest string in each column (axis 0)
# gives array([0, 0, 1])

# for every column, pick the row that argmin selected
result = t[idx, np.arange(t.shape[1])]
print(result)
# gives ['one' 'two' 'six'], the shortest string in each column
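
To get closer to what the question asks for (a per-axis max, optionally with a key), something along these lines might work. This is only a sketch: max_by_key is a hypothetical helper name, not a numpy function, and it assumes a numpy version that provides np.take_along_axis (1.15 or later).

```python
import numpy as np

# Same data as in the question, stored as Python objects so reductions work.
t = np.array([['one', 'two', 'three'], ['four', 'five', 'six']], dtype=object)

# Lexicographic maximum of each row (axis=1) and of each column (axis=0).
row_max = np.max(t, axis=1)
col_max = np.max(t, axis=0)

def max_by_key(arr, key, axis):
    """Hypothetical helper: like max(..., key=key), but along one axis."""
    keys = np.vectorize(key)(arr)           # e.g. the length of every string
    idx = np.argmax(keys, axis=axis)        # position of the largest key
    idx = np.expand_dims(idx, axis=axis)    # restore the reduced axis
    picked = np.take_along_axis(arr, idx, axis=axis)
    return picked.squeeze(axis=axis)

# Longest string in each row; ties go to the first occurrence.
longest_per_row = max_by_key(t, len, axis=1)
```

Note that np.argmax returns the first index on ties, so the second row yields 'four' rather than 'five'. The key function is still called once per element in the interpreter, so this is a convenience rather than a speed-up.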

4 Comments

Is there a reason why dtype='|S5' is the default rather than 'object'? (I had thought that was the issue :) ) Copying t for each key is going to create a lot of extra arrays, which seems like an indirect solution, especially when the arrays are huge...
When you create a numpy array of strings, each is taken as a literal numpy string - just a number of consecutive bytes. As numpy arrays have to (or strive to) have the same size for each object, the default in this case is '|S5' - a string of length 5 - which is the longest string in your input.
If the input arrays are huge, well... yes, it is an indirect solution. Keep in mind that np_len(t).argmin(0) only creates the intermediate array temporarily and does not keep it around, although Python still has to iterate over each element in the interpreter.
Also, for the key, if axis is set then this will have to be slightly different (surely there is a built-in way...). I think I will ask that part as a separate question; it would have been my original question had this toy example behaved!
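
For the points raised in the comments, here is a rough sketch that keeps the default fixed-width dtype and avoids a Python-level lambda by using np.char.str_len for the lengths. The variable names are made up for illustration, and whether this is actually faster on huge arrays is something I have not measured.

```python
import numpy as np

# Without dtype=object, numpy picks a fixed-width string type that is
# wide enough for the longest element ('three' has 5 characters).
fixed = np.array([['one', 'two', 'three'], ['four', 'five', 'six']])

# Characters per slot: total item size divided by the size of one character.
one_char = np.dtype(fixed.dtype.kind + '1')
width = fixed.dtype.itemsize // one_char.itemsize

# Element-wise string lengths, computed by numpy itself.
lengths = np.char.str_len(fixed)

# Column index of the longest string in each row, then look it up.
idx = lengths.argmax(axis=1)
longest_per_row = fixed[np.arange(fixed.shape[0]), idx]
```

The lengths array is still an extra array of the same shape as the input, so the memory concern from the comments remains; only the per-element Python call goes away.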
