There are three main dtypes for storing strings in numpy:
object: Stores pointers to Python objects
str: Stores fixed-width strings
numpy.dtypes.StringDType(): New in numpy 2.0; stores variable-width strings
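A quick sketch of how each of these is spelled (the example strings here are arbitrary):

import numpy as np

a_obj = np.array(['spam', 'ham'], dtype=object)                    # pointers to Python str objects
a_fix = np.array(['spam', 'ham'], dtype=str)                       # fixed width, padded to the longest string
a_var = np.array(['spam', 'ham'], dtype=np.dtypes.StringDType())   # variable width, numpy >= 2.0 only

print(a_obj.dtype, a_fix.dtype, a_var.dtype)  # object <U4 StringDType()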
str consumes more memory than object; StringDType is better than str
The exact ratio depends on the length of the longest string and the size of the array, but str consumes more memory whenever the longest string in the array is longer than 2 characters (the two are equal when the longest string is exactly 2 characters long). The reason is that str is a fixed-width dtype: every element reserves 4 bytes per character (UCS-4) for the length of the longest string in the array, however short that element actually is. In the benchmark below, str consumes almost 8 times more memory.
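The padding is visible on the dtype itself; a small self-contained illustration:

import numpy as np

padded = np.array(['this is a string', 'string'], dtype=str)
print(padded.dtype)     # <U16 -> every element is padded to 16 characters
print(padded.itemsize)  # 64   -> 16 characters * 4 bytes per UCS-4 code point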
On the other hand, the new (in numpy>=2.0) numpy.dtypes.StringDType stores variable-width strings, so it consumes much less memory than str.
import numpy as np
from pympler.asizeof import asizeof

ar1 = np.array(['this is a string', 'string']*1000, dtype=object)
ar2 = np.array(['this is a string', 'string']*1000, dtype=str)
ar3 = np.array(['this is a string', 'string']*1000, dtype=np.dtypes.StringDType())

# asizeof measures deep size, including the Python str objects ar1 points to
asizeof(ar2) / asizeof(ar1) # 7.944444444444445
asizeof(ar3) / asizeof(ar1) # 1.992063492063492
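pympler's asizeof is used above because it sizes objects deeply; the plain nbytes attribute would be misleading for the object array, since it only counts the pointers (numbers below assume a 64-bit build):

ar1.nbytes # 16000  -> 2000 pointers * 8 bytes; the referenced str objects are not counted
ar2.nbytes # 128000 -> 2000 elements * 16 characters * 4 bytes each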
str is slower than object in numpy 1.x but faster in numpy>=2.0.0
For numpy>=2.0.0 (tested on numpy 2.2.0)
Numpy 2.0 introduced a new numpy.strings API with much more performant ufuncs for string operations. A simple test below shows that vectorized string operations on arrays of str or StringDType dtype are faster, sometimes by a large margin, than the same operations on an object dtype array.
import timeit

# string repetition: Python-level * on the object array vs the np.strings ufunc
t1 = min(timeit.repeat(lambda: ar1*2, number=1000))
t2a = min(timeit.repeat(lambda: np.strings.multiply(ar2, 2), number=1000))
t2b = min(timeit.repeat(lambda: np.strings.multiply(ar3, 2), number=1000))
print(t2a / t1) # 0.8786601958427778 -> str is ~12% faster than object
print(t2b / t1) # 0.7311586933668037 -> StringDType is ~27% faster than object

# substring counting: a Python-level loop vs the np.strings ufunc
t3 = min(timeit.repeat(lambda: np.array([s.count('i') for s in ar1]), number=1000))
t4a = min(timeit.repeat(lambda: np.strings.count(ar2, 'i'), number=1000))
t4b = min(timeit.repeat(lambda: np.strings.count(ar3, 'i'), number=1000))
print(t4a / t3) # 0.13328748153237377 -> str is ~7.5x faster than object
print(t4b / t3) # 0.3365874412749679  -> StringDType is ~3x faster than object
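Beyond multiply() and count(), the numpy.strings namespace exposes ufunc versions of many of the str methods, for example:

np.strings.upper(ar3)        # elementwise str.upper
np.strings.str_len(ar3)      # elementwise len()
np.strings.find(ar3, 'is')   # elementwise str.find; -1 where the substring is absent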
For numpy<2.0.0 (tested on numpy 1.26.0)
Numpy 1.x's vectorized string methods (in np.char) are not optimized, so operating on the object array is often faster. For example, for the OP's example where each string is repeated, a simple * on the object array (i.e. multiply()) is not only more concise but also over 10 times faster than np.char.multiply() on the str array.
import timeit

setup = "import numpy as np; from __main__ import ar1, ar2"
t1 = min(timeit.repeat("ar1*2", setup, number=1000))                     # * on the object array
t2 = min(timeit.repeat("np.char.multiply(ar2, 2)", setup, number=1000))  # np.char on the str array
t2 / t1 # 10.650433758517027 -> char.multiply() is over 10x slower
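Both spellings build the same strings, so the timing comparison is like for like; a quick sanity check (assuming ar1 and ar2 as defined above):

assert (ar1*2 == np.char.multiply(ar2, 2)).all()  # identical elementwise results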
Even for operations without an operator shortcut, it is faster to loop over the object array and work on the Python strings than to use the vectorized char methods of str arrays.
For example, iterating over the object array and calling str.count() on each Python string is over 3 times faster than the vectorized char.count() on the str array.
f1 = lambda: np.array([s.count('i') for s in ar1])  # Python loop over the object array
f2 = lambda: np.char.count(ar2, 'i')                # vectorized char method on the str array
setup = "import numpy as np; from __main__ import ar1, ar2, f1, f2"
t3 = min(timeit.repeat("f1()", setup, number=1000))
t4 = min(timeit.repeat("f2()", setup, number=1000))
t4 / t3 # 3.251369161574832 -> char.count() is over 3x slower
On a side note, when it comes to an explicit loop, iterating over a Python list is faster than iterating over a numpy array. So in the previous example, a further performance gain can be had by converting the object array to a list and iterating over that:
f3 = lambda: np.array([s.count('i') for s in ar1.tolist()])
#                                               ^^^^^^^^^ <--- convert to a list here
t5 = min(timeit.repeat("f3()", "from __main__ import f3", number=1000))
t3 / t5 # 1.2623498005294627 -> ~26% faster than looping over the array