How to copy dtype when doing numpy array assignment or when appending to a numpy array

Question

I'm pretty illiterate in using Python/numpy.

I have the following piece of code:

data = np.array([])

for i in range(10):
    data = np.append(data, GetData())

return data

GetData() returns a numpy array with a custom dtype. However, when I execute the code above, the numbers are converted to float64, which I suspect is the culprit for other issues I'm having. How can I copy/append the output of the function while preserving the dtype as well?

Look at the original data array. Some functions let you specify the dtype. Read the docs.
– hpaulj, Commented Feb 10, 2022 at 16:11
When you declare a numpy array, you declare it with a type (which here by default is float64). If you append anything to that array, it will be converted to that type. If GetData() returns different types, you will not be able to keep them inside the same array. To declare the numpy array with a specific type, you can do for instance: data = np.array([], dtype=np.int32).
– Jenny, Commented Feb 10, 2022 at 16:15
Thanks. I think both of your comments say that I have to know the type a priori and can't rely on automatic conversion. That's a bummer, because I'm using a helper function that may get data that is int64 or float64, etc. I was hoping the same helper function would work without hard-coding the dtype.
– Amir, Commented Feb 10, 2022 at 16:23
Why don't you create the array once you know which type you need it to be?
– Jenny, Commented Feb 10, 2022 at 16:25
Darn. I'm a Python noob. How do I do that? How do I copy the dtype?
– Amir, Commented Feb 10, 2022 at 16:29
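
A minimal sketch of what "create the array once you know the type" could look like, assuming every chunk shares one dtype. The function get_data below is a hypothetical stand-in for GetData(), not the asker's real helper:

```python
import numpy as np

def get_data():
    # Hypothetical stand-in for GetData(); returns a small int32 array.
    return np.array([1, 2, 3], dtype=np.int32)

# Fetch the first chunk, then create the empty array with that chunk's dtype.
first = get_data()
data = np.empty(0, dtype=first.dtype)
data = np.append(data, first)

# Later chunks with the same dtype no longer trigger a conversion.
for _ in range(2):
    data = np.append(data, get_data())

# For comparison: starting from np.array([]) promotes the result to float64.
promoted = np.append(np.array([]), get_data())

print(data.dtype)      # int32
print(promoted.dtype)  # float64
```

This only helps when the dtype never changes between calls. If it does, np.append silently promotes again.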

marc_s · Accepted Answer · 2022-02-19 18:04:46Z

Given the comments stating that you will only know the type of data once you run GetData(), and that multiple types are expected, you could do it like so:

# [...]

dataByType = {} # dictionary to store the dtypes encountered and the arrays with given dtype

for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty array with that dtype and store it in the dict
        dataByType[newData.dtype] = np.array([], dtype=newData.dtype)
    # Append the new data to the corresponding array in dict, depending on dtype
    dataByType[newData.dtype] = np.append(dataByType[newData.dtype], newData)

Taking into account hpaulj's answer, if you wish to conserve the different types you might encounter without creating a new array at each iteration you can adapt the above to:

# [...]

dataByType = {} # dictionary to store the dtypes encountered and the list storing data with given dtype

for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty list for that dtype and store it in the dict
        dataByType[newData.dtype] = []
    # Append the new data to the corresponding list in dict, depending on dtype
    dataByType[newData.dtype].append(newData)

# At this point, you have all your data pieces stored according to their original dtype inside the dataByType dictionary.
# Now if you wish you can convert them to numpy arrays as well

# Either by concatenation, updating what is stored in the dict
for dataType in dataByType:
    dataByType[dataType] = np.concatenate(dataByType[dataType])
    # No need to specify the dtype in concatenate here, since previous step ensures all data pieces are the same type

# Or by creating array directly, to store each data piece at a different index
for dataType in dataByType:
    dataByType[dataType] = np.array(dataByType[dataType])
    # As for concatenate, no need to specify the dtype here

A little example:

import numpy as np

# to get something similar to GetData in the example structure:
getData = [
    np.array([1.,2.], dtype=np.float64),
    np.array([1,2], dtype=np.int64),
    np.array([3,4], dtype=np.int64),
    np.array([3.,4.], dtype=np.float64)
    ] # dtype specified here for clarity, but not needed


dataByType = {}

for i in range(len(getData)):
    newData = getData[i]
    if newData.dtype not in dataByType:
        dataByType[newData.dtype] = []
    dataByType[newData.dtype].append(newData)

print(dataByType) # output formatted below for clarity
# {dtype('float64'): 
#     [array([1., 2.]), array([3., 4.])],
#  dtype('int64'): 
#     [array([1, 2], dtype=int64), array([3, 4], dtype=int64)]}

Now if we use concatenate on that dataset, we get 1D arrays that keep the original type (dtype=float64 is not shown in the output since it is the default type for floating-point values):

for dataType in dataByType:
    dataByType[dataType] = np.concatenate(dataByType[dataType])

print(dataByType) # once again output formatted for clarity
# {dtype('float64'):
#      array([1., 2., 3., 4.]),
#  dtype('int64'):
#      array([1, 2, 3, 4], dtype=int64)}

And if we use array on the lists instead, we get 2D arrays:

for dataType in dataByType:
    dataByType[dataType] = np.array(dataByType[dataType])

print(dataByType)
# {dtype('float64'): 
#      array([[1., 2.],
#             [3., 4.]]),
#  dtype('int64'): 
#      array([[1, 2],
#             [3, 4]], dtype=int64)}

Important thing to note: using array will not work as intended if the arrays to combine do not all have the same shape:

import numpy as np

# NumPy 1.24 and later raise a ValueError for ragged input
# unless dtype=object is passed explicitly.
print(repr(np.array([
    np.array([1, 2, 3]),
    np.array([4, 5])], dtype=object)))
# array([array([1, 2, 3]), array([4, 5])], dtype=object)

You get a 1D array of dtype object whose elements are, in this case, the original arrays of different lengths.

A variation of this solution actually fixed my issue. That said, the code looks ugly, as I need to differentiate between the first time I see a dtype and subsequent appends. I'm wondering if there is a better-looking solution out there. This definitely works and I'm grateful for your answer.
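
One possible way to avoid the "first time vs. subsequent" branch is collections.defaultdict, which creates the empty list on first access. This is only a sketch. The chunks list below is a hypothetical stand-in for repeated GetData() calls:

```python
from collections import defaultdict

import numpy as np

# Hypothetical stand-in for the arrays returned by repeated GetData() calls.
chunks = [
    np.array([1.0, 2.0]),
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4], dtype=np.int64),
    np.array([3.0, 4.0]),
]

# defaultdict(list) creates the empty list the first time a dtype is seen,
# so no explicit "not in" check is needed.
by_dtype = defaultdict(list)
for chunk in chunks:
    by_dtype[chunk.dtype].append(chunk)

# One concatenate per dtype; each group is homogeneous, so no promotion occurs.
merged = {dt: np.concatenate(parts) for dt, parts in by_dtype.items()}

for dt, arr in merged.items():
    print(dt, arr)
```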

hpaulj · Accepted Answer · 2022-02-10 17:03:58Z

Your use of [] and append indicates that you are naively copying this common list idiom:

alist = []
for x in another_list:
    alist.append(x)

Your data is not a clone of the [] list:

In [220]: np.array([])
Out[220]: array([], dtype=float64)

It's an array with shape (0,) and dtype float.

np.append is not a list append clone. I stress that because too many new users make that mistake, and the result is many different errors. It is really just a cover for np.concatenate, one that takes 2 arguments instead of a list of arguments. As the docs stress, it returns a new array, and when used iteratively, that means a lot of copying.

It is best to collect your arrays in a list and pass that list to concatenate. List append works in place, so it is better suited to iterative use. If you give concatenate a list of arrays, the resulting dtype will be the common one (or whatever type promotion requires). NumPy 1.20 and later also let you specify dtype when calling concatenate.
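
As a rough illustration of that advice, the sketch below collects arrays in a list and concatenates once. The arrays are made up for the example. np.result_type is used only to preview the dtype that promotion would produce:

```python
import numpy as np

# Collect the pieces in a plain list; list.append does not touch their dtype.
pieces = []
for i in range(3):
    pieces.append(np.arange(i, i + 2, dtype=np.int16))

# All pieces share int16, so the concatenated result is int16 as well.
same = np.concatenate(pieces)

# Adding a float32 piece forces promotion to a common dtype.
mixed = pieces + [np.array([0.5], dtype=np.float32)]
expected = np.result_type(*[p.dtype for p in mixed])
promoted = np.concatenate(mixed)

print(same.dtype)      # int16
print(expected)        # float32
print(promoted.dtype)  # float32
```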

Keep the numpy documentation at hand (Python's too, if necessary), and look up functions. Pay attention to how they are called, including the keyword parameters. And practice with small examples. I keep an interactive Python session at hand, even when writing answers.

When working with arrays, pay close attention to shape and dtype. Don't make assumptions.

Concatenating 2 int arrays:

In [238]: np.concatenate((np.array([1,2]),np.array([4,3])))
Out[238]: array([1, 2, 4, 3])

Making one a float array (just by adding a decimal point to one number):

In [239]: np.concatenate((np.array([1,2]),np.array([4,3.])))
Out[239]: array([1., 2., 4., 3.])

It won't let me change the result to int:

In [240]: np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
Traceback (most recent call last):
  File "<ipython-input-240-91b4e3fec07a>", line 1, in <module>
    np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
  File "<__array_function__ internals>", line 180, in concatenate
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'same_kind'
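
If the float-to-int conversion is really wanted, one option is to relax the casting rule. The sketch below assumes NumPy 1.20 or later, where concatenate accepts the dtype and casting keywords. It also shows the two-step alternative with astype:

```python
import numpy as np

ints = np.array([1, 2])
floats = np.array([4, 3.0])

# casting="unsafe" permits the float -> int conversion that the default
# "same_kind" rule rejects; any fractional part would be truncated.
forced = np.concatenate((ints, floats), dtype=int, casting="unsafe")

# Equivalent two-step route: concatenate first, convert afterwards.
converted = np.concatenate((ints, floats)).astype(int)

print(forced)     # [1 2 4 3]
print(converted)  # [1 2 4 3]
```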

If an element is a string, the result is also a string dtype:

In [241]: np.concatenate((np.array([1,2]),np.array(['4',3.])))
Out[241]: array(['1', '2', '4', '3.0'], dtype='<U32')

Sometimes it is necessary to adjust dtypes after a calculation:

In [243]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float)
Out[243]: array([1., 2., 4., 3.])
In [244]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float).astype(int)
Out[244]: array([1, 2, 4, 3])

edited Feb 10, 2022 at 17:03

answered Feb 10, 2022 at 16:54

hpaulj


2 Comments

Amir Over a year ago

Thanks for that info. Performance is not necessarily my biggest concern right now. What I was trying to figure out from your answer is how to preserve the dtype by converting to an array and iteratively adding to it before calling concatenate on all the pieces. Can you give me a code snippet for the problem I'm trying to solve? That will probably help me better understand your point.

hpaulj Over a year ago

Adding pieces to a list preserves dtype, since the list just contains references to those arrays. It doesn't copy anything. By trying to add pieces to an existing array iteratively, you not only have the copying issue, but you have to make sure the dtypes are right - including the initial "blank". If you don't know anything about the arrays to start with, don't do the iterative concatenate. Performance or not, building the list first is the most robust approach.
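
A small sketch of the point about references, using a made-up int8 array. The list holds the very same array object, and data is copied only when concatenate runs:

```python
import numpy as np

a = np.array([1, 2], dtype=np.int8)

pieces = []
pieces.append(a)

# The list stores a reference to the same array object; nothing is copied.
same_object = pieces[0] is a

# The dtype is untouched because the array itself was never rebuilt.
kept_dtype = pieces[0].dtype

# Copying happens once, at concatenate time, into a new array.
joined = np.concatenate(pieces)
shares = np.shares_memory(joined, a)

print(same_object)  # True
print(kept_dtype)   # int8
print(shares)       # False
```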
