I'm trying to replicate the format of an existing data file which has the following class structure when loaded with np.load:

<class 'numpy.ndarray'>
    <class 'list'>
        <class 'list'>
           <class 'numpy.str_'>

It is an ndarray whose elements are lists of lists of strings.

I'm using the following code to build the same structure as a list of lists of lists of strings. I want to convert only the outermost list into an ndarray, without also converting the inner lists into ndarrays.

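import random
import numpy as np

captions = []
for row in attrs.iterrows():
    # row is an (index, Series) pair; sort the Series values from highest to lowest
    sorted_row = row[1].sort_values(ascending=False)

    # look up the word2Id entry for each of the top 20 columns
    attributes, variations = [], []
    for col, val in sorted_row[:20].items():
        attributes.append([x[1] for x in word2Id if x[0] == col][0])
    variations.append(attributes)

    # add 9 shuffled copies of the attribute list
    for i in range(9):
        variations.append(random.sample(attributes, len(attributes)))

    captions.append(variations)

np.save('train_captions.npy', captions)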

When I open the resulting npy file, the class hierarchy is like this:

<class 'numpy.ndarray'>
    <class 'numpy.ndarray'>
        <class 'numpy.ndarray'>
           <class 'numpy.str_'>

How can I store captions in the code above so that the saved file has the same structure as the one at the very top?

1
  • np.save can only save numpy arrays. When given the list, it first does np.array(captions). That turns the nested lists into a multidimensional array. Constructing an array of lists is tricky, especially if the lists all have the same size. Look at the array dtype and shape rather than the class hierarchy. Commented May 2, 2018 at 5:32
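To follow the comment's suggestion of checking dtype and shape rather than the class hierarchy, here is a minimal sketch. The `captions` value is a made-up stand-in (2 rows, 3 variations, 4 strings each), not the asker's data.

```python
import numpy as np

# Hypothetical stand-in for `captions`: 2 rows x 3 variations x 4 strings.
captions = [[["a", "b", "c", "d"] for _ in range(3)] for _ in range(2)]

# The conversion the comment describes: nested lists become one array.
arr = np.array(captions)

print(arr.shape)  # (2, 3, 4)
print(arr.dtype)  # a fixed-width Unicode string dtype

# Indexing a multidimensional array yields sub-arrays, which is why the
# "class hierarchy" shows ndarray at every level above the strings.
print(type(arr[0]), type(arr[0][0]), type(arr[0][0][0]))
```

A three-dimensional string array like this is a single ndarray, not nested ndarrays, even though indexing makes it look nested.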

2 Answers

2
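import numpy as np

letters = ["a", "b", "c", "d"]  # named `letters` so the built-in `list` is not shadowed
np.save('list.npy', letters)
read_list = np.load('list.npy').tolist()  # convert the loaded ndarray back to a list
print(read_list, type(read_list))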

>>>['a', 'b', 'c', 'd'] <class 'list'>

If we don't use .tolist() the result is:

['a' 'b' 'c' 'd'] <class 'numpy.ndarray'>
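The same approach also applies to nested lists, since `.tolist()` converts every level of the array. The sketch below uses an in-memory buffer instead of a file on disk, and the `nested` data is invented for illustration.

```python
import io
import numpy as np

# Hypothetical nested data: list of lists of lists of strings.
nested = [[["a", "b"], ["c", "d"]], [["e", "f"], ["g", "h"]]]

buf = io.BytesIO()    # in-memory stand-in for a .npy file
np.save(buf, nested)  # stored as a (2, 2, 2) array of strings
buf.seek(0)

restored = np.load(buf).tolist()  # back to nested Python lists

print(restored == nested)
print(type(restored), type(restored[0]), type(restored[0][0]), type(restored[0][0][0]))
```

Note that this produces a plain list at the outermost level too, so it does not by itself give an ndarray that contains lists.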

1

When I try to replicate your code (more or less):

In [273]: captions = []
In [274]: for r in range(2):
     ...:     attributes, variations = [], []
     ...:     for c in range(2):
     ...:         attributes.append([i for i in ['a','b','c']])
     ...:     variations.append(attributes)
     ...:     for i in range(2):
     ...:         variations.append(random.sample(attributes, len(attributes)))
     ...:     captions.append(variations)
     ...:         
In [275]: captions
Out[275]: 
[[[['a', 'b', 'c'], ['a', 'b', 'c']],
  [['a', 'b', 'c'], ['a', 'b', 'c']],
  [['a', 'b', 'c'], ['a', 'b', 'c']]],
 [[['a', 'b', 'c'], ['a', 'b', 'c']],
  [['a', 'b', 'c'], ['a', 'b', 'c']],
  [['a', 'b', 'c'], ['a', 'b', 'c']]]]

The list has several levels of nesting. When passed to np.array, the result is a 4d array of strings:

In [276]: arr = np.array(captions)
In [277]: arr.shape
Out[277]: (2, 3, 2, 3)
In [278]: arr.dtype
Out[278]: dtype('<U1')

Where possible, np.array tries to make an array with as many dimensions as it can.
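As a small illustration of that behaviour, passing `dtype=object` to `np.array` does not stop it from descending into the nested lists. The sample data below is made up.

```python
import numpy as np

# Hypothetical sample: 2 rows x 3 variations x 3 strings.
captions = [[["a", "b", "c"] for _ in range(3)] for _ in range(2)]

as_strings = np.array(captions)
as_objects = np.array(captions, dtype=object)

print(as_strings.shape)                    # (2, 3, 3)
print(as_objects.shape, as_objects.dtype)  # (2, 3, 3) object

# Even with dtype=object, the first element is a sub-array, not a list.
print(type(as_objects[0]))
```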

To make an array of lists, we have to do something like:

In [279]: arr = np.empty(2, dtype=object)
In [280]: arr[0] = captions[0]
In [281]: arr[1] = captions[1]
In [282]: arr
Out[282]: 
array([list([[['a', 'b', 'c'], ['a', 'b', 'c']], [['a', 'b', 'c'], ['a', 'b', 'c']], [['a', 'b', 'c'], ['a', 'b', 'c']]]),
       list([[['a', 'b', 'c'], ['a', 'b', 'c']], [['a', 'b', 'c'], ['a', 'b', 'c']], [['a', 'b', 'c'], ['a', 'b', 'c']]])],
      dtype=object)
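The element-by-element assignment above can be wrapped in a small helper and checked with a save/load round trip. This is only a sketch: `to_object_array` is a hypothetical name, the data is invented, and an in-memory buffer stands in for the .npy file. Object arrays are stored with pickle, so loading them needs `allow_pickle=True`.

```python
import io
import numpy as np

def to_object_array(items):
    """Put each element of `items` into a 1-D object array without unpacking it."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item  # assign one slot at a time so nested lists stay lists
    return out

# Hypothetical sample: 2 rows x 3 variations x 3 strings.
captions = [[["a", "b", "c"] for _ in range(3)] for _ in range(2)]
arr = to_object_array(captions)

buf = io.BytesIO()  # in-memory stand-in for a .npy file
np.save(buf, arr)   # object arrays are written using pickle
buf.seek(0)
loaded = np.load(buf, allow_pickle=True)

print(loaded.shape, loaded.dtype)  # (2,) object
print(type(loaded), type(loaded[0]), type(loaded[0][0]), type(loaded[0][0][0]))
```

The innermost elements come back as plain `str` here rather than `numpy.str_`. If that distinction matters, wrapping each string in `np.str_` before saving is one option to try.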

1 Comment

Thanks, this worked. However, using the nested ndarrays works fine for the model I'm training anyway.
