I'm trying to instantiate my test set for classification, loading a dataset with 41 features and 1 label:

import numpy as np

f = open("mydataset")
dataset = np.genfromtxt(f, delimiter=',', dtype=None)

X = dataset[:, 0:41]  # select columns 1 through 41 (the features)
y = dataset[:, 41]  # select column 42 (the labels)

Since mydataset is not homogeneous (not all columns have the same type), genfromtxt returns a 1D structured array (one record per row, which looks like a list of tuples) rather than a 2D array. So I get this error:

X = dataset[:, 0:41]  # select columns 1 through 41 (the features)
IndexError: too many indices for array

How can I solve this? Do I have to convert the NumPy array to 2D (and if so, how)? Or is there another way to select the right columns?

Thanks

1 Answer

You could define a compound dtype:

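# 41 float features packed into one subarray field, plus a string label field
dt = np.dtype([('values', float, (41,)), ('labels', 'S10')])
# pass the file name so the file is read from the start
data = np.genfromtxt("mydataset", delimiter=',', dtype=dt)
X = data['values']  # 2D float array, shape (n_rows, 41)
y = data['labels']  # 1D array of labels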

(not tested because I don't have a sample array this size).
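Here is a minimal sketch of the same compound-dtype idea at a smaller scale, with 3 features instead of 41, so it can be checked without the real file. The in-memory CSV text, the 'U10' label type and the explicit encoding argument are assumptions made for illustration.

```python
import io
import numpy as np

# Hypothetical stand-in for "mydataset": 3 float features and 1 string label per row.
csv_text = "1.0,2.0,3.0,cat\n4.0,5.0,6.0,dog\n"

# One subarray field holding all the features, one field for the label.
dt = np.dtype([('values', float, (3,)), ('labels', 'U10')])

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=dt,
                     encoding='utf-8')

X = data['values']  # 2D float array, one row per record
y = data['labels']  # 1D array of label strings
```

With the real data, the subarray length would be 41, and every feature column would have to parse as a float.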

And as I describe in a recent answer, https://stackoverflow.com/a/37126091/901925,

you could convert the dtype=None data to this compound dtype with

data.view(dt)

though that requires that all the numbers be loaded as floats (or all as ints). CSVs often mix float and integer columns, so the numeric fields produced by a dtype=None genfromtxt call will be a mix of types.
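When the numeric fields are a mix of ints and floats, one possible alternative to view is numpy.lib.recfunctions.structured_to_unstructured (available in NumPy 1.16 and later), which casts the selected fields to a common type. The sketch below assumes a small made-up CSV with the label in the last column; the sample text and variable names are illustrative only.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical sample: int, float, int feature columns, then a string label.
csv_text = "1,2.5,30,yes\n4,5.5,60,no\n"

# dtype=None lets genfromtxt pick a type per column, giving a 1D structured array.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=None,
                     encoding='utf-8')

names = data.dtype.names          # auto-generated field names: 'f0', 'f1', ...
feature_names = list(names[:-1])  # every field except the last
label_name = names[-1]            # the last field holds the label

# Cast the mixed int/float feature fields to one common dtype in a 2D array.
X = rfn.structured_to_unstructured(data[feature_names])
y = data[label_name]
```

This only helps when all the selected fields are numeric; string-valued features would need to be encoded as numbers first.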

Borrowing from that other answer, a general structured array might look like:

In [421]: data=np.array([('label1', 12, 23.2, 232.0), ('label2', 23, 2324.0, 324.0),
       ('label3', 34, 123.0, 2141.0), ('label4', 0, 2.0, 3.0)], 
      dtype=[('f0', '<U10'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<f8')])

4 fields with different dtypes.

Individual fields can be accessed by name: data['f0'], or with a list of names: data[['f0','f3']]. But what you can do with a list of names is limited, at least in the NumPy version used for this session:

In [426]: data[['f2','f3']]=10
...
ValueError: multi-field assignment is not supported

You can do more if you make a copy, and more still if you view it as a homogeneous array:

In [427]: d23=data[['f2','f3']].copy()

In [428]: d23
Out[428]: 
array([(23.2, 232.0), (2324.0, 324.0), (123.0, 2141.0), (2.0, 3.0)], 
      dtype=[('f2', '<f8'), ('f3', '<f8')])

In [429]: d23=d23.view((float,(2,)))

In [430]: d23
Out[430]: 
array([[  2.32000000e+01,   2.32000000e+02],
       [  2.32400000e+03,   3.24000000e+02],
       [  1.23000000e+02,   2.14100000e+03],
       [  2.00000000e+00,   3.00000000e+00]])

In [431]: d23+=34

In [432]: d23
Out[432]: 
array([[   57.2,   266. ],
       [ 2358. ,   358. ],
       [  157. ,  2175. ],
       [   36. ,    37. ]])

(changes to d23 do not affect the original data).
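In NumPy 1.16 and later, a multi-field index such as data[['f2','f3']] keeps the original record layout, so the copy-then-view step shown in this session may fail with a size mismatch. A hedged sketch of an equivalent for newer versions uses structured_to_unstructured with copy=True, which is my substitution rather than part of the session above:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Same sample structured array as in the session above.
data = np.array([('label1', 12, 23.2, 232.0), ('label2', 23, 2324.0, 324.0),
                 ('label3', 34, 123.0, 2141.0), ('label4', 0, 2.0, 3.0)],
                dtype=[('f0', '<U10'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<f8')])

# copy=True guarantees a new 2D float array that does not share memory with data.
d23 = rfn.structured_to_unstructured(data[['f2', 'f3']], copy=True)

d23 += 34  # modifies only the copy

original_f2 = data['f2']  # still holds the values loaded above
```

Without copy=True the function is allowed to return a view when the field layout permits it, in which case the in-place addition could change data as well.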

4 Comments

I also have another problem: the 41 features are not homogeneous either (some of them are strings).
I added some examples of accessing a structured array as might be produced with dtype=None.
Thanks @hpaulj! What do you think about using a for-loop to build a list of lists (passing one row at a time) and then converting it into a NumPy array? Does that work well with fields of different types?
It may be best if you ask that last one in a new question - with an example.
