read values from a text file using numpy loadtxt function

Question

I have a file with this form:

label1, value1, value2, value3,
label2, value1, value2, value3,
...

I want to read it with numpy's loadtxt function so that each label is stored together with its values. The final result should be an array of arrays, each holding the label and an array of features, like this:

array([[label1, [value1, value2, value3]],
       [label2, [value1, value2, value3]]])

I have tried the following, but it did not work:

c = StringIO(u"text.txt")
np.loadtxt(c,
   dtype={'samples': ('label', 'features'), 'formats': ('s9',np.float)},
   delimiter=',', skiprows=0)

Any ideas?

B. M. · Accepted Answer · 2016-05-09 21:02:45Z

3

The most modern and versatile way to do that is to use pandas, whose parser has many more options and handles labels.

Suppose your file contains:

A,7,5,1
B,4,2,7

Then :

In [29]: import pandas as pd
In [30]: df=pd.read_csv('data.csv',sep=',',header=None,index_col=0)

In [31]: df
Out[31]: 
   1  2  3
0         
A  7  5  1
B  4  2  7

You can now easily convert it to a structured (record) array:

In [32]: a=df.T.to_records(index=False); a
Out[32]: 
rec.array([(7, 4), (5, 2), (1, 7)], 
          dtype=[('A', '<i8'), ('B', '<i8')])

In [33]: a['A']
Out[33]: array([7, 5, 1], dtype=int64)

With loadtxt you will have to do a lot of low-level operations manually.
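For the layout asked about in the question (a label plus an array of features per row), the DataFrame already holds both pieces. A minimal sketch, assuming the two-row sample above and using an in-memory buffer in place of data.csv (the names sample, labels, features and records are only illustrative):

```python
import io

import pandas as pd

# In-memory stand-in for data.csv, holding the two sample rows from above.
sample = io.StringIO("A,7,5,1\nB,4,2,7\n")

# First column becomes the index (the labels); the rest are the values.
df = pd.read_csv(sample, header=None, index_col=0)

labels = df.index.to_numpy()    # 1-D array of labels
features = df.to_numpy()        # 2-D array, one row of values per label

# Record array with one field per label, as in the session above.
records = df.T.to_records(index=False)

for label, row in zip(labels, features):
    print(label, row)
```

If the real file ends every line with a trailing comma, as in the question's sample, read_csv will likely produce an extra empty column that has to be dropped first.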

edited May 9, 2016 at 21:02

answered May 9, 2016 at 20:57

B. M.


hpaulj · Accepted Answer · 2016-05-10 23:53:23Z

2

You are on the right track with defining the dtype. You are just missing the field shape.

I'll demonstrate:

A 'text' file - a list of lines (bytes in Py3):

In [95]: txt=b"""label1, 12, 23.2, 232
   ....: label2, 23, 2324, 324
   ....: label3, 34, 123, 2141
   ....: label4, 0, 2, 3
   ....: """

In [96]: txt=txt.splitlines()

A dtype with 2 fields, one with strings, the other with floats (3 for 'field shape'):

In [98]: dt=np.dtype([('label','U10'),('values', 'float',(3))])

In [99]: data=np.genfromtxt(txt,delimiter=',',dtype=dt)

In [100]: data
Out[100]: 
array([('label1', [12.0, 23.2, 232.0]), ('label2', [23.0, 2324.0, 324.0]),
       ('label3', [34.0, 123.0, 2141.0]), ('label4', [0.0, 2.0, 3.0])], 
      dtype=[('label', '<U10'), ('values', '<f8', (3,))])

In [101]: data['label']
Out[101]: 
array(['label1', 'label2', 'label3', 'label4'], 
      dtype='<U10')

In [103]: data['values']
Out[103]: 
array([[  1.20000000e+01,   2.32000000e+01,   2.32000000e+02],
       [  2.30000000e+01,   2.32400000e+03,   3.24000000e+02],
       [  3.40000000e+01,   1.23000000e+02,   2.14100000e+03],
       [  0.00000000e+00,   2.00000000e+00,   3.00000000e+00]])

With this definition the numeric values can be accessed as a 2d array. Sub-arrays like this are underappreciated.
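As a self-contained version of the session above, here is a short sketch that reads two of the sample lines from an in-memory buffer and then treats the sub-array field as an ordinary 2d array. The buffer and the variable names are illustrative assumptions, not part of the original session:

```python
import io

import numpy as np

# Two of the sample lines from above, held in memory instead of a file.
text = "label1, 12, 23.2, 232\nlabel2, 23, 2324, 324\n"

# One string field plus one float field holding a sub-array of 3 values.
dt = np.dtype([("label", "U10"), ("values", "f8", (3,))])

data = np.genfromtxt(io.StringIO(text), delimiter=",", dtype=dt)

values = data["values"]          # plain 2-D float array, shape (rows, 3)
col_means = values.mean(axis=0)  # ordinary array operations work on it

for rec in data:
    print(rec["label"], rec["values"])
```

Each record still pairs a label with its own array of values, which is the shape the question asked for.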

The dtype could have been specified with the dictionary syntax, but I'm more familiar with the list-of-tuples form.

Equivalent dtype specs (the string forms name the fields f0 and f1; 'f8' is used because a bare 'f' means float32):

np.dtype("U10, (3,)f8")
np.dtype({'names':['label','values'], 'formats':['U10','(3,)f8']})
np.genfromtxt(txt,delimiter=',',dtype='U10,(3,)f8')
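A quick way to check whether two dtype spellings really describe the same layout is to compare the resulting dtype objects. A small sketch with illustrative variable names; it also shows that the bare 'f' code means float32 rather than the float64 used in dt, and that the comma-string form auto-names its fields:

```python
import numpy as np

# List-of-tuples form, as used for dt above.
dt_list = np.dtype([("label", "U10"), ("values", "f8", (3,))])

# Dictionary form with the same names and formats.
dt_dict = np.dtype({"names": ["label", "values"],
                    "formats": ["U10", "(3,)f8"]})

# Comma-string form: same layout, but the fields get default names.
dt_str = np.dtype("U10, (3,)f8")

print(dt_list == dt_dict)   # same names, formats and offsets
print(dt_str.names)         # default field names
print(dt_str.itemsize)      # bytes per record: 10 UCS4 chars + 3 doubles

# A bare 'f' is single precision, so a '(3,)f' field is half the size.
print(np.dtype("f") == np.float32, np.dtype("(3,)f").itemsize)
```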

===============================

I think that this txt, if parsed with a flat dtype such as 'U10,f8,f8,f8' (dtype=None would deduce its own column types and string width), would produce

In [30]: y
Out[30]: 
array([('label1', 12.0, 23.2, 232.0), ('label2', 23.0, 2324.0, 324.0),
       ('label3', 34.0, 123.0, 2141.0), ('label4', 0.0, 2.0, 3.0)], 
      dtype=[('f0', '<U10'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])

This could be converted to the subfield form with

y.view(dt)

This works as long as the underlying data representation (seen as a flat list of bytes) is compatible (here 10 unicode characters (40 bytes), and 3 floats, per record).
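The byte-compatibility condition can be checked directly through itemsize before calling view. A minimal sketch, assuming a flat array built by hand with four separate fields (the names flat and nested are illustrative):

```python
import numpy as np

# Flat layout: one string field and three separate float fields.
flat = np.array(
    [("label1", 12.0, 23.2, 232.0), ("label2", 23.0, 2324.0, 324.0)],
    dtype="U10,f8,f8,f8",
)

# Target layout: the three floats grouped into one sub-array field.
dt = np.dtype([("label", "U10"), ("values", "f8", (3,))])

# Both layouts must use the same number of bytes per record.
assert flat.dtype.itemsize == dt.itemsize

nested = flat.view(dt)   # reinterprets the same memory, no copy

print(nested["label"])
print(nested["values"])
```

Because view does not copy, changes made through nested are visible in flat as well.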

edited May 10, 2016 at 23:53

answered May 9, 2016 at 21:56

hpaulj


4 Comments

M.Alsioufi Over a year ago

That's very useful for me, but I get the error "size of tuple must match number of fields." My actual txt file has the same form as the posted example, except that each line has a label and 22 other values, so my code was:

txt=StringIO(u"dataset.txt")
dt=np.dtype([('label','U10'),('features', 'float',(22))])
data=np.genfromtxt(txt,delimiter=',',dtype=dt)

hpaulj Over a year ago

Yes, the total number of fields, named or in sub-arrays, needs to match the number of columns in the file, or in your usecols parameter.

M.Alsioufi Over a year ago

Yes, I noticed that; however, I don't know why it does not work and shows me that error.

hpaulj Over a year ago

Try a dtype=None to see what sort of dtype it deduces from the data. That might help you correct your definition. Within limits you might even be able to translate from that dtype to yours with astype or view.
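One likely cause, assuming the snippet in the comment was run as posted: StringIO(u"dataset.txt") does not open the file, it creates an in-memory file whose only content is the text "dataset.txt", so genfromtxt sees a single column. Passing the file name itself lets genfromtxt open the file. A sketch under that assumption, using a hypothetical temporary file with one label and 22 values per line:

```python
import io
import os
import tempfile

import numpy as np

# Hypothetical stand-in for dataset.txt: a label followed by 22 values.
rows = ["label%d, %s" % (i, ", ".join(str(float(i + j)) for j in range(22)))
        for i in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.txt")
    with open(path, "w") as fh:
        fh.write("\n".join(rows) + "\n")

    dt = np.dtype([("label", "U10"), ("features", "f8", (22,))])

    # Pass the path itself; genfromtxt opens and reads the file.
    data = np.genfromtxt(path, delimiter=",", dtype=dt)

    # dtype=None shows how many columns genfromtxt actually finds.
    guessed = np.genfromtxt(path, delimiter=",", dtype=None, encoding=None)

# Wrapping the name in StringIO yields the name as the file's content.
wrapped = io.StringIO(u"dataset.txt").read()

print(data["features"].shape, len(guessed.dtype.names), repr(wrapped))
```

A trailing comma at the end of each line, as in the question's sample, may also add an empty extra column; in that case usecols can restrict the read to the intended columns.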
