
I have a problem and I don't know how to handle it.

I have a CSV file like this:

0.3,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51
0.6,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51
0.9,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51

As you can see, there are first 9 float values and then 2 others that I would like to load as strings. The delimiter is ','.

When I use:

load = np.genfromtxt(str(path), delimiter=',')
print load[0,4]

it prints the value from row 0, column 4, and it works. The data is loaded properly. But there is a problem, because the last 2 values come out as nan:

print load[0,10]
>>nan

When I change my code to this:

load = np.genfromtxt(str(path), delimiter=',',dtype=None)

I get an error:

print load[0,4]
IndexError: too many indices for array

So everything works until I add dtype=None.

What am I doing wrong?

  • There are lots of questions about genfromtxt producing a 1d array. Posters don't realize it has a compound dtype. Read about structured arrays. Commented Apr 4, 2016 at 15:12
  • See stackoverflow.com/q/35699886/901925 Commented Apr 4, 2016 at 15:31

3 Answers


You can't create a regular numpy array with several dtypes. You have to import your csv with dtype=str:

import numpy as np
load = np.genfromtxt(str(path), delimiter=',',dtype=str)

With dtype=None it creates a numpy array with shape (3,), so you can't call load[0, 4].

Each entry is a tuple holding one row of your data, because tuples can contain several types.
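
A small sketch (not part of the original answer) of how that 1-d structured result can still be indexed, assuming the automatic field names f0 ... f10 that genfromtxt generates and the same path as in the question:

import numpy as np

load = np.genfromtxt(str(path), delimiter=',', dtype=None)
print(load.dtype.names)   # ('f0', 'f1', ..., 'f10')
print(load[0]['f4'])      # row 0, field 'f4' -- the structured equivalent of load[0, 4]
print(load['f9'])         # the whole date column as strings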

Maybe for your purpose you should use pandas:

import pandas as pd
load = pd.read_csv(str(path), header=None)

The output is the following:

     0          1            2            3  4  5        6         7  \
0  0.3  36.226837  -115.046648  1836.255238  0  0  0.21059  0.684809
1  0.6  36.226837  -115.046648  1836.255238  0  0  0.21059  0.684809
2  0.9  36.226837  -115.046648  1836.255238  0  0  0.21059  0.684809

           8          9        10
0  41.150868  2016/3/26  4:35:51
1  41.150868  2016/3/26  4:35:51
2  41.150868  2016/3/26  4:35:51

Each column has the proper dtype inferred by pandas.
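
If you then need plain numpy arrays, a rough sketch (not part of the original answer, using the same file as the question) of pulling the float and string columns back out of the DataFrame:

import pandas as pd

load = pd.read_csv(str(path), header=None)
floats  = load.iloc[:, :9].values.astype(float)   # (3, 9) float array
strings = load.iloc[:, 9:].values.astype(str)     # (3, 2) array of date/time strings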


7 Comments

Hm, can I create two variables, load (with float values) and load2 (with dtype=str) and then merge them? Cols 1-8 from load and 9-10 from load2? And how could I do that?
With dtype=None you get the correct dtype, a mix of floats and strings, just like you'd get in pandas. But you have to access the fields by name, not by column number. Look at load.dtype.
With dtype=None you get an array with shape (3,) and not (3, 11) as expected. With pandas the shape of the dataframe is as expected.
What do you get when you convert the dataframe to an array?
If you use load.as_matrix() you get an array with dtype=object.

Applying an earlier genfromtxt answer to this case:

txt="""0.3,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51
... ..."""
>>> load=np.genfromtxt(txt.splitlines(),dtype=None,delimiter=',')
>>> load.shape
(3,)
>>> load.dtype
dtype([('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', 'S9'), ('f10', 'S7')])

The shape is 1d, but the dtype is compound, a mix of floats, ints and strings - 11 of them.

>>> load[0]
(0.3, 36.22683698, -115.0466482, 1836.255238, 0, 0, 0.2105903662, 0.6848089322, 41.15086807, '2016/3/26', '4:35:51')
>>> load['f0']
array([ 0.3,  0.6,  0.9])

'Rows' or records are accessed by number, but 'columns' are now fields, and accessed by name (you can get the names from csv column headers as well; here they are generated automatically).

>>> load[0]['f4']
0
>>> load[0]['f3']
1836.255238

Individual elements are accessed by a combination of number and name.

A disadvantage of this structured array format is that the ability to do math across columns is limited. A way around this is to group like columns into another layer of compounding.

With this data I can define 5 fields, a mix of float, int and string:

>>> dt=np.dtype('(4)float,(2)int,(3)float,S10,S10')
>>> dt
dtype([('f0', '<f8', (4,)), ('f1', '<i4', (2,)), ('f2', '<f8', (3,)), ('f3', 'S10'), ('f4', 'S10')])
>>> load=np.genfromtxt(txt.splitlines(),dtype=dt,delimiter=',')

Now the first field is a (3,4) array:

>>> load['f0']
array([[  3.00000000e-01,   3.62268370e+01,  -1.15046648e+02,
          1.83625524e+03],
       [  6.00000000e-01,   3.62268370e+01,  -1.15046648e+02,
          1.83625524e+03],
       [  9.00000000e-01,   3.62268370e+01,  -1.15046648e+02,
          1.83625524e+03]])
>>> load['f1']
array([[0, 0],
       [0, 0],
       [0, 0]])

dt=np.dtype('(9)float,S10,S10') also works since the 2 int columns can load as floats.

The last 2 columns could be loaded as np.datetime64, though the comma separating them might complicate the step.
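
A rough sketch of one way that conversion could look (not from the original answer; it assumes the automatic-dtype load from the top of this answer, with bytes fields f9 and f10 on Python 3, and uses strptime because the dates are not zero-padded ISO 8601):

from datetime import datetime
import numpy as np

# Join each date and time string, parse it, then hand the result to numpy.
stamps = np.array(
    [datetime.strptime(d.decode() + ' ' + t.decode(), '%Y/%m/%d %H:%M:%S')
     for d, t in zip(load['f9'], load['f10'])],
    dtype='datetime64[s]')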

These 9 numeric columns can be extracted from a pandas load into a numpy float array with:

pload.values[:,:9].astype(float)
pload.as_matrix(range(9))



You need to add names=True in np.genfromtxt(). There is a similar question at genfromtxt returning NaN rows.

Take a look over there
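
For reference, a minimal sketch of what that might look like (not from the original answer; since the CSV in the question has no header row, names=True would consume the first data line, so explicit, made-up field names are passed instead):

import numpy as np

# Hypothetical field names -- the file itself has no header row.
names = ['t', 'lat', 'lon', 'alt', 'a', 'b', 'c', 'd', 'e', 'date', 'time']
load = np.genfromtxt(str(path), delimiter=',', dtype=None, names=names)
print(load['date'][0])   # b'2016/3/26' on Python 3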

