
I have a problem and I don't know how to handle it.

I have a CSV file like this:

0.3,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51
0.6,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51
0.9,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51

As you can see, there are first 9 float values and then 2 others that I would like to load as strings. The delimiter is ','.

When I use:

load = np.genfromtxt(str(path), delimiter=',')
print load[0,4]

it prints the value from row 0, column 4, and it works. The data is loaded properly. But there is a problem, because the last 2 values come out as nan:

print load[0,10]
>>nan

When I change my code to this:

load = np.genfromtxt(str(path), delimiter=',',dtype=None)

I get an error:

print load[0,4]
IndexError: too many indices for array

So everything works until I add dtype=None.

What am I doing wrong?

  • There are lots of questions about genfromtxt producing a 1d array. Posters don't realize it has a compound dtype. Read about structured arrays. Commented Apr 4, 2016 at 15:12
  • See stackoverflow.com/q/35699886/901925 Commented Apr 4, 2016 at 15:31

3 Answers


You can't create a regular numpy array with several dtypes. You have to import your csv with dtype=str:

import numpy as np
load = np.genfromtxt(str(path), delimiter=',',dtype=str)

With dtype=None it creates a numpy array with shape (3,), so you can't call load[0, 4].

Each entry is a tuple holding one row of your data, because tuples can contain several types.
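
A small sketch (not part of the original answer) of how that 1-d structured result can still be indexed, assuming the automatic field names f0 ... f10 that genfromtxt generates and the same path as in the question:

import numpy as np

load = np.genfromtxt(str(path), delimiter=',', dtype=None)
print(load.dtype.names)   # ('f0', 'f1', ..., 'f10')
print(load[0]['f4'])      # row 0, field 'f4' -- the structured equivalent of load[0, 4]
print(load['f9'])         # the whole date column as strings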

Maybe for your purpose you should use pandas:

import pandas as pd
load = pd.read_csv(str(path), header=None)

The output is the following:

     0          1            2            3  4  5        6         7  \
0  0.3  36.226837  -115.046648  1836.255238  0  0  0.21059  0.684809
1  0.6  36.226837  -115.046648  1836.255238  0  0  0.21059  0.684809
2  0.9  36.226837  -115.046648  1836.255238  0  0  0.21059  0.684809

           8          9        10
0  41.150868  2016/3/26  4:35:51
1  41.150868  2016/3/26  4:35:51
2  41.150868  2016/3/26  4:35:51

Each column has the proper dtype inferred by pandas.
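
If you then need plain numpy arrays, a rough sketch (not part of the original answer, using the same file as the question) of pulling the float and string columns back out of the DataFrame:

import pandas as pd

load = pd.read_csv(str(path), header=None)
floats  = load.iloc[:, :9].values.astype(float)   # (3, 9) float array
strings = load.iloc[:, 9:].values.astype(str)     # (3, 2) array of date/time strings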


7 Comments

Hm, can I create two variables, load (with float values) and load2 (with dtype=str) and then merge them? Cols 1-8 from load and 9-10 from load2? And how could I do that?
With dtype=None you get the correct dtype, a mix of floats and strings, just like you'd get in pandas. But you have to access the fields by name, not by column number. Look at load.dtype.
With dtype=None you get an array with shape (3,) and not (3, 11) as expected. With pandas the shape of the dataframe is as expected.
What do you get when you convert the dataframe to an array?
If you use load.as_matrix() you get an array with dtype=object.

Applying an earlier genfromtxt answer to this case:

txt="""0.3,36.22683698,-115.0466482,1836.255238,0,0,0.2105903662,0.6848089322,41.15086807,2016/3/26,4:35:51
... ..."""
>>> load=np.genfromtxt(txt.splitlines(),dtype=None,delimiter=',')
>>> load.shape
(3,)
>>> load.dtype
dtype([('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', 'S9'), ('f10', 'S7')])

The shape is 1d, but the dtype is compound, a mix of floats, ints and strings - 11 of them.

>>> load[0]
(0.3, 36.22683698, -115.0466482, 1836.255238, 0, 0, 0.2105903662, 0.6848089322, 41.15086807, '2016/3/26', '4:35:51')
>>> load['f0']
array([ 0.3,  0.6,  0.9])

'Rows' or records are accessed by number, but 'columns' are now fields, and accessed by name (you can get the names from csv column headers as well; here they are generated automatically).

>>> load[0]['f4']
0
>>> load[0]['f3']
1836.255238

Individual elements are accessed by a combination of number and name.

A disadvantage of this structured array format is that the ability to do math across columns is limited. A way around this is to group like columns into another layer of compounding.

With this data I can define 5 fields, a mix of float, int and string:

>>> dt=np.dtype('(4)float,(2)int,(3)float,S10,S10')
>>> dt
dtype([('f0', '<f8', (4,)), ('f1', '<i4', (2,)), ('f2', '<f8', (3,)), ('f3', 'S10'), ('f4', 'S10')])
>>> load=np.genfromtxt(txt.splitlines(),dtype=dt,delimiter=',')

Now the first field is a (3,4) array:

>>> load['f0']
array([[  3.00000000e-01,   3.62268370e+01,  -1.15046648e+02,
          1.83625524e+03],
       [  6.00000000e-01,   3.62268370e+01,  -1.15046648e+02,
          1.83625524e+03],
       [  9.00000000e-01,   3.62268370e+01,  -1.15046648e+02,
          1.83625524e+03]])
>>> load['f1']
array([[0, 0],
       [0, 0],
       [0, 0]])

dt=np.dtype('(9)float,S10,S10') also works since the 2 int columns can load as floats.

The last 2 columns could be loaded as np.datetime64, though the comma separating them might complicate the step.
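
A rough sketch of one way that conversion could look (not from the original answer; it assumes the automatic-dtype load from the top of this answer, with bytes fields f9 and f10 on Python 3, and uses strptime because the dates are not zero-padded ISO 8601):

from datetime import datetime
import numpy as np

# Join each date and time string, parse it, then hand the result to numpy.
stamps = np.array(
    [datetime.strptime(d.decode() + ' ' + t.decode(), '%Y/%m/%d %H:%M:%S')
     for d, t in zip(load['f9'], load['f10'])],
    dtype='datetime64[s]')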

These 9 numeric columns can be extracted from a pandas load into a numpy float array with:

pload.values[:,:9].astype(float)
pload.as_matrix(range(9))



You need to add names=True in np.genfromtxt(). There is a similar question at genfromtxt returning NaN rows.

Take a look over there
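
For reference, a minimal sketch of what that might look like (not from the original answer; since the CSV in the question has no header row, names=True would consume the first data line, so explicit, made-up field names are passed instead):

import numpy as np

# Hypothetical field names -- the file itself has no header row.
names = ['t', 'lat', 'lon', 'alt', 'a', 'b', 'c', 'd', 'e', 'date', 'time']
load = np.genfromtxt(str(path), delimiter=',', dtype=None, names=names)
print(load['date'][0])   # b'2016/3/26' on Python 3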

