0

I have a CSV dataset that looks like this:

FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark  

I'm trying to load it into Python 3.6 with this code:

>>> from numpy import genfromtxt

>>> my_data = genfromtxt('first.csv', delimiter=',')
>>> print(my_data)

Output:

 [[             nan              nan              nan              nan
               nan              nan]
 [  4.10000000e+01   4.10000000e+01              nan              nan
    1.13764000e+05              nan]
 [  5.30000000e+01   4.30000000e+01              nan              nan
    1.45963000e+05              nan]
 ..., 
 [  2.10000000e+01   3.00000000e+01              nan              nan
    1.19929000e+05              nan]
 [  6.90000000e+01   6.40000000e+01              nan              nan
    1.52667000e+05              nan]
 [  2.00000000e+01   1.90000000e+01              nan              nan
    1.05077000e+05              nan]]

I've looked at the NumPy docs and I don't see anything about this.

3
  • Is USA or UK a number?! What's the problem you're facing? Commented Apr 17, 2017 at 2:09
  • The issue you may be running into is that numpy wants to parse your data as a numeric type, and this could be causing unexpected behavior. Commented Apr 17, 2017 at 2:11
  • The numeric columns/rows are right, just in float. The nan values stand in for strings that can't be interpreted as floats. Commented Apr 17, 2017 at 3:08

5 Answers

2

Go with pandas, it will save you the trouble:

import pandas as pd

df = pd.read_csv('first.csv')
print(df)

1

You could use the dtype argument:

import numpy as np

output = np.genfromtxt("main.csv", delimiter=',', skip_header=1, dtype='f, f, |S6, |S6, f, |S6')

print(output)

Output:

[( 41.,  41., b'USA', b'UK',  113764., b'John')
 ( 53.,  43., b'USA', b'USA',  145963., b'Fred')
 ( 47.,  37., b'USA', b'UK',   42857., b'Dan')
 ( 47.,  44., b'UK', b'USA',   95352., b'Mark')]
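
For reference, the same explicit-dtype approach can ask for str fields instead of bytes. This is only a sketch: it assumes NumPy 1.14 or later (where genfromtxt accepts an encoding argument), inlines the question's rows through io.StringIO so it runs without main.csv, and uses the hypothetical names SAMPLE and typed. The field widths (U3, U4) are sized for this sample only.

```python
import io
import numpy as np

# Rows copied from the question; inlined so the sketch runs without main.csv.
SAMPLE = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

typed = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter=",",
    skip_header=1,               # skip the header row
    dtype="i4,i4,U3,U3,i4,U4",   # one type per column; U = unicode string
    encoding="utf-8",            # assumption: NumPy >= 1.14
)

# A comma-separated dtype string gives the fields default names f0, f1, ...
print(typed["f0"])
print(typed["f5"])
```

Wider string columns would need larger widths, otherwise the values are truncated to the declared length.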

8 Comments

I'm too lazy to check, but I thought about skip_header, as the first row looked that bad. In this case, using it, automatic inference should work too (no need to define it manually). But finally a good answer (which really answers the question).
I thought numpy would infer it, but it does not; it fills those values in as nan.
If you want to infer the types you could use pandas as shown in another answer.
Sure. I always use pandas for that kind of stuff. But I'm puzzled by numpy's behaviour (as this inference is really easy and np should do that too). But never mind. This is a good answer, maybe a bit too compact, but compared to the others it shines.
Thanks for your comment.
1

An alternative to pandas is the csv library:

import csv
import numpy as np
with open('first.csv', newline='') as f:
    ls = list(csv.reader(f))
val_array = np.array(ls)[1:]  # exclude the first row (column names)
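
Because the csv module returns every field as a string, the numeric columns need an explicit cast afterwards. A minimal sketch of that step, assuming the sample rows from the question (inlined through io.StringIO so it runs without first.csv); SAMPLE, INT_COLUMNS and load_rows are hypothetical names, and the set of integer columns is an assumption about this particular file:

```python
import csv
import io

# Rows copied from the question; inlined so the sketch runs without first.csv.
SAMPLE = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

# Columns that hold integers in this file (an assumption about the schema).
INT_COLUMNS = {"FirstAge", "SecondAge", "Income"}

def load_rows(text):
    """Parse CSV text into a list of dicts, casting the known numeric columns to int."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({key: int(value) if key in INT_COLUMNS else value
                     for key, value in record.items()})
    return rows

rows = load_rows(SAMPLE)
print(rows[0])
```

csv.DictReader is used here instead of csv.reader only so that each value can be matched to its column name; the casting idea is the same either way.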

2 Comments

I get an array of string dtype with your way.
Ah, yeah. You have to cast each column to another type later if you use csv.reader.
1

With a few general parameters, genfromtxt can read this file (in PY3 here):

In [100]: data = np.genfromtxt('stack43444219.txt', delimiter=',', names=True, dtype=None)
In [101]: data
Out[101]: 
array([(41, 41, b'USA', b'UK', 113764, b'John'),
       (53, 43, b'USA', b'USA', 145963, b'Fred'),
       (47, 37, b'USA', b'UK',  42857, b'Dan'),
       (47, 44, b'UK', b'USA',  95352, b'Mark')], 
      dtype=[('FirstAge', '<i4'), ('SecondAge', '<i4'), ('FirstCountry', 'S3'), ('SecondCountry', 'S3'), ('Income', '<i4'), ('NAME', 'S4')])

This is a structured array: two integer fields, two string fields (byte strings by default), another integer, and another string.

By default, genfromtxt reads all lines as data. I used names=True so that the first line is used as the field names.

By default it also tries to read every value as a float (the default dtype). The string columns then load as nan.

All of this is in the genfromtxt docs. Admittedly they are long, but they aren't hard to find.

Access fields by name, data['FirstAge'] etc.
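
As a rough illustration of field access, here is a sketch that inlines the question's rows with io.StringIO so it runs without the file. It assumes NumPy 1.14 or later, where genfromtxt accepts an encoding argument that yields str fields rather than bytes; SAMPLE and uk_names are hypothetical names.

```python
import io
import numpy as np

# Rows copied from the question; inlined so the sketch runs without the file.
SAMPLE = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

# names=True takes field names from the header row; dtype=None infers a type per column.
# encoding="utf-8" is an assumption about the NumPy version (1.14 or later).
data = np.genfromtxt(io.StringIO(SAMPLE), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

print(data.dtype.names)        # field names taken from the header
print(data["FirstAge"])        # one column as a 1-D integer array
print(data["Income"].mean())   # numeric fields support the usual array operations

# A boolean mask built from a string field selects whole records.
uk_names = data[data["FirstCountry"] == "UK"]["NAME"]
print(uk_names)
```

Indexing with an integer (data[0]) returns one record; indexing with a field name returns one column.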


Using the csv reader gives a 2d array of strings:

In [102]: ls =list(csv.reader(open('stack43444219.txt','r')))
In [103]: ls
Out[103]: 
[['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income', 'NAME'],
 ['41', '41', 'USA', 'UK', '113764', 'John'],
 ['53', '43', 'USA', 'USA', '145963', 'Fred'],
 ['47', '37', 'USA', 'UK', '42857', 'Dan'],
 ['47', '44', 'UK', 'USA', '95352', 'Mark']]
In [104]: arr=np.array(ls)
In [105]: arr
Out[105]: 
array([['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income',
        'NAME'],
       ['41', '41', 'USA', 'UK', '113764', 'John'],
       ['53', '43', 'USA', 'USA', '145963', 'Fred'],
       ['47', '37', 'USA', 'UK', '42857', 'Dan'],
       ['47', '44', 'UK', 'USA', '95352', 'Mark']], 
      dtype='<U13')


-1

I think the issue you could be running into is that the data you are trying to parse is not all numeric, and this could potentially cause unexpected behavior.

One way to detect the types would be to try and identify the types before they are added to your array. For example:

for obj in my_data:
    if type(obj) == int:
        pass  # process or add your data to numpy
    else:
        pass  # cast or discard the data
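
A runnable take on that idea, offered only as a sketch: each raw field is tried as an int, then as a float, and is otherwise kept as a string. parse_value is a hypothetical helper, not part of numpy or the csv module, and the sample row is copied from the question.

```python
def parse_value(text):
    """Return text as an int if possible, else a float, else the stripped string."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text

# One data row from the question, split into raw string fields.
row = "41,41,USA,UK,113764,John".split(",")
parsed = [parse_value(field) for field in row]
print(parsed)
```

The parsed values can then be collected per column before building an array, so that each column ends up with a single type.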

5 Comments

Without wanting to attack someone (in some unfair way): this answer has a very very low quality, mostly indicates the source of the problem, does not give a cure and presents pseudo-code which is far far away from running (this is the most upsetting for me). I don't understand how to accept this one, especially 30 minutes after asking.
Sascha, you are correct in that I am pointing out the problem and not providing a defined solution. If it would be in the best interest of this post, I can remove my answer.
Well... not my decision. But the first statement is wrong (although that's numpy's specialty) as numpy can also store strings, even just for some columns in these record-arrays. And the pseudo-code is very bad and it's not really too hard to express your idea into code (probably using isinstance).
OK, valid points. I just tried to remove my answer and because it is the accepted one it cannot be removed.
Interesting. I also did not know that. So don't bother. We will see, if OP recognizes the other answers (2 good quality ones).
