Importing CSV into Python

Question

I have a CSV dataset that looks like this:

FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark

I'm trying to load it into Python 3.6 with this code:

>>> from numpy import genfromtxt

>>> my_data = genfromtxt('first.csv', delimiter=',')
>>> print(my_data)

Output:

 [[             nan              nan              nan              nan
               nan              nan]
 [  4.10000000e+01   4.10000000e+01              nan              nan
    1.13764000e+05              nan]
 [  5.30000000e+01   4.30000000e+01              nan              nan
    1.45963000e+05              nan]
 ..., 
 [  2.10000000e+01   3.00000000e+01              nan              nan
    1.19929000e+05              nan]
 [  6.90000000e+01   6.40000000e+01              nan              nan
    1.52667000e+05              nan]
 [  2.00000000e+01   1.90000000e+01              nan              nan
    1.05077000e+05              nan]]

I've looked at the NumPy docs and I don't see anything about this.

Is USA or UK a number?! What's the problem you're facing?
– Pedro Lobito, Commented Apr 17, 2017 at 2:09
The issue you may be running into is that numpy wants to parse your data as a numeric type, and this could be causing unexpected behavior.
– AgnosticDev, Commented Apr 17, 2017 at 2:11
The numeric columns/rows are right, just in float. The nan values stand in for strings that can't be interpreted as floats.
– hpaulj, Commented Apr 17, 2017 at 3:08
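
A minimal sketch reproducing hpaulj's point, assuming the sample rows from the question are fed in through io.StringIO instead of first.csv (the CSV_TEXT name is only for this illustration). With the default float dtype, every field that is not a number comes back as nan, and that includes the whole header row:

```python
import io

import numpy as np

# Sample rows copied from the question; StringIO stands in for 'first.csv'.
CSV_TEXT = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

# The default dtype is float, so every non-numeric field is replaced by nan.
raw = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=',')

header_is_nan = bool(np.isnan(raw[0]).all())   # the header row is all text
nan_columns = np.isnan(raw[1:]).all(axis=0)    # True for the text columns
print(raw.shape, header_is_nan, nan_columns)
```

The numeric columns survive as floats, which matches the output shown in the question.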

zipa · Answer · 2017-04-17 02:13:32Z

2

Go with pandas; it will save you the trouble:

import pandas as pd

df = pd.read_csv('first.csv')
print(df)
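
If a plain NumPy array is still needed afterwards, the numeric columns can be pulled out of the DataFrame. This is only a sketch: it assumes the sample rows from the question, read through io.StringIO instead of first.csv, and the CSV_TEXT name is made up for the illustration.

```python
import io

import pandas as pd

# Sample rows copied from the question; StringIO stands in for 'first.csv'.
CSV_TEXT = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# read_csv infers a type per column: the numeric columns become integers,
# and the text columns are kept as strings.
numeric = df[['FirstAge', 'SecondAge', 'Income']].to_numpy()
print(df.dtypes)
print(numeric)
```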

eyllanesc · Answer · 2017-04-17 02:38:02Z

1

You could use the dtype argument:

import numpy as np

output = np.genfromtxt("main.csv", delimiter=',', skip_header=1, dtype='f, f, |S6, |S6, f, |S6')

print(output)

Output:

[( 41.,  41., b'USA', b'UK',  113764., b'John')
 ( 53.,  43., b'USA', b'USA',  145963., b'Fred')
 ( 47.,  37., b'USA', b'UK',   42857., b'Dan')
 ( 47.,  44., b'UK', b'USA',   95352., b'Mark')]
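
A variation on the same idea, offered as a sketch and not as part of the original answer: giving the fields names and using 'U6' instead of '|S6' yields str values in place of bytes. The field widths and the encoding='utf-8' argument are choices made for this illustration, and the sample rows are read through io.StringIO instead of a file.

```python
import io

import numpy as np

# Sample rows copied from the question; StringIO stands in for the CSV file.
CSV_TEXT = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

# Field names follow the header row; the widths (6 characters) are an
# assumption for this sketch. 'U6' stores str, whereas '|S6' stores bytes.
fields = [('FirstAge', 'f4'), ('SecondAge', 'f4'),
          ('FirstCountry', 'U6'), ('SecondCountry', 'U6'),
          ('Income', 'f4'), ('NAME', 'U6')]

output = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=',', skip_header=1,
                       dtype=fields, encoding='utf-8')
print(output['NAME'])
```

Each column can then be selected by name, for example output['Income'].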

8 Comments

sascha Over a year ago

I'm too lazy to check, but I thought about skip_header because the first row looked that bad. With it, automatic inference should work too (no need to define the types manually). But finally a good answer (which really answers the question).

eyllanesc Over a year ago

I thought numpy would infer it, but it does not; it fills those values with nan.

eyllanesc Over a year ago

If you want to infer the types you could use pandas as shown in another answer.

sascha Over a year ago

Sure. I always use pandas for that kind of stuff. But I'm puzzled by numpy's behaviour (as this inference is really easy and np should do that too). But never mind. This is a good answer, maybe a bit too compact, but compared to the others it shines.

eyllanesc Over a year ago

Thanks for your comment.

titipata · Answer · 2017-04-17 02:42:56Z

1

An alternative to pandas is to use the csv library:

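import csv
import numpy as np

# newline='' is what the csv module recommends when opening files for it
with open('first.csv', 'r', newline='') as f:
    ls = list(csv.reader(f))
val_array = np.array(ls)[1:]  # exclude the first row (column names)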

edited Apr 17, 2017 at 2:42

answered Apr 17, 2017 at 2:30

2 Comments

hpaulj Over a year ago

I get an array of string dtype with your way.

titipata Over a year ago

Ah, yeah. You have to cast each column to another type later if you use csv.reader.
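
The later cast mentioned here could look roughly like the sketch below. It assumes the sample rows from the question (read through io.StringIO instead of first.csv) and that the three numeric columns hold integers; the variable names are made up for the illustration.

```python
import csv
import io

import numpy as np

# Sample rows copied from the question; StringIO stands in for 'first.csv'.
CSV_TEXT = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

rows = list(csv.reader(io.StringIO(CSV_TEXT)))
header, body = rows[0], np.array(rows[1:])  # body is a 2-D array of strings

# Look up the numeric columns by header name and cast only those to int;
# the text columns stay as strings.
numeric_idx = [header.index(name) for name in ('FirstAge', 'SecondAge', 'Income')]
numeric = body[:, numeric_idx].astype(int)
names = body[:, header.index('NAME')]
print(numeric)
print(names)
```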

hpaulj · Answer · 2017-04-17 03:03:09Z

With a few general parameters, genfromtxt can read this file (in Python 3 here):

In [100]: data = np.genfromtxt('stack43444219.txt', delimiter=',', names=True, dtype=None)
In [101]: data
Out[101]: 
array([(41, 41, b'USA', b'UK', 113764, b'John'),
       (53, 43, b'USA', b'USA', 145963, b'Fred'),
       (47, 37, b'USA', b'UK',  42857, b'Dan'),
       (47, 44, b'UK', b'USA',  95352, b'Mark')], 
      dtype=[('FirstAge', '<i4'), ('SecondAge', '<i4'), ('FirstCountry', 'S3'), ('SecondCountry', 'S3'), ('Income', '<i4'), ('NAME', 'S4')])

This is a structured array: two integer fields, two string fields (byte strings by default), another integer field, and another string field.

By default, genfromtxt reads all lines as data. I used names=True to make it use the first line as field names.

By default it also tries to read every value as a float (the default dtype), so the string columns load as nan. Passing dtype=None makes it infer a type for each column instead.

All of this is in the genfromtxt docs. Admittedly they are long, but they aren't hard to find.

Access fields by name, data['FirstAge'] etc.
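
A short sketch of that field access, assuming the sample rows from the question are read through io.StringIO. The encoding='utf-8' argument is an addition for this illustration: it asks for str values, so the text fields are not byte strings here. The integer width that gets inferred can differ by platform.

```python
import io

import numpy as np

# Sample rows copied from the question; StringIO stands in for the file.
CSV_TEXT = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

data = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

ages = data['FirstAge']            # one integer column
countries = data['FirstCountry']   # one text column

# Stack the numeric fields into an ordinary 2-D array for numeric work.
numeric = np.column_stack([data[name]
                           for name in ('FirstAge', 'SecondAge', 'Income')])
print(data.dtype.names)
print(numeric)
```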

Using the csv reader gives a 2d array of strings:

In [102]: ls =list(csv.reader(open('stack43444219.txt','r')))
In [103]: ls
Out[103]: 
[['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income', 'NAME'],
 ['41', '41', 'USA', 'UK', '113764', 'John'],
 ['53', '43', 'USA', 'USA', '145963', 'Fred'],
 ['47', '37', 'USA', 'UK', '42857', 'Dan'],
 ['47', '44', 'UK', 'USA', '95352', 'Mark']]
In [104]: arr=np.array(ls)
In [105]: arr
Out[105]: 
array([['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income',
        'NAME'],
       ['41', '41', 'USA', 'UK', '113764', 'John'],
       ['53', '43', 'USA', 'USA', '145963', 'Fred'],
       ['47', '37', 'USA', 'UK', '42857', 'Dan'],
       ['47', '44', 'UK', 'USA', '95352', 'Mark']], 
      dtype='<U13')

AgnosticDev · Accepted Answer · 2017-04-17 03:02:20Z

-1

I think an issue you could be running into is that the data you are trying to parse is not all numeric, and this could cause unexpected behavior.

One way to handle this would be to identify the types before the values are added to your array. For example:

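my_data = [41, 'USA', 113764, 'John']  # sample values, for illustration only
numeric, other = [], []
for obj in my_data:
    if isinstance(obj, int):
        numeric.append(obj)  # process or add your data to numpy
    else:
        other.append(obj)  # cast or discard the data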
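
Values read from a CSV file arrive as strings, so a type check on them needs a parsing step first. The sketch below is one possible way to do that and is not taken from any library; the helper name parse_token and the sample row are made up for the illustration.

```python
def parse_token(token):
    """Return the token as an int or float when possible, else unchanged."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            continue
    return token


# One data row from the question, as csv.reader would deliver it.
row = ['41', '41', 'USA', 'UK', '113764', 'John']

parsed = [parse_token(tok) for tok in row]
numeric_only = [value for value in parsed if isinstance(value, (int, float))]
print(parsed)
print(numeric_only)
```

Trying int before float keeps whole numbers as integers; anything that neither cast accepts is left as a string.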

edited Apr 17, 2017 at 3:02

answered Apr 17, 2017 at 2:15

5 Comments

sascha Over a year ago

Without wanting to attack anyone (in some unfair way): this answer is of very low quality. It mostly indicates the source of the problem, does not give a cure, and presents pseudo-code which is far, far away from running (this is the most upsetting for me). I don't understand how this one got accepted, especially 30 minutes after asking.

AgnosticDev Over a year ago

Sascha, you are correct in that I am pointing out the problem and not providing a defined solution. If it would be in the best interest of this post, I can remove my answer.

sascha Over a year ago

Well... not my decision. But the first statement is wrong (although that's numpy's specialty), as numpy can also store strings, even just for some columns in these record arrays. And the pseudo-code is very bad; it's not really too hard to express your idea in code (probably using isinstance).

AgnosticDev Over a year ago

OK, valid points. I just tried to remove my answer and because it is the accepted one it cannot be removed.

sascha Over a year ago

Interesting. I did not know that either. So don't bother. We will see if OP recognizes the other answers (2 good quality ones).
