How do I read CSV data into a record array in NumPy?

Question

Is there a direct way to import the contents of a CSV file into a record array, just like how R's read.table(), read.delim(), and read.csv() import data into R dataframes?

Or should I use csv.reader() and then apply numpy.core.records.fromrecords()?

Mateen Ulhaq · Accepted Answer · 2022-06-13 07:56:55Z

889

Use numpy.genfromtxt() by setting the delimiter kwarg to a comma:

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

edited Jun 13, 2022 at 7:56

Mateen Ulhaq

answered Aug 19, 2010 at 6:34

Andrew

8 Comments

CGTheLegend Over a year ago

What if you want something of different types? Like strings and ints?

chickensoup Over a year ago

@CGTheLegend np.genfromtxt('myfile.csv',delimiter=',',dtype=None)

Yibo Yang Over a year ago

numpy.loadtxt worked pretty well for me too

hhh Over a year ago

I tried this but I am only getting nan values, why? Also with loadtxt, I am getting UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 155: ordinal not in range(128). I have umlauts such as ä and ö in the input data.

kolen Over a year ago

@hhh try adding encoding="utf8" argument. Python is one of the few modern software pieces that frequently causes text encoding problems, which feel as things from the past.
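
A minimal sketch of the two symptoms discussed in this thread, assuming a small made-up in-memory file whose first column is text with umlauts. With the default float dtype, genfromtxt cannot parse the text and returns nan; passing dtype=None together with an explicit encoding keeps the strings.

```python
import io
import numpy as np

# Hypothetical sample data, encoded as UTF-8 bytes like a file on disk.
raw = "Jörg,1,2.5\nÄnne,3,4.5\n".encode("utf-8")

# Default dtype is float, so the text column cannot be parsed and becomes nan.
as_float = np.genfromtxt(io.BytesIO(raw), delimiter=",")

# dtype=None guesses a type per column; encoding decodes the bytes as UTF-8.
mixed = np.genfromtxt(
    io.BytesIO(raw), delimiter=",", dtype=None, encoding="utf-8"
)

print(as_float)      # first column is nan
print(mixed.dtype)   # one field per column: f0, f1, f2
print(mixed["f0"])   # the text column, decoded correctly
```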

Mateen Ulhaq · Accepted Answer · 2022-07-29 07:54:02Z

249

Use pandas.read_csv:

import pandas as pd
df = pd.read_csv('myfile.csv', sep=',', header=None)
print(df.values)

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

This gives a pandas DataFrame, which provides many useful data manipulation functions that are not directly available with NumPy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table...
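
Since the question asks for a record array, one possible route from a DataFrame is DataFrame.to_records(). The sketch below assumes a small made-up CSV with a header row; the column names label, qty and score are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample data with a header row.
csv_text = "label,qty,score\nalpha,1,2.5\nbeta,3,4.5\n"

df = pd.read_csv(io.StringIO(csv_text))

# index=False leaves the DataFrame index out of the resulting record array.
rec = df.to_records(index=False)

print(rec.dtype.names)  # field names taken from the CSV header
print(rec.score)        # fields are reachable as attributes
print(rec["qty"])       # or by key
```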

I would also recommend numpy.genfromtxt. However, since the question asks for a record array, as opposed to a normal array, the dtype=None parameter needs to be added to the genfromtxt call:

import numpy as np
np.genfromtxt('myfile.csv', delimiter=',')

For the following 'myfile.csv':

1.0, 2, 3
4, 5.5, 6

the code above gives an array:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

and

np.genfromtxt('myfile.csv', delimiter=',', dtype=None)

gives a record array:

array([(1.0, 2.0, 3), (4.0, 5.5, 6)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that files with multiple data types (including strings) can be easily imported.
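
If guessing is not wanted, the field names and types can also be spelled out. This is only a sketch on made-up data; the field names id, label and score and the string width U8 are assumptions for the example.

```python
import io
import numpy as np

# Hypothetical sample data: an integer, a short string, and a float per row.
raw = b"1,apple,2.5\n2,kiwi,4.0\n"

# Explicit structured dtype: one (name, type) pair per column.
dtype = [("id", "i4"), ("label", "U8"), ("score", "f8")]

data = np.genfromtxt(
    io.BytesIO(raw), delimiter=",", dtype=dtype, encoding="utf-8"
)

print(data["label"])  # access a column by field name
print(data[0])        # access a whole record by position
```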

edited Jul 29, 2022 at 7:54

Mateen Ulhaq

answered Oct 10, 2014 at 9:30

Lee

5 Comments

Viet Over a year ago

read_csv works with commas inside quotes. Recommend this over genfromtxt
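
To illustrate the point about quoted fields, here is a small sketch on made-up data (the city names and numbers are invented). Both pandas.read_csv and the standard csv module treat a comma inside double quotes as part of the field, whereas genfromtxt, as far as I know, splits on every delimiter.

```python
import csv
import io
import pandas as pd

# Hypothetical data: the first field contains a comma inside quotes.
csv_text = 'city,population\n"Portland, OR",650000\n"Austin, TX",960000\n'

# pandas keeps "Portland, OR" as one field.
df = pd.read_csv(io.StringIO(csv_text))

# The csv module does the same, but returns every field as a string.
rows = list(csv.reader(io.StringIO(csv_text)))

print(df["city"].tolist())
print(rows[1])
```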

c-chavez Over a year ago

use header=0 to skip the first line in the values, if your file has a 1-line header

Newskooler Over a year ago

Bear in mind that this creates a 2d array: e.g. (1000, 1). np.genfromtxt does not do that: e.g. (1000,).
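
The shape difference is easy to check on a made-up single-column file. This is a sketch only; ravel() is one way to flatten the pandas result if a 1-D array is wanted.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical single-column data without a header.
one_column = "1.0\n2.0\n3.0\n"

from_pandas = pd.read_csv(io.StringIO(one_column), header=None).to_numpy()
from_numpy = np.genfromtxt(io.StringIO(one_column), delimiter=",")

print(from_pandas.shape)  # (3, 1)
print(from_numpy.shape)   # (3,)

# Flatten the pandas result to match the genfromtxt shape.
flat = from_pandas.ravel()
```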

José L. Patiño Over a year ago

The OP is asking for Numpy arrays, not about Pandas Dataframe objects.

Lee Over a year ago

@JoséL.Patiño The second part of the question deals with the request for a Numpy record array. The first part of the answer shows df.values which gives a Numpy representation of the DataFrame; a convenient method imho.

Omid · Accepted Answer · 2021-12-07 05:02:06Z

94

I tried it :

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus :

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    data = [row for row in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries and not on the interpreter as much as NumPy does. I suspect the pandas method would have similar interpreter overhead.
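
For the record-array part of the question, the csv route can be combined with np.rec.fromrecords (the public alias of the fromrecords function the question mentions). This is a sketch on made-up data; the column names and the per-field conversions are assumptions and have to match the real file.

```python
import csv
import io
import numpy as np

# Hypothetical sample data with a header row.
csv_text = "label,qty,score\nalpha,1,2.5\nbeta,3,4.5\n"

with io.StringIO(csv_text) as f:
    reader = csv.reader(f)
    header = next(reader)
    # csv.reader yields strings, so convert each field explicitly.
    rows = [(label, int(qty), float(score)) for label, qty, score in reader]

# Field types are inferred from the converted Python values.
rec = np.rec.fromrecords(rows, names=header)

print(rec.dtype)
print(rec.score)
```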

edited Dec 7, 2021 at 5:02

Omid

answered Feb 17, 2015 at 3:52

William komp

2 Comments

Matthias Fripp Over a year ago

I tested code similar to this with a csv file containing 2.6 million rows and 8 columns. numpy.recfromcsv() took about 45 seconds, np.asarray(list(csv.reader())) took about 7 seconds, and pandas.read_csv() took about 2 seconds (!). (The file had recently been read from disk in all cases, so it was already in the operating system's file cache.) I think I'll go with pandas.

Matthias Fripp Over a year ago

I just noticed there are some notes about the design of pandas' fast csv parser at wesmckinney.com/blog/… . The author takes speed and memory requirements pretty seriously. At the time it was also possible to pass as_recarray=True to get the result directly as a NumPy record array rather than a pandas dataframe; that option was later deprecated in favor of calling to_records() on the DataFrame.

jkmartindale · Accepted Answer · 2020-10-26 08:49:15Z

70

You can also try recfromcsv() which can guess data types and return a properly formatted record array.
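
As far as I am aware, recfromcsv() is a thin wrapper around genfromtxt() and is no longer available in the main namespace of NumPy 2.0, so a roughly equivalent call may be more portable. The sketch below uses made-up data; note that recfromcsv() lowercases column names by default, while this call keeps them as written.

```python
import io
import numpy as np

# Hypothetical sample data with a header row.
raw = b"Label,Qty,Score\nalpha,1,2.5\nbeta,3,4.5\n"

# names=True reads field names from the first row; dtype=None guesses types.
data = np.genfromtxt(
    io.BytesIO(raw), delimiter=",", names=True, dtype=None, encoding="utf-8"
)

# Viewing the structured array as a recarray enables attribute access.
rec = data.view(np.recarray)

print(rec.dtype.names)
print(rec.Score)
```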

edited Oct 26, 2020 at 8:49

jkmartindale

answered Jan 18, 2011 at 12:44

btel

1 Comment

eacousineau Over a year ago

If you want to maintain ordering / column names in the CSV, you can use the following invocation: numpy.recfromcsv(fname, delimiter=',', filling_values=numpy.nan, case_sensitive=True, deletechars='', replace_space=' ') The key arguments are the last three.

Peter Mortensen · Accepted Answer · 2018-07-15 08:29:04Z

Having tried both NumPy and pandas, I found that pandas has several advantages:

Faster (about 3 s versus 24 s elapsed in the test below)
Less CPU time
About 1/3 of the RAM used by NumPy genfromtxt

This is my test code:

$ for f in test_pandas.py test_numpy_csv.py ; do  /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps

23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

Data file:

du -h ~/me/notebook/train.csv
 59M    /home/hvn/me/notebook/train.csv

With NumPy and pandas at versions:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2

Guillaume Jacquenot · Accepted Answer · 2018-05-24 16:42:46Z

10

Using numpy.loadtxt

A fairly simple method. With its default settings, it requires every element to be numeric (float, int, and so on).

import numpy as np 
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)
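
A short sketch of two common variations, on made-up data: skiprows=1 skips a header line, and a structured dtype gives named fields instead of a plain 2-D float array. The field names x and y are assumptions for the example.

```python
import io
import numpy as np

# Hypothetical sample data with a one-line header.
csv_text = "x,y\n1.0,2.0\n3.0,4.5\n"

# Plain 2-D float array; the header line is skipped.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Structured array with one named field per column.
structured = np.loadtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skiprows=1,
    dtype=[("x", "f8"), ("y", "f8")],
)

print(data.shape)       # (2, 2)
print(structured["y"])  # the second column, by name
```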

edited May 24, 2018 at 16:42

Guillaume Jacquenot

answered Jan 30, 2018 at 11:34

Xiaojian Chen

1 Comment

Konstantin F Over a year ago

You can also use this: data2 = np.genfromtxt('c:\\1.csv', delimiter=',')

Peter Mortensen · Accepted Answer · 2018-07-15 08:27:15Z

7

You can use this code to send CSV file data into an array:

import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)

edited Jul 15, 2018 at 8:27

Peter Mortensen

answered Jun 21, 2017 at 7:52

chamzz.dot

Butiri Dan · Accepted Answer · 2019-08-25 18:04:53Z

7

This works like a charm...

import csv
with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

import numpy as np
# np.float was removed from NumPy; the built-in float means the same here.
data = np.array(data, dtype=float)

edited Aug 25, 2019 at 18:04

Butiri Dan

answered Aug 25, 2019 at 17:18

Nihal Sargaiya

LayneSadler · Accepted Answer · 2020-09-26 12:10:20Z

7

This is the easiest way:

import csv
with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

Now each entry in data is a record, represented as a list of strings, so you have a 2D list of lists. It saved me so much time.
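
Note that csv.reader returns every field as a string, so a conversion step is still needed to get a numeric NumPy array. A minimal sketch on made-up, purely numeric data:

```python
import csv
import io
import numpy as np

# Hypothetical sample data; every field is numeric.
csv_text = "1,2,3\n4,5.5,6\n"

with io.StringIO(csv_text) as f:
    data = list(csv.reader(f))

print(data)  # a list of lists of strings

# Convert the strings to floats while building the array.
arr = np.array(data, dtype=float)
print(arr.shape)  # (2, 3)
```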

edited Sep 26, 2020 at 12:10

LayneSadler

answered Jun 13, 2018 at 21:00

matthewpark319

1 Comment

Chris Over a year ago

Why should we have to screw around with Pandas, when these tools have so much less feature bloat?

Peter Mortensen · Accepted Answer · 2018-07-15 08:30:23Z

6

I would suggest using tables (pip3 install tables). You can save your .csv file to .h5 using pandas (pip3 install pandas):

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

You can then load your data into a NumPy array easily and quickly, even for huge amounts of data.

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()

# Data in NumPy format
data = data.values

edited Jul 15, 2018 at 8:30

Peter Mortensen

answered Jun 22, 2018 at 9:39

Jatin Mandav

Mokhamad Arfan Wicaksono · Accepted Answer · 2021-09-03 15:41:06Z

6

Available in recent pandas versions (DataFrame.to_numpy() was added in pandas 0.24):

import pandas as pd
import numpy as np

data = pd.read_csv('data.csv', header=None)

# Discover, visualize, and preprocess data using pandas if needed.

data = data.to_numpy()
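
One thing to watch with to_numpy(): if the columns have different types, the result falls back to a generic object array. The sketch below, on made-up data, selects only the numeric columns before converting; the column positions 1 and 2 are specific to this example.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column and two numeric columns.
csv_text = "alpha,1,2.5\nbeta,3,4.5\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Mixed column types give an object array.
mixed = df.to_numpy()

# Selecting the numeric columns first gives a float array.
numeric = df[[1, 2]].to_numpy()

print(mixed.dtype)
print(numeric.dtype)
```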

edited Sep 3, 2021 at 15:41

answered Aug 26, 2021 at 3:25

Mokhamad Arfan Wicaksono

Hamid Rouhani · Accepted Answer · 2017-08-12 19:45:07Z

4

I tried this:

import pandas as p
import numpy as n

closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)

edited Aug 12, 2017 at 19:45

Hamid Rouhani

answered Aug 3, 2017 at 8:02

muTheTechie

kdurant · Accepted Answer · 2021-01-13 04:19:13Z

0

In [329]: %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s

In [330]: %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s

answered Jan 13, 2021 at 4:19

kdurant

1 Comment

Ruli Over a year ago

Please edit the question with some more information about your solution.

Ovu Sunday · Accepted Answer · 2022-08-02 01:28:41Z

-1

This is a very simple task. The best way to do it is as follows:

import pandas as pd
import numpy as np


# Read the file. The r prefix makes this a raw string, so the backslashes in
# the Windows path are not treated as escape characters. Include the file
# name and its ".csv" extension at the end of the path.
df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')

print(df)

y = np.array(df)

edited Aug 2, 2022 at 1:28

answered Aug 2, 2022 at 1:19

Ovu Sunday

2 Comments

user3503711 Over a year ago

The OP asked to read directly to numpy array. Reading it as a dataframe and converting it to numpy array requires more storage and time.

Ovu Sunday Over a year ago

Yes, that's correct. But I just gave another possible way of doing the same thing, if the above doesn't work
