
I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.

I tried reading the file via pandas read_csv in Python, but it takes a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options, such as numpy fromfile, but nothing seems to work.

Can someone please suggest a way to load a file with this many columns in Python?

  • csv is very inefficient for storing large datasets. You should convert your csv file into a better-suited format. Try hdf5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory. Commented Jun 29, 2017 at 21:28
  • Thanks for the suggestion. I will try creating an h5py file. The file I am loading is generated by C++ code, so I will check whether HDF5 has a C++ API. Commented Jun 29, 2017 at 21:32
  • There is an official C++ HDF5 API. In fact, the Python libraries are just bindings for it. Commented Jun 29, 2017 at 21:33
  • That's cool then. It perfectly fits my requirements. Can you add this suggestion as an answer? I will accept it. Commented Jun 29, 2017 at 21:35

3 Answers


Pandas/NumPy should be able to handle that volume of data with no problem. I hope you have at least 8 GB of RAM on that machine. To import a CSV file with NumPy, try something like:

import numpy as np

# pick a dtype that matches your data; uint8 assumes small non-negative integers
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')

If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split, then pass that to pandas or NumPy, assuming that is how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. hdf5 was already mentioned and is a good option. You can also save a NumPy array to disk with numpy.savez or, my favorite, the speedy bloscpack.(un)pack_ndarray_file.
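To make that concrete, here is a minimal sketch of the list-of-lists approach, assuming purely numeric data, a comma delimiter, and no header row; 'test.csv' and 'test.npz' are placeholder names:

import numpy as np

rows = []
with open('test.csv') as f:          # stream the file one row at a time
    for line in f:
        rows.append(line.rstrip('\n').split(','))

# roughly 300 x 4,000,000; float32 halves the memory cost of float64
data = np.asarray(rows, dtype=np.float32)

# save in a binary format so later loads skip CSV parsing entirely
np.savez('test.npz', data=data)
# later: data = np.load('test.npz')['data']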


1 Comment

I have 128 GB of RAM on the machine. I will try converting it to hdf5 and will check if it works.

csv is very inefficient for storing large datasets. You should convert your csv file into a better-suited format. Try hdf5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory.
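As a hedged sketch of that conversion with h5py (pytables would work just as well), assuming numeric data, a comma delimiter, and no header row; 'test.csv' and 'test.h5' are placeholder names:

import numpy as np
import h5py

with open('test.csv') as f, h5py.File('test.h5', 'w') as h5:
    first = f.readline().rstrip('\n').split(',')
    n_cols = len(first)
    # resizable dataset written one row at a time, so peak memory stays low;
    # the chunk layout is a guess and may need tuning for your access pattern
    dset = h5.create_dataset('data', shape=(1, n_cols), maxshape=(None, n_cols),
                             dtype='float32', chunks=(1, n_cols))
    dset[0] = np.asarray(first, dtype=np.float32)
    for i, line in enumerate(f, start=1):
        dset.resize(i + 1, axis=0)
        dset[i] = np.asarray(line.rstrip('\n').split(','), dtype=np.float32)

# later, read just a slice without loading the whole dataset:
with h5py.File('test.h5', 'r') as h5:
    first_ten_columns = h5['data'][:, :10]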



According to this answer, pandas (which you already tried) is the fastest library available for reading a CSV in Python, or at least it was in 2014.
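If you do retry pandas, a couple of read_csv hints sometimes help on very wide, purely numeric files; these options and the file name are assumptions on my part, not something the linked answer states:

import numpy as np
import pandas as pd

# header=None skips header inference; a fixed float32 dtype avoids
# per-column type guessing across 4 million columns
df = pd.read_csv('test.csv', header=None, dtype=np.float32, engine='c')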

2 Comments

  • Not quite answering my question.
  • You asked an XY question.
