
I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.

I tried reading the file via pandas read_csv in Python, but it takes a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options, such as numpy fromfile, but nothing seems to work.

Can someone please suggest a way to load a file with this many columns in Python?

  • csv is very inefficient for storing large datasets. You should convert your csv file into a better-suited format. Try hdf5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory. Commented Jun 29, 2017 at 21:28
  • Thanks for the suggestion. I will try creating an h5py file. The file I am loading is generated by C++ code, so I will check whether HDF5 has a C++ API. Commented Jun 29, 2017 at 21:32
  • There is an official C++ HDF5 API. In fact, the Python libraries are just bindings for it. Commented Jun 29, 2017 at 21:33
  • That's cool then. It perfectly fits my requirements. Can you add this suggestion as an answer? I will accept it. Commented Jun 29, 2017 at 21:35

3 Answers


Pandas/NumPy should be able to handle that volume of data with no problem. I hope you have at least 8 GB of RAM on that machine. To import a CSV file with NumPy, try something like:

import numpy as np

# pick a dtype that matches your data; uint8 assumes small non-negative integers
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')

If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split, then pass that to pandas or NumPy, assuming that is how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. hdf5 was already mentioned and is a good option. You can also save a NumPy array to disk with numpy.savez or, my favorite, the speedy bloscpack.(un)pack_ndarray_file.
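To make that concrete, here is a minimal sketch of the list-of-lists approach, assuming purely numeric data, a comma delimiter, and no header row; 'test.csv' and 'test.npz' are placeholder names:

import numpy as np

rows = []
with open('test.csv') as f:          # stream the file one row at a time
    for line in f:
        rows.append(line.rstrip('\n').split(','))

# roughly 300 x 4,000,000; float32 halves the memory cost of float64
data = np.asarray(rows, dtype=np.float32)

# save in a binary format so later loads skip CSV parsing entirely
np.savez('test.npz', data=data)
# later: data = np.load('test.npz')['data']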


1 Comment

I have 128 GB of RAM on the machine. I will try converting it to hdf5 and will check if it works.

csv is very inefficient for storing large datasets. You should convert your csv file into a better-suited format. Try hdf5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory.
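As a hedged sketch of that conversion with h5py (pytables would work just as well), assuming numeric data, a comma delimiter, and no header row; 'test.csv' and 'test.h5' are placeholder names:

import numpy as np
import h5py

with open('test.csv') as f, h5py.File('test.h5', 'w') as h5:
    first = f.readline().rstrip('\n').split(',')
    n_cols = len(first)
    # resizable dataset written one row at a time, so peak memory stays low;
    # the chunk layout is a guess and may need tuning for your access pattern
    dset = h5.create_dataset('data', shape=(1, n_cols), maxshape=(None, n_cols),
                             dtype='float32', chunks=(1, n_cols))
    dset[0] = np.asarray(first, dtype=np.float32)
    for i, line in enumerate(f, start=1):
        dset.resize(i + 1, axis=0)
        dset[i] = np.asarray(line.rstrip('\n').split(','), dtype=np.float32)

# later, read just a slice without loading the whole dataset:
with h5py.File('test.h5', 'r') as h5:
    first_ten_columns = h5['data'][:, :10]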



According to this answer, pandas (which you already tried) is the fastest library available for reading a CSV in Python, or at least it was in 2014.
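If you do retry pandas, a couple of read_csv hints sometimes help on very wide, purely numeric files; these options and the file name are assumptions on my part, not something the linked answer states:

import numpy as np
import pandas as pd

# header=None skips header inference; a fixed float32 dtype avoids
# per-column type guessing across 4 million columns
df = pd.read_csv('test.csv', header=None, dtype=np.float32, engine='c')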

2 Comments

  • Not quite answering my question.
  • You asked an XY question.
