
I've got a set of large ASCII data files that I need to read into a NumPy array. By large, I mean 390 lines, where each line contains 60,000 double-precision values (written with high precision by a C++ program) separated by spaces.

Currently I am using the following (naive) code:

import numpy as np
data_array = np.genfromtxt('l_sim_s_data.txt')

However, this takes upwards of 25 seconds to run. I suspect the slowness comes from not preallocating data_array before the values are read in. Is there any way to tell genfromtxt the size of the array it will create (so the memory could be preallocated up front)? Or does anyone have an idea how to speed this process up?

  • For whatever it's worth, np.fromiter will pre-allocate if you give it a count argument. You can exploit this to make a np.loadtxt equivalent that pre-allocates the array. Commented Jul 10, 2011 at 20:32
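The comment above can be sketched out concretely. This is a minimal, hypothetical loadtxt equivalent (the helper name load_prealloc is my own); passing count to np.fromiter lets NumPy allocate the output buffer once instead of growing it while reading:

```python
import numpy as np

def load_prealloc(lines, n_rows, n_cols):
    """Hypothetical loadtxt equivalent: np.fromiter with count=
    preallocates the flat output array instead of growing it."""
    def values():
        # yield every whitespace-separated value as a float
        for line in lines:
            for tok in line.split():
                yield float(tok)
    flat = np.fromiter(values(), dtype=np.float64, count=n_rows * n_cols)
    return flat.reshape(n_rows, n_cols)

# usage (for the file described in the question):
# with open('l_sim_s_data.txt') as f:
#     data_array = load_prealloc(f, 390, 60000)
```

Whether this beats np.loadtxt in practice depends on how much time is spent in the Python-level float() calls versus the allocation, so it is worth timing on the real file.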

1 Answer


Have you tried np.loadtxt?

(genfromtxt is a more advanced file loader, which handles things like missing values and format converters.)
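For a clean, fully numeric file like the one described, it is a drop-in replacement. A minimal sketch (using a small in-memory sample here instead of the actual file):

```python
import io
import numpy as np

# stand-in for the real file; for the question's data this would be
# data_array = np.loadtxt('l_sim_s_data.txt')
sample = io.StringIO("1.0 2.0 3.0\n4.0 5.0 6.0\n")
data_array = np.loadtxt(sample)
print(data_array.shape)  # (2, 3)
```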


4 Comments

Somehow my comment from yesterday didn't show up. Thanks for the suggestion on np.loadtxt. It does make a difference, but not as big as I would like (25 seconds with np.genfromtxt goes down to ~20 seconds with np.loadtxt). In C++, with a preallocated array, I am able to get the read time down to under a second. If there isn't a better way of doing this in Python, I'll look at making a C++-based Python module specifically for loading this data into a numpy array directly.
@Mark: huh. I've rootled around in the numpy docs and you are right: loading into a preallocated array does seem to be rather missing. A thought: have you tried just loading the array in pure Python? Something like for i, line in enumerate(file): data_array[i, :] = [float(x) for x in line.split()]. The IO is pretty heavily buffered so this might actually be snappy.
Thanks for that suggestion- it does make a big difference. Loading the file with the loop takes around 8-10 seconds. Not quite C++ speed, but now we're starting to get somewhere. ;)
@Mark: you could look into doing the iteration using something like scipy.weave.inline to avoid having to make a C++ module.
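The pure-Python loop from the comments above, written out as a runnable sketch (with a small in-memory sample standing in for the real file, and the array sized to match):

```python
import io
import numpy as np

n_rows, n_cols = 2, 3  # 390 and 60000 for the file in the question
data_array = np.empty((n_rows, n_cols), dtype=np.float64)  # preallocate once

# stand-in for open('l_sim_s_data.txt')
f = io.StringIO("1.0 2.0 3.0\n4.0 5.0 6.0\n")
for i, line in enumerate(f):
    # each row is parsed and written straight into the preallocated array
    data_array[i, :] = [float(tok) for tok in line.split()]
```

Because the array is allocated once up front, the per-line work is just parsing and a vectorized row assignment, which is what makes this faster than genfromtxt's more general machinery.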
