
I've got a set of large ASCII data files that I need to read into a NumPy array. By large, I mean 390 lines, where each line contains 60,000 double-precision values (written with high precision by a C++ program) separated by spaces.

Currently I am using the following (naive) code:

import numpy as np
data_array = np.genfromtxt('l_sim_s_data.txt')

However, this takes upwards of 25 seconds to run. I suspect the slowness comes from not preallocating data_array before the values are read in. Is there any way to tell genfromtxt the size of the array it will create (so the memory could be preallocated up front)? Or does anyone have an idea how to speed this process up?

  • For whatever it's worth, np.fromiter will pre-allocate if you give it a count argument. You can exploit this to make a np.loadtxt equivalent that pre-allocates the array. Commented Jul 10, 2011 at 20:32
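The comment above can be sketched out concretely. This is a minimal, hypothetical loadtxt equivalent (the helper name load_prealloc is my own); passing count to np.fromiter lets NumPy allocate the output buffer once instead of growing it while reading:

```python
import numpy as np

def load_prealloc(lines, n_rows, n_cols):
    """Hypothetical loadtxt equivalent: np.fromiter with count=
    preallocates the flat output array instead of growing it."""
    def values():
        # yield every whitespace-separated value as a float
        for line in lines:
            for tok in line.split():
                yield float(tok)
    flat = np.fromiter(values(), dtype=np.float64, count=n_rows * n_cols)
    return flat.reshape(n_rows, n_cols)

# usage (for the file described in the question):
# with open('l_sim_s_data.txt') as f:
#     data_array = load_prealloc(f, 390, 60000)
```

Whether this beats np.loadtxt in practice depends on how much time is spent in the Python-level float() calls versus the allocation, so it is worth timing on the real file.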

1 Answer


Have you tried np.loadtxt?

(genfromtxt is a more advanced file loader, which handles things like missing values and format converters.)
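For a clean, fully numeric file like the one described, it is a drop-in replacement. A minimal sketch (using a small in-memory sample here instead of the actual file):

```python
import io
import numpy as np

# stand-in for the real file; for the question's data this would be
# data_array = np.loadtxt('l_sim_s_data.txt')
sample = io.StringIO("1.0 2.0 3.0\n4.0 5.0 6.0\n")
data_array = np.loadtxt(sample)
print(data_array.shape)  # (2, 3)
```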


4 Comments

Somehow my comment from yesterday didn't show up. Thanks for the suggestion on np.loadtxt. It does make a difference, but not as big as I would like (25 seconds with np.genfromtxt goes down to ~20 seconds with np.loadtxt). In C++, with a preallocated array, I am able to get the read time down to under a second. If there isn't a better way of doing this in Python, I'll look at making a C++-based Python module specifically for loading this data into a numpy array directly.
@Mark: huh. I've rootled around in the numpy docs and you are right: loading into a preallocated array does seem to be rather missing. A thought: have you tried just loading the array in pure Python? Something like for i, line in enumerate(file): data_array[i, :] = [float(x) for x in line.split()]. The IO is pretty heavily buffered so this might actually be snappy.
Thanks for that suggestion- it does make a big difference. Loading the file with the loop takes around 8-10 seconds. Not quite C++ speed, but now we're starting to get somewhere. ;)
@Mark: you could look into doing the iteration using something like scipy.weave.inline to avoid having to make a C++ module.
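The pure-Python loop from the comments above, written out as a runnable sketch (with a small in-memory sample standing in for the real file, and the array sized to match):

```python
import io
import numpy as np

n_rows, n_cols = 2, 3  # 390 and 60000 for the file in the question
data_array = np.empty((n_rows, n_cols), dtype=np.float64)  # preallocate once

# stand-in for open('l_sim_s_data.txt')
f = io.StringIO("1.0 2.0 3.0\n4.0 5.0 6.0\n")
for i, line in enumerate(f):
    # each row is parsed and written straight into the preallocated array
    data_array[i, :] = [float(tok) for tok in line.split()]
```

Because the array is allocated once up front, the per-line work is just parsing and a vectorized row assignment, which is what makes this faster than genfromtxt's more general machinery.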
