
I am reading a list of CSV files and appending each file's data as a new column of my array. My current solution is analogous to the following:

import numpy as np

# Random generator and paths for the sake of reproducibility 
fake_read_csv = lambda path: np.random.random(5) 
paths = ['a','b','c','d']

first_iteration=True
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    if first_iteration:
        first_iteration=False
        pred = sub
    else:
        pred = np.c_[pred, sub] # append to a new column
print(pred)

I was wondering if it is possible to simplify the loop. For example, something like this:

import numpy as np
fake_read_csv = lambda path: np.random.random(5)
paths = ['a','b','c','d']

pred = np.array([])
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    pred = np.c_[pred, sub] # append to a new column

Which raises the error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
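The mismatch comes from the shapes: `np.array([])` has shape `(0,)` while each `sub` has shape `(5,)`, so `np.c_` cannot stack them. As a minimal sketch, assuming the number of rows is known up front (here 5, matching the stub reader), starting from an empty two-dimensional array makes the simplified loop work:

```python
import numpy as np

# Same stubs as in the question
fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

# A 2-D array with 5 rows and 0 columns, so every np.c_ call
# appends one column to a shape-compatible array.
pred = np.empty((5, 0))
for path in paths:
    sub = fake_read_csv(path)
    pred = np.c_[pred, sub]  # append as a new column

print(pred.shape)  # (5, 4)
```

Note that this still reallocates the array on every iteration, which the answer below avoids.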
  • Why do you want to do that in numpy? Maybe use something like this: stackoverflow.com/a/21232849/10197418 ? Commented Dec 6, 2019 at 14:10
  • @MrFuppes memory constraint, but thanks for the hint anyway! Commented Dec 6, 2019 at 14:14
  • @FernandoWittmann. The suggested method should use a lot less memory than what you're doing Commented Dec 6, 2019 at 14:20
  • Consider pandas, it's convenient for handling csv and tabular data. Commented Dec 6, 2019 at 14:28
  • @QuangHoang I am currently using pandas, however, in the end, I will have to convert to np.array in order to be used as input of a Keras model. As I might have memory constraints (each CSV has 1Gb), I am considering reading each file directly into a numpy array instead of reading everything as a pandas dataframe and then converting to numpy array later. Commented Dec 6, 2019 at 14:45

1 Answer


For starters, every time you append, an entirely new array is allocated, which is quite wasteful. Instead, you can just combine all your columns once they're loaded:

pred = np.array([fake_read_csv(path) for path in paths], order='F').T

The transpose turns the rows you read in into columns. order='F' ensures that the memory layout of the transposed result matches that of the array built in your question.
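A quick sanity check of the one-liner, using the question's `fake_read_csv` stub: the result has one column per path, and the transpose of the Fortran-ordered array is C-contiguous, like the result of the original `np.c_` loop.

```python
import numpy as np

fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

# Stack the rows, then transpose so each file becomes a column
pred = np.array([fake_read_csv(path) for path in paths], order='F').T

print(pred.shape)                  # (5, 4)
print(pred.flags['C_CONTIGUOUS'])  # True
```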

If you want, you can preallocate the buffer, either by knowing the number of rows up front or by loading the first array before the loop. Here's an example of the latter:

first = fake_read_csv(paths[0])
buffer = np.zeros((first.size, len(paths)))
buffer[:, 0] = first
for col, path in enumerate(paths[1:], start=1):
    buffer[:, col] = fake_read_csv(path)

If your concern is calling the reader function multiple times, you can allocate the array in the loop, like this:

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

This option has the additional advantage that it does not require any extra checking to see whether you already have data.
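As a quick check on the loop above (again with the question's stub reader), the buffer ends up with one column per path and the expected number of rows:

```python
import numpy as np

fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        # Allocate once, sized from the first file that was read
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

print(buffer.shape)  # (5, 4)
```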


3 Comments

Thanks for the one-liner and the buffer! However, if I can't use list comprehension and I don't know the number of rows up front, then fake_read_csv will have to appear twice in the code, right?
@FernandoWittmann. You could convert the list comprehension into a for-loop, but the idea is that the first one loads all the columns separately at the same time, then concatenates them (using 2N memory), while the second one preallocates the buffer and only holds one additional column in memory at a time
@FernandoWittmann. I've added a third option that does what the second one does, but only calls the reader inside the loop.
