
I am reading a list of CSV files and appending each file's data as a new column of my array. My current solution is analogous to the following:

import numpy as np

# Random generator and paths for the sake of reproducibility 
fake_read_csv = lambda path: np.random.random(5) 
paths = ['a','b','c','d']

first_iteration=True
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    if first_iteration:
        first_iteration=False
        pred = sub
    else:
        pred = np.c_[pred, sub] # append to a new column
print(pred)

I was wondering if it is possible to simplify the loop. For example, something like this:

import numpy as np
fake_read_csv = lambda path: np.random.random(5)
paths = ['a','b','c','d']

pred = np.array([])
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    pred = np.c_[pred, sub] # append to a new column

Which raises the error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
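The mismatch comes from the shapes: `np.array([])` has shape `(0,)` while each `sub` has shape `(5,)`, so `np.c_` cannot stack them. As a minimal sketch, assuming the number of rows is known up front (here 5, matching the stub reader), starting from an empty two-dimensional array makes the simplified loop work:

```python
import numpy as np

# Same stubs as in the question
fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

# A 2-D array with 5 rows and 0 columns, so every np.c_ call
# appends one column to a shape-compatible array.
pred = np.empty((5, 0))
for path in paths:
    sub = fake_read_csv(path)
    pred = np.c_[pred, sub]  # append as a new column

print(pred.shape)  # (5, 4)
```

Note that this still reallocates the array on every iteration, which the answer below avoids.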
  • Why do you want to do that in numpy? Maybe use something like this: stackoverflow.com/a/21232849/10197418 ? Commented Dec 6, 2019 at 14:10
  • @MrFuppes memory constraint, but thanks for the hint anyway! Commented Dec 6, 2019 at 14:14
  • @FernandoWittmann. The suggested method should use a lot less memory than what you're doing Commented Dec 6, 2019 at 14:20
  • Consider pandas, it's convenient for handling csv and tabular data. Commented Dec 6, 2019 at 14:28
  • @QuangHoang I am currently using pandas, however, in the end, I will have to convert to np.array in order to be used as input of a Keras model. As I might have memory constraints (each CSV has 1Gb), I am considering reading each file directly into a numpy array instead of reading everything as a pandas dataframe and then converting to numpy array later. Commented Dec 6, 2019 at 14:45

1 Answer


For starters, every time you append, an entirely new array is allocated, which is quite wasteful. Instead, you can just combine all your columns once they're loaded:

pred = np.array([fake_read_csv(path) for path in paths], order='F').T

The transpose turns the rows you read in into columns. order='F' ensures that the memory layout of the transposed result matches that of the array built in your question.
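A quick sanity check of the one-liner, using the question's `fake_read_csv` stub: the result has one column per path, and the transpose of the Fortran-ordered array is C-contiguous, like the result of the original `np.c_` loop.

```python
import numpy as np

fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

# Stack the rows, then transpose so each file becomes a column
pred = np.array([fake_read_csv(path) for path in paths], order='F').T

print(pred.shape)                  # (5, 4)
print(pred.flags['C_CONTIGUOUS'])  # True
```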

If you want, you can preallocate the buffer, either by knowing the number of rows up front or by loading the first array before the loop. Here's an example of the latter:

first = fake_read_csv(paths[0])
buffer = np.zeros((first.size, len(paths)))
buffer[:, 0] = first
for col, path in enumerate(paths[1:], start=1):
    buffer[:, col] = fake_read_csv(path)

If your concern is calling the reader function multiple times, you can allocate the array in the loop, like this:

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

This option has the additional advantage that it does not require any extra checking to see whether you already have data.
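As a quick check on the loop above (again with the question's stub reader), the buffer ends up with one column per path and the expected number of rows:

```python
import numpy as np

fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        # Allocate once, sized from the first file that was read
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

print(buffer.shape)  # (5, 4)
```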


3 Comments

Thanks for the one-liner and the buffer! However, if I can't use list comprehension and I don't know the number of rows up front, then fake_read_csv will have to appear twice in the code, right?
@FernandoWittmann. You could convert the list comprehension into a for-loop, but the idea is that the first one loads all the columns separately at the same time, then concatenates them (using 2N memory), while the second one preallocates the buffer and only holds one additional column in memory at a time
@FernandoWittmann. I've added a third option that does what the second one does, but only calls the reader inside the loop.
