
I'm new to Python and want the most pythonic way of solving the following basic problem:

I have many plain-text data files file.00001, file.00002, ..., file.99999 and each file has a single line, with numeric data stored in e.g. four columns. I want to read each file sequentially and append the data into one array per column, so in the end I want the arrays arr0, arr1, arr2, arr3 each with shape=(99999,) containing all the data from the appropriate column in all the files.

Later on I want to do lots of math with these arrays so I need to make sure that their entries are contiguous in memory. My naive solution is:

import numpy as np
fnumber = 99999
fnums = np.arange(1, fnumber+1)

arr0 = np.full_like(fnums, np.nan, dtype=np.double)
arr1 = np.full_like(fnums, np.nan, dtype=np.double)
arr2 = np.full_like(fnums, np.nan, dtype=np.double)
arr3 = np.full_like(fnums, np.nan, dtype=np.double)
# ...also is there a neat way of doing this??

for fnum in fnums:
    fname = f'path/to/data/folder/file.{fnum:05}'
    arr0[fnum-1], arr1[fnum-1], arr2[fnum-1], arr3[fnum-1] = np.loadtxt(fname, delimiter=' ', unpack=True)

# error checking - in case a file got deleted or something
all_arrs = (arr0, arr1, arr2, arr3)
if np.isnan(all_arrs).any():
    print("CUIDADO HAY NANS!!!!\nLOOK OUT, THERE ARE NANS!!!!")

It strikes me that this is very C-thinking and there probably is a more pythonic way of doing it. But my feeling is that methods like numpy.concatenate and numpy.insert would either not result in arrays with their contents contiguous in memory, or involve deep copies of each array at every step in the for loop, which would probably melt my laptop.

Is there a more pythonic way?

  • all_arrs is a tuple of arrays. np.isnan will first turn that into one array (as np.array(all_arrs) would), returning a boolean array of shape (4, 99999). Commented Nov 4, 2020 at 22:16

1 Answer


Try:

alist = []
for fnum in fnums:
    fname = f'path/to/data/folder/file.{fnum:05}'
    alist.append(np.loadtxt(fname))
arr = np.array(alist)
# arr = np.vstack(alist)    # alternative
print(arr.shape)

Assuming the files all have the same number of columns, one of these should work. The result will be one array, which you could separate into 4 if needed.
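To address the asker's contiguity concern when separating the stacked array into four column arrays: since NumPy arrays are row-major by default, a column slice of the stacked result is a strided view, not contiguous; a `.copy()` of each column fixes that. A minimal sketch, using a toy array in place of the real stacked result:

```python
import numpy as np

# Toy stand-in for the stacked result -- in practice arr would come from
# np.vstack(alist) and have shape (n_files, 4).
arr = np.arange(16.0).reshape(4, 4)

# arr is C-ordered (row-major), so a column slice is a strided view,
# not a contiguous array:
print(arr[:, 0].flags['C_CONTIGUOUS'])   # False

# .copy() gives each column its own contiguous buffer.
arr0, arr1, arr2, arr3 = (arr[:, i].copy() for i in range(4))
print(arr0.flags['C_CONTIGUOUS'])        # True
```

`np.ascontiguousarray(arr[:, i])` does the same job and reads more explicitly.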


4 Comments

How does this work? Inside the for loop you append to alist, and then outside you recast(??) alist as an ndarray, which takes care of the efficient memory allocation? Seems ok, except since numpy arrays are row-major, aren't contiguous elements in a column 4 memory addresses apart rather than right next to each other? (Also, your method stores the same data twice, once in alist and once in arr, but I guess that's not a problem with only a few thousand numbers.)
loadtxt loads the file line by line and builds the result from the resulting list of lists. It really doesn't matter whether you collect a list of arrays and make an array from those, or assign those arrays to slices of a predefined array. Memory use and copying are basically the same. Remember, a list contains references/pointers to objects elsewhere in memory.
appending arrays to a list is quite efficient since it just adds a reference to the list. Doing concatenate iteratively is slow, but doing one copy at the end, joining all arrays into one is better. But feel free to compare alternatives (on smaller problems if needed).
My feeling is that this solution is definitely more pythonic, but involves passing the same data around a few more times. And if I create my arr0 through arr3 by slicing your arr, it will leave them as arrays that aren't contiguous in memory. If I copy them, it should be ok.
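The claim in the comments above, that appending references to a list and joining once is cheap while concatenating iteratively re-copies everything on each step, can be checked with a quick sketch (the sizes and fake rows here are illustrative, not from the original thread):

```python
import numpy as np
import timeit

# Fake per-file rows standing in for the np.loadtxt results.
rng = np.random.default_rng(0)
rows = [rng.random(4) for _ in range(1000)]

def one_join():
    # Collect references in a list, copy once at the end.
    return np.vstack(rows)

def iterative_concat():
    # Re-copies the whole accumulated array on every iteration.
    out = np.empty((0, 4))
    for r in rows:
        out = np.concatenate([out, r[None, :]])
    return out

# Both give the same (1000, 4) result...
assert np.array_equal(one_join(), iterative_concat())

# ...but the single join is dramatically faster.
print('one join:  ', timeit.timeit(one_join, number=20))
print('iterative: ', timeit.timeit(iterative_concat, number=20))
```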

