
I have a dataset that's larger than memory and I need to process it. I'm not experienced in this area, so any direction would help.

I've mostly figured out how to load the raw data in chunks, but I need to process it and save the results, which are likely to also be larger than memory. I've seen that pandas, NumPy, and Python all support some form of memmap, but I don't quite understand how to use it. I expected an abstraction that would let me use my disk the way I use my RAM, interacting with the object saved on disk like a normal Python/NumPy/etc. object... but that isn't working for me at all:

>>> import numpy as np
>>> # Create a file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it, hoping the changes will be reflected in the file on disk
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None

So the resize isn't being saved to disk.

Any suggestions?

  • You might want to look into dask, which was designed for data that doesn't fit into memory; it has a pandas-like interface. dask.org (see the sketch after these comments) Commented May 4, 2022 at 2:43
  • Thanks @jakub! I looked at dask and implemented a solution! However, I'm struggling to use my compressed files and parallelize the work. I'm considering migrating my entire dataset to a different format, or maybe even a database, as dask has been proving difficult in that area. Commented May 13, 2022 at 13:39
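A minimal sketch of the kind of dask workflow that comment points at (the file pattern and the 'key'/'value' column names are placeholders, not from the question):

import dask.dataframe as dd

# Lazily reference many CSV chunks on disk; nothing is loaded into memory yet.
df = dd.read_csv('chunks/part-*.csv')  # hypothetical file pattern

# The API mirrors pandas, but each operation just extends a task graph.
means = df.groupby('key')['value'].mean()

# compute() runs the graph out-of-core, streaming chunks through memory.
print(means.compute())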

1 Answer


np.require() makes a copy of the memmap array, since a memmap doesn't "own" its data. Also, according to the open_memmap() docs, you have to specify the shape when you open a file for writing; otherwise it writes "None" as the shape in the .npy header, which is exactly what makes the later open_memmap() call for y fail.
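A minimal sketch of that fix, reusing the file name from the question (the float64 dtype and length 10 are just for illustration), and dropping the np.require() wrapper since it copies the array into memory:

import numpy as np

# Pass shape (and dtype) up front so a valid .npy header is written.
x = np.lib.format.open_memmap('bla.npy', mode='w+', dtype=np.float64, shape=(10,))
x[:] = range(10)  # writes go through the mapping to the file
x.flush()         # make sure the data reaches disk

# Re-opening now succeeds because the header records the real shape.
y = np.lib.format.open_memmap('bla.npy', mode='r')
print(y)  # memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])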

It looks like memmap arrays don't support resizing with .resize() (see the linked numpy issue), but there's a workaround in this SO answer if you need that.
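That answer isn't reproduced here, but the usual trick is to grow the backing file on disk and then re-map it with the new shape. A minimal sketch for a raw np.memmap (a headerless binary file; a .npy file would additionally need the shape field in its header rewritten), where grow_memmap is a hypothetical helper:

import numpy as np

def grow_memmap(filename, dtype, new_shape):
    """Hypothetical helper: extend the backing file, then re-map it."""
    new_bytes = int(np.prod(new_shape)) * np.dtype(dtype).itemsize
    # truncate() to a larger size extends the file; the new bytes read as zeros.
    with open(filename, 'r+b') as f:
        f.truncate(new_bytes)
    # Map the enlarged file with the new shape.
    return np.memmap(filename, dtype=dtype, mode='r+', shape=new_shape)

# Usage: start with 5 elements, grow to 10 in place on disk.
a = np.memmap('data.bin', dtype=np.float64, mode='w+', shape=(5,))
a[:] = range(5)
a.flush()
del a  # release the old, smaller mapping before re-mapping
b = grow_memmap('data.bin', np.float64, (10,))
print(b)  # the first 5 values are preserved; the rest read as zeros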


1 Comment

I need a solution without specifying the shape manually.
