
I have a dataset that's larger than memory and I need to process it. I'm not experienced in this area, so any direction would help.

I've mostly figured out how to load the raw data in chunks, but I need to process it and save the results, which are likely to also be larger than memory. I've seen that pandas, NumPy, and Python all support some form of memmap, but I don't quite understand how to use it. I expected an abstraction that would let me use my disk the way I use my RAM, interacting with the object saved on disk like a normal Python/NumPy/etc. object... but that isn't working for me at all:

>>> import numpy as np
>>> # Create a file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it, hoping the changes will be reflected in the file on disk
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None

So the resize isn't being saved to disk.

Any suggestions?

  • You might want to look into dask, which was designed for data that doesn't fit into memory; it has a pandas-like interface. dask.org (see the sketch after these comments) Commented May 4, 2022 at 2:43
  • Thanks @jakub! I looked at dask and implemented a solution! However, I'm struggling to use my compressed files and parallelize the work. I'm considering migrating my entire dataset to a different format, or maybe even a database, as dask has been proving difficult in that area. Commented May 13, 2022 at 13:39
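A minimal sketch of the kind of dask workflow that comment points at (the file pattern and the 'key'/'value' column names are placeholders, not from the question):

import dask.dataframe as dd

# Lazily reference many CSV chunks on disk; nothing is loaded into memory yet.
df = dd.read_csv('chunks/part-*.csv')  # hypothetical file pattern

# The API mirrors pandas, but each operation just extends a task graph.
means = df.groupby('key')['value'].mean()

# compute() runs the graph out-of-core, streaming chunks through memory.
print(means.compute())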

1 Answer


np.require() makes a copy of the memmap array, since a memmap doesn't "own" its data. Also, according to the open_memmap() docs, you have to specify the shape when you open a file for writing; otherwise it writes "None" as the shape in the .npy header, which is exactly what makes the later open_memmap() call for y fail.
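A minimal sketch of that fix, reusing the file name from the question (the float64 dtype and length 10 are just for illustration), and dropping the np.require() wrapper since it copies the array into memory:

import numpy as np

# Pass shape (and dtype) up front so a valid .npy header is written.
x = np.lib.format.open_memmap('bla.npy', mode='w+', dtype=np.float64, shape=(10,))
x[:] = range(10)  # writes go through the mapping to the file
x.flush()         # make sure the data reaches disk

# Re-opening now succeeds because the header records the real shape.
y = np.lib.format.open_memmap('bla.npy', mode='r')
print(y)  # memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])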

It looks like memmap arrays don't support resizing with .resize() (see the linked numpy issue), but there's a workaround in this SO answer if you need that.
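That answer isn't reproduced here, but the usual trick is to grow the backing file on disk and then re-map it with the new shape. A minimal sketch for a raw np.memmap (a headerless binary file; a .npy file would additionally need the shape field in its header rewritten), where grow_memmap is a hypothetical helper:

import numpy as np

def grow_memmap(filename, dtype, new_shape):
    """Hypothetical helper: extend the backing file, then re-map it."""
    new_bytes = int(np.prod(new_shape)) * np.dtype(dtype).itemsize
    # truncate() to a larger size extends the file; the new bytes read as zeros.
    with open(filename, 'r+b') as f:
        f.truncate(new_bytes)
    # Map the enlarged file with the new shape.
    return np.memmap(filename, dtype=dtype, mode='r+', shape=new_shape)

# Usage: start with 5 elements, grow to 10 in place on disk.
a = np.memmap('data.bin', dtype=np.float64, mode='w+', shape=(5,))
a[:] = range(5)
a.flush()
del a  # release the old, smaller mapping before re-mapping
b = grow_memmap('data.bin', np.float64, (10,))
print(b)  # the first 5 values are preserved; the rest read as zeros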


1 Comment

I need a solution without specifying the shape manually.
