
The topic of how to store NumPy arrays in SQLite has been discussed a couple of times. I think there are mainly two solutions:

Now let us suppose the numpy array is huge and only a slice of it is needed from the database. What is the best approach to select and retrieve the sliced data?

SELECT substr(data,start,length)

This will work as long as data is a BLOB and the array was stored as raw np.int8/np.uint8 bytes (without any NumPy metadata) in the BLOB, so that one byte corresponds to one element. What about other data types such as np.uint64?

Of course, it is possible to store the NumPy dtype in the SQLite database, too. Then any sliced-data request would need to adapt the SELECT statement and the start/length information accordingly, i.e., scale them by the number of bytes of the respective data type.
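The scaling described above can be sketched as follows. This is a minimal, illustrative example (table and column names are assumptions, not anything from the question): the dtype is stored alongside the BLOB, and the byte offsets passed to substr() are computed from the dtype's item size. Note that SQLite's substr() is 1-indexed and counts bytes when applied to a BLOB.

```python
import sqlite3
import numpy as np

# Illustrative schema: one row per array, with the dtype stored as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE arrays (id INTEGER PRIMARY KEY, dtype TEXT, data BLOB)")

arr = np.arange(1_000_000, dtype=np.uint64)
conn.execute("INSERT INTO arrays (id, dtype, data) VALUES (?, ?, ?)",
             (1, str(arr.dtype), arr.tobytes()))

def fetch_slice(conn, row_id, start, length):
    """Fetch arr[start:start+length] without transferring the whole BLOB."""
    dtype = np.dtype(conn.execute(
        "SELECT dtype FROM arrays WHERE id = ?", (row_id,)).fetchone()[0])
    # substr() is 1-indexed and byte-based on BLOBs, so scale by itemsize.
    byte_start = start * dtype.itemsize + 1
    byte_len = length * dtype.itemsize
    blob = conn.execute("SELECT substr(data, ?, ?) FROM arrays WHERE id = ?",
                        (byte_start, byte_len, row_id)).fetchone()[0]
    return np.frombuffer(blob, dtype=dtype)

chunk = fetch_slice(conn, 1, 50_000, 4)  # elements 50000..50003
```

This only works cleanly for 1D contiguous data; for multi-dimensional arrays the shape and strides would have to be stored and handled as well.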

Is there a better way to do this?

  • For large arrays, it is not a good idea to embed the NumPy array in the SQLite database. This makes the DB requests far less efficient (most SQL databases are not optimized for that) and harder to migrate. I think the best option is to store a reference to an external file (possibly a unique one depending on your needs). This also improves the performance of slicing. Commented Jun 27, 2022 at 23:17
  • While this can certainly be done, it is probably a bit cumbersome, as you need the strides information as well as the datatype to extract sliced data, unless you have 1D data. Also, you can certainly peek here (even if it is C++, the idea is the same): stackoverflow.com/questions/3005231/… Commented Jun 28, 2022 at 8:14
  • Would "huge" mean that it would not fit into memory, or just some big array you do not really want to load entirely? Commented Jun 28, 2022 at 14:43
  • Each entry could be 1M to 10M data points (int8 or int16), and sometimes I'm only interested in, e.g., 50 000 data points. Of course, I could use HDF5, but these files get corrupted easily when creating them, are not as easy to use with parallel threads, and I think are somewhat more limited when it comes to multi-dimensional data and searching (since no SQL-like querying exists). Commented Jun 28, 2022 at 17:26
  • What about using parallel HDF5 then? HDF5 seems clearly better suited for the mentioned usage. With a reference to an HDF5 entry you can benefit from the SQL request and fast slicing (without making your database slower -- considering SQLite is already known to be pretty slow). Note that if the extracted points are not stored contiguously, then it should be faster to read the whole array (especially on HDD storage). Commented Jun 28, 2022 at 19:52
