Both are good questions. I can't give a precise answer without knowing more about your data and workflows. (Note: The HDF Group has a good overview you might want to review here: Introduction to HDF5. It is a good place to learn about the possibilities of schema design.) Here are the things I would consider in a "thought experiment":
The best structure:
With HDF5, you can define any schema you want (within limits), so the best structure (schema) is the one that works best with your data and processes.
- Since you have an existing CSV file format, the simplest approach is to create an equivalent NumPy dtype and reference it to create a recarray that holds the data (see the sketch after this list). This would mimic your current data organization. If you want to get fancier, here are other considerations:
- Your datatypes: are they homogeneous (all floats or all ints), or heterogeneous (a mix of floats, ints and strings)? You have more options if they are all the same. However, HDF5 also supports mixed types as compound data.
- Organization: How are you going to use the data? A properly designed schema will help you avoid data gymnastics in the future. Is it advantageous (to you) to save everything in 1 dataset, or to distribute across different datasets/groups? Think of data organized in folders and files on your computer. HDF5 Groups are your folders and the datasets are your files.
- Convenience of working with the data: similar to organization. How easy or hard is it to write versus read? It might be easiest to write the data as you receive it, but is that a convenient format when you want to process it?
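For example, here is a minimal sketch of the recarray approach, assuming a hypothetical CSV layout (timestamp, sensor id, reading, and a short status flag). The field names, sizes, file name, and group layout are illustrative assumptions only; it uses h5py for the write, but the same recarray works with PyTables:

```python
import numpy as np
import h5py

# Hypothetical CSV layout: timestamp, sensor id, reading, short status string.
csv_dtype = np.dtype([
    ("time", "f8"),
    ("sensor_id", "i4"),
    ("reading", "f8"),
    ("flag", "S8"),        # fixed-length byte string
])

# A recarray that mimics one block of CSV rows
block = np.zeros(100, dtype=csv_dtype).view(np.recarray)
block.time = np.arange(100) * 0.1
block.sensor_id = 7
block.reading = np.random.random(100)
block.flag = b"OK"

# Groups act like folders, datasets like files
with h5py.File("example.h5", "w") as h5f:
    grp = h5f.create_group("run_001")          # e.g., one group per acquisition run
    grp.create_dataset("stream", data=block)   # compound (mixed-type) dataset
```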
How should I write the data?
There are several Python packages that can write HDF5 data. I am familiar with PyTables (aka tables) and h5py. (Pandas can also create HDF5 files, but I have no experience to share.) Both packages have similar capabilities, and some differences. Both support the HDF5 features you need (resizable datasets, homogeneous and/or heterogeneous data). h5py attempts to map the HDF5 feature set to NumPy as closely as possible. PyTables has an abstraction layer on top of HDF5 and NumPy, with advanced indexing capabilities to quickly perform in-kernel data queries. (Also, I found PyTables I/O to be slightly faster than h5py's.) For those reasons, I prefer PyTables, but I am equally comfortable with h5py.
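To make the PyTables points concrete, here is a small sketch that creates a Table (PyTables' resizable container for compound data) and runs an in-kernel query with read_where(). The file name, field names, and query condition are made-up examples:

```python
import numpy as np
import tables as tb

# Same style of compound dtype as above (illustrative field names)
csv_dtype = np.dtype([("time", "f8"), ("sensor_id", "i4"),
                      ("reading", "f8"), ("flag", "S8")])

with tb.open_file("example_pt.h5", "w") as h5f:
    # A Table is resizable: rows can be appended as they arrive
    tbl = h5f.create_table("/", "stream", description=csv_dtype,
                           expectedrows=100_000)
    block = np.zeros(100, dtype=csv_dtype)
    tbl.append(block)          # append a NumPy structured array in one call

    # In-kernel query: the condition is evaluated inside PyTables/HDF5,
    # so only matching rows are returned to Python
    hits = tbl.read_where("reading > 0.5")
```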
How often should I write: every 1 or N iterations, or once at the end?
This is a trade-off of available RAM vs required I/O performance vs coding complexity. There is an I/O "time cost" with each write to the file. So, the fastest process is to save all data in RAM and write once at the end; that means you need enough memory to hold a 15-minute datastream. Writing every N iterations is the middle ground: you only buffer N rows in RAM and pay the I/O cost once per batch (see the sketch below). I suspect memory requirements will drive this decision. The good news: PyTables and h5py will support any of these methods.
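As a rough sketch of the batched option (buffer N rows in RAM, then append to a resizable dataset), here is one way to do it with h5py. The buffer size, dtype, and loop below are placeholders for your actual acquisition loop:

```python
import numpy as np
import h5py

N_FLUSH = 1000                 # assumption: flush after buffering this many rows
row_dtype = np.dtype([("time", "f8"), ("reading", "f8")])

with h5py.File("stream.h5", "w") as h5f:
    # Resizable dataset: starts empty, grows along axis 0 as data arrives
    dset = h5f.create_dataset("stream", shape=(0,), maxshape=(None,),
                              dtype=row_dtype, chunks=True)
    buffer = []
    for i in range(15_000):                     # stand-in for the acquisition loop
        buffer.append((i * 0.1, np.random.random()))
        if len(buffer) >= N_FLUSH:              # write every N iterations
            arr = np.array(buffer, dtype=row_dtype)
            dset.resize(dset.shape[0] + len(arr), axis=0)
            dset[-len(arr):] = arr
            buffer.clear()
    if buffer:                                  # final partial batch
        arr = np.array(buffer, dtype=row_dtype)
        dset.resize(dset.shape[0] + len(arr), axis=0)
        dset[-len(arr):] = arr
```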