Assessing the usefulness of HDF5 in bioinformatics

The HDF file format (http://www.hdfgroup.org/) has been around for decades, and more recently a bioinformatics-flavored version of HDF5, BioHDF, has been advertised (http://www.hdfgroup.org/projects/biohdf/).

On paper, HDF5 appears to offer interesting features, especially the self-contained file system for organizing datasets into groups. However, AFAIK this feature and others can be replicated at low or no cost with existing technologies (e.g., .zip files).

As such, I am wondering: why would anyone want to use HDF5 in bioinformatics? I have yet to find a compelling reason to use this technology, however cool it sounds. Does anyone have a use case or good arguments?

There are a couple of scenarios where I’ve tested it and found it helpful. Note that I don’t use HDF5 extensively yet; I’m still at the testing stage. Others who do use it extensively will be able to give a more detailed answer.

  1. Retrieving arbitrary slices of a 3D cube seems very fast. My understanding is that if you want to retrieve an xz or yz slice from a TIFF file, you have to open every plane. With HDF5 you can retrieve arbitrary slices, including ones that cut across the stored planes, very quickly.

  2. Retrieving an arbitrary 3D ROI is much quicker, for the same reason: you don’t have to open the entire file. This is really useful when processing an image that is too big to fit in RAM.
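To make the two points above a bit more concrete, here is a minimal pure-Python sketch (it does not use the HDF5 API) of why a chunked layout helps. It assumes a hypothetical 512 x 512 x 512 volume stored in 64 x 64 x 64 chunks in (z, y, x) order; the function name `chunks_touched` and all sizes are made up for illustration. It only counts which chunks a requested region overlaps.

```python
from itertools import product


def chunks_touched(roi, chunk_shape):
    """Return grid indices of all chunks that a half-open ROI overlaps.

    roi: one (start, stop) pair per axis, stop exclusive.
    chunk_shape: chunk edge length per axis.
    """
    ranges = []
    for (start, stop), size in zip(roi, chunk_shape):
        if stop <= start:
            return []  # an empty ROI touches no chunk
        first = start // size        # chunk containing the first voxel
        last = (stop - 1) // size    # chunk containing the last voxel
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Assumed layout: 512^3 voxels, 64^3 chunks, axes ordered (z, y, x).
chunk = (64, 64, 64)
total_chunks = (512 // 64) ** 3

xz_slice = [(0, 512), (100, 101), (0, 512)]   # one xz plane at y = 100
small_roi = [(10, 50), (70, 130), (0, 64)]    # a small 3D region of interest

print(len(chunks_touched(xz_slice, chunk)), "of", total_chunks)  # 64 of 512
print(chunks_touched(small_roi, chunk))  # [(0, 1, 0), (0, 2, 0)]
```

Under these assumptions the xz slice overlaps 64 of the 512 chunks, whereas a stack stored as one 2D image per z plane would need all 512 planes opened to assemble the same slice.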

The BigDataViewer, BigVolumeViewer, and imglib2-cache make use of HDF5. I think it is possible to use other formats with these tools, but it is much, much slower.

Extracting a particular view can also be done with the RandomAccessibleInterval construct of imglib2, for any file format.

The HDF5 format is very useful for saving results from Ilastik. In classification and segmentation tasks, the output contains segmentation/probability maps for the classes the user defined in their Ilastik project, and the h5 file stores every class as an image with the same dimensions as the original data.

Once you have this large amount of information in h5 format, you can use tools like BigDataViewer, which use the constructs of imglib2, to access only the information or the view of the image you want. In the case of Ilastik you can, for example, access only the class of interest for a particular task.


That’s not true in general. As long as a format stores images in blocks that can be individually accessed, and stores multi-resolution pyramids, it is just as fast. Some examples are N5, the Imaris file format (which is an HDF5 variant), and to some extent KLB (doesn’t have multi-resolution) and CATMAID (stores 2D tiles, not 3D blocks).
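As a rough illustration of the multi-resolution part, here is a small pure-Python sketch of how a viewer might pick a pyramid level. It is not taken from any of the formats or tools named above; the downsampling factor of 2, the function names, and the selection rule (coarsest level that still has at least as many pixels as the viewport on each axis) are all assumptions.

```python
def pyramid_shapes(shape, levels, factor=2):
    """Shapes of successive pyramid levels, each downsampled by `factor`."""
    shapes = [tuple(shape)]
    for _ in range(levels - 1):
        # ceiling division so odd sizes never shrink to zero
        shapes.append(tuple(max(1, -(-s // factor)) for s in shapes[-1]))
    return shapes


def pick_level(shapes, viewport):
    """Coarsest level that still covers the viewport pixel for pixel."""
    best = 0
    for level, shape in enumerate(shapes):
        if all(s >= v for s, v in zip(shape, viewport)):
            best = level
    return best


# Hypothetical 4096 x 4096 image plane shown in an 800 x 600 window.
shapes = pyramid_shapes((4096, 4096), levels=5)
level = pick_level(shapes, (800, 600))
print(shapes[level])  # (1024, 1024)
```

Combined with block-wise storage, the viewer then only has to load the blocks of that one level that fall inside the current view.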

Regarding the use of HDF5 as the BigDataViewer format:
We chose it exactly because it provides the capabilities mentioned above: chunked datasets (blocked images) and many datasets in one file (filesystem in a file, storing multiple resolutions, timepoints, channels, …).
It has some serious drawbacks that make me doubt whether we would choose it again. In particular, it doesn’t support multithreaded writing and it has no journaling, i.e., if the computer crashes while writing/modifying a huge HDF5 file, it is likely that the whole file is unrecoverably corrupted.

The default implementation of N5 stores image blocks in individual files, which conveniently delegates these capabilities (multithreaded writing, journaling) to the file system. From my point of view, the only drawback is that you end up with millions of small files, which makes it cumbersome to copy datasets etc. In this regard, I like HDF5 better, where you can split the data into a few files.
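To show the idea of one file per block, here is a minimal sketch using only the Python standard library. It is not the N5 format or its API: the directory layout (one nested directory per grid index), the `write_block` helper, and the dummy payloads are assumptions made for illustration. The point is only that independent files can be written from several threads without any coordination inside the format itself.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_block(root, index, payload):
    """Write one block to its own file at root/z/y/x (hypothetical layout)."""
    path = Path(root).joinpath(*map(str, index))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path


root = tempfile.mkdtemp()

# Dummy 2 x 2 x 2 grid of blocks; each payload just encodes its own index.
blocks = {
    (z, y, x): bytes([z, y, x]) * 4
    for z in range(2) for y in range(2) for x in range(2)
}

# Each block is a separate file, so the writes do not contend with each other.
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(lambda kv: write_block(root, *kv), blocks.items()))

print(len(paths))  # 8
```

Even this toy grid produces eight files, which hints at how quickly the file count grows for a real multi-resolution, multi-timepoint dataset.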

Also, I should add that the BDV perspective is a bit limited, because the BDV file format only uses a fraction of what HDF5 has to offer (structured data besides images, metadata, …).

Regarding the original question: one advantage of HDF5 over storing data in zip files etc. is certainly that it has a system-independent definition of datatypes and how they are stored in binary form. E.g., issues such as endianness across different architectures are solved, the metadata for a dataset in an HDF5 file tells you which datatype it is, etc. You would have to think about, define, and implement all of this yourself in a homebrewed zip-based format.
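A small sketch of what that means in practice, using only Python's `struct` module. The one-byte order tag and the `encode`/`decode` helpers are a made-up mini format, not how HDF5 lays out its metadata; they just show the kind of bookkeeping a homebrewed format has to carry so that readers on any machine decode the bytes the same way.

```python
import struct


def encode(values, byteorder):
    """Pack 32-bit ints with a 1-byte order tag and a 4-byte count (made-up layout)."""
    tag = "<" if byteorder == "little" else ">"
    header = tag.encode("ascii") + struct.pack("<I", len(values))
    return header + struct.pack(f"{tag}{len(values)}i", *values)


def decode(blob):
    """Read the order tag first, then unpack the payload accordingly."""
    tag = blob[:1].decode("ascii")
    (count,) = struct.unpack("<I", blob[1:5])
    return list(struct.unpack(f"{tag}{count}i", blob[5:]))


values = [1000, -5, 7]
from_big = encode(values, "big")
from_little = encode(values, "little")

# The payload bytes differ, but the tag lets any reader recover the same numbers.
print(decode(from_big) == decode(from_little) == values)  # True

# Without such a tag, guessing the wrong byte order silently gives garbage.
raw = struct.pack("<i", 1000)
misread = struct.unpack(">i", raw)[0]
print(misread == 1000)  # False
```

And this only covers byte order for one integer type; floats, strings, compound types, and array shapes would each need the same treatment.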
