Next Generation File Formats for BioImaging

Today, the Chan Zuckerberg Initiative announced their first round of Essential Open Science Software projects, and we’re thrilled to have had Bio-Formats selected with an application on next generation file formats.

A part of that funding will help to support the image.sc community with their existing file format issues. There are billions of files left in proprietary formats that still need to be read and that situation will remain until they can all be converted to something common, open, and stable.

The other part of the funding, however, focuses on the design of such a next-generation file format that we’d like to discuss here. Our blog post this summer, OME’s position regarding file formats, outlined our reasoning. The feedback – likes, re-links, and personal comments – has been encouraging.

With the announcement today, it’s time to consider how best to move forward with the design and general consensus building. How does everyone have their voice heard, from vendors in #industry wanting a format that can be written performantly to users wanting something that saves them time and headaches?

Our first goal will be a design proposal like OME005 - which specified pyramidal OME-TIFFs, leading to a release last year. However, there are any number of preliminary discussions before a design proposal is ready. That discussion for OME005 mostly took place on a GitHub issue. But this next step needs to be as inclusive as possible, and image.sc seems the ideal place to do so (rather than, say, a slack that everyone needs to join).

This will be a bit of an experiment to find a way to use discourse that works for everyone. Options that we’re considering:

  • An open group that everyone can join and @-mention
  • A tag that everyone can follow
  • A discourse chat plugin

Anyone interested in getting involved is encouraged to jump into this thread. We’d encourage a certain chattiness (think gitter or zulip).

We’ll also be looking for ways to cross-link conversations to non-image.sc groups. For example, the Zarr team had a coordinated application funded, focused on building a new spec developed in concert with the n5 devs, leading hopefully to a useful base for a future imaging format. Regular calls, etc. will be organized as necessary.

This is all a great and exciting first step on a path to more common, open file-formats that we’re very much looking forward to.

The @OMETeam

24 Likes

You can count on us (Zeiss) and me, when it comes to being open for a real discussion.
It is crystal clear that the lack of a suitable open data format (having in mind the latest trends in microscopy, image processing and data analysis) that fulfills the requirements of both academia and industry will ultimately limit everybody.
I am aware of the different expectations that everybody has when it comes to the topic of data formats, so listening to everybody would be a first big step.
The question for me is what is the right way to facilitate such a discussion?

13 Likes

I’d love to take part in this discussion too and make sure #napari is able to provide high performance visualization of whatever next generation file format the community comes up with.

I’m open to any of the public discussion forums. I might shy away from too much of the “real-time” chats (gitter / zulip) except for logistics and coordination, as they can be harder to consume asynchronously.

5 Likes

Sounds great! Dealing with user problems with format compatibility in QuPath has always been a drag, and a unified file structure would be amazing. I am a complete novice when it comes to image formats, so my first questions are pretty basic:

Do the various file formats have benefits over each other from a fundamental data structure standpoint? Or, will some file formats always “work better” (more efficient access/file size-wise) in certain situations? Does the hardware used to collect an image ever impact some sort of need to alter the structure of the file format (tiling/stitching)? As much as it would be great to have a single unifying file format, it would be unfortunate if it wasn’t flexible enough (or didn’t load fast enough, which might be a concern in the complete opposite direction!) for image collection instruments currently in development. Whatever those might be.

Anyway, just some thoughts from a novice, and congratulations on the funding!

2 Likes

The Advanced Imaging Center and I would love to be involved in this process. The variety and scale of data we produce is perfect for trying out a new file format. Please keep me informed of this endeavor. We would love to help test and/or help code.

5 Likes

Congrats on the funding for this valuable effort!

I would love to improve ITK’s support for processing images based on an open standard with good support for parallel processing, n-dimensional images, and metadata preservation.

The Image.SC forum and GitHub issue tracker are great communication mediums. These are my personal preference over chat systems because they are easily searchable and can be read and written asynchronously.

4 Likes

That is fantastic news! I can only chime in with sebi06: We (Zeiss) are very much looking forward to being part of this. A certain balanced mix of asynchronous and person-to-person (virtual/in-person) exchange would be good. I’m personally fine with whatever tool is proposed as the communication channel.
Just looked through the Zarr team meeting minutes - you guys have quite some momentum. Need to catch up first. Is there a fast-track to get up-to-date?

2 Likes

Hi swg08, there is a blog post here with some info on current status of zarr spec work, may be useful: https://zarr-developers.github.io/zarr/specs/2019/06/19/zarr-v3-update.html

Fantastic to hear bioformats got funded, looking forward to hopefully joining up efforts where we can.

2 Likes

Thanks to everyone for all the feedback & engagement already.

CC’ing a few names that I’ve been collecting for when this thread started and that I haven’t seen here yet:
@heeler @jni @stefanv @Caterina @norio.kobayashi @Christian_Tischer @axtimwalde @tpietzsch (More to come)

as well as the same sentiment from @swg08, :100: – having the involvement of #industry will be essential. Huge appreciation to anyone who can use their contacts to point people to this topic.

Agreed. For the moment, I think speaking up on this thread is a good place to start and we can split off topics as necessary. That will at least get everyone CC’d. From our side, regular reports will be posted on image.sc, and impromptu or even regular calls are likely a must (though dealing with time zones will be interesting).

One of the first steps is likely a catalog of requirements & current solutions. Discussing those here (for example, subresolutions and how they are currently being handled) makes sense, but we will need to build up a representation of the consensus. Something tracked and editable like a wiki would be my first suggestion.
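To make the subresolution example concrete, here is a minimal sketch of one way a resolution pyramid could be built: each level halves the previous one by averaging non-overlapping 2x2 blocks. The function names, the averaging rule, and the handling of odd-sized edges are assumptions for illustration only; existing formats and tools make different choices here, which is exactly the kind of requirement a catalog would need to record.

```python
from typing import List

Image = List[List[float]]  # a 2D image as nested lists (illustrative only)


def downsample_2x(image: Image) -> Image:
    """Halve each dimension by averaging non-overlapping 2x2 blocks.

    Assumption: a trailing odd row or column is simply dropped.
    """
    rows = len(image) // 2
    cols = len(image[0]) // 2
    return [
        [
            (
                image[2 * r][2 * c]
                + image[2 * r][2 * c + 1]
                + image[2 * r + 1][2 * c]
                + image[2 * r + 1][2 * c + 1]
            )
            / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image: Image, min_size: int = 1) -> List[Image]:
    """Return [full resolution, 1/2, 1/4, ...] until a side would drop below min_size."""
    levels = [image]
    while (
        len(levels[-1]) // 2 >= min_size
        and len(levels[-1][0]) // 2 >= min_size
    ):
        levels.append(downsample_2x(levels[-1]))
    return levels


if __name__ == "__main__":
    img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    for level in build_pyramid(img):
        print(len(level), "x", len(level[0]))
```

A viewer would then pick the coarsest level that still fills the screen, which is why how the levels are stored and indexed matters as much as how they are computed.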

Oh wow. Jumping right into it! :clap: But agreed. I’m skeptical that a single thing will work. I’d think the trick will be to have the smallest number of formats/layouts that covers the requirements. And conversions between all the supported formats! (This is already a strategy that has been replayed a couple of times for Bio-Formats, BDV, and probably several others.)

:+1: Suggestions welcome on cadence and venues.

4 Likes

Congrats :confetti_ball::balloon::champagne: on the funding. Super excited :laughing: that this project has been funded, the topic is being discussed, and this issue is finally being addressed. I am happy to be part of this conversation, especially concerning the metadata side of things. It would be great to have Micromanager peeps involved. Also, I would love to see a serious discussion on what role the idea of Adaptive Particle Representation could have here.

2 Likes

Also pinging @dsudar, as I have not seen that name here yet

Hi Josh
Good topic. Good reason for me to sign up to the forum. At Bitplane we are very interested in contributing to improve the future of storage. File conversions are so obviously painful that we should make an effort to try to get rid of as many as we can. I will be happy to contribute.

4 Likes

Great! Happy to contribute the N5 perspective including seamless bridging between N5 and Zarr as well as N5 and HDF5.

4 Likes

I am interested both from the scikit-image perspective, and as someone who—in collaborating with others—has to store and ingest various types of imaging data.

Having a bunch of test cases (images with usage requirements) would help shape the discussion around, e.g., metadata; looks like you have a good group for that here already, and I appreciate that you are trying to widen the catch net further.

While “Bio” is a keyword, I think you may (perhaps accidentally!) partially solve similar problems shared by material science and other fields.

From the scikit-image perspective, we would be happy to help build tooling so that users can easily import and manipulate the new format(s).

I really like Zarr, and my primary concern right now is that it allows too much freedom. There are too many compressors, options for chunking / splitting storage, etc. So, paring those options down to best suit this specific application will be a great contribution and a helpful example.
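As a rough illustration of what paring the options down could look like, the sketch below checks array metadata against a restricted profile. The metadata keys (`shape`, `chunks`, `compressor`, `order`) follow the Zarr v2 array metadata layout; the profile itself (the allowed compressor ids, the chunk-size cap, the function name) is entirely hypothetical and not something proposed by Zarr or OME.

```python
from typing import Any, Dict, List

# Hypothetical profile values, chosen only for illustration.
ALLOWED_COMPRESSORS = {"blosc", "zlib", None}  # None means uncompressed
MAX_CHUNK_ELEMENTS = 16 * 1024 * 1024          # assumed cap on elements per chunk


def check_profile(meta: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the array conforms."""
    problems: List[str] = []

    # Compressor: only a short, fixed list of codecs is accepted.
    compressor = meta.get("compressor")
    comp_id = compressor.get("id") if compressor else None
    if comp_id not in ALLOWED_COMPRESSORS:
        problems.append(f"compressor {comp_id!r} is not in the profile")

    # Chunking: one chunk length per dimension, and a bounded chunk size.
    shape, chunks = meta["shape"], meta["chunks"]
    if len(shape) != len(chunks):
        problems.append("chunks and shape have different numbers of dimensions")
    else:
        elements = 1
        for c in chunks:
            elements *= c
        if elements > MAX_CHUNK_ELEMENTS:
            problems.append(f"chunk holds {elements} elements, above the cap")

    # Memory layout: the profile fixes a single ordering.
    if meta.get("order") != "C":
        problems.append("only C (row-major) order is in the profile")

    return problems


if __name__ == "__main__":
    example = {
        "shape": [1, 3, 2048, 2048],
        "chunks": [1, 1, 512, 512],
        "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5},
        "order": "C",
    }
    print(check_profile(example) or "conforms")
```

A reader or writer library could run a check like this up front and refuse, or convert, anything outside the profile, so that downstream tools only ever need to support the narrowed set.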

W.r.t. discussion media, +1 on longer fora such as this one or GitHub that allow us to easily track what has been written.

2 Likes

Congratulations, that’s very good.
From a WSI POV, there’s still no suitable open 16-bit compressed tiled WSI format, correct?
To get something for this would be great - and maybe make things like jpeg-xr obsolete?

Regards,
Manuel

2 Likes

Probably worth giving TileDB a serious look too - main repo. I think their core is C and they have a lot of cross-language support already (Python, Java, R, Go, C/C++).

Fantastic that this huge gap is getting some funding. It’s a daunting problem for us in EBI’s Functional Genomics team. It is not only the raw image format problem: the lack of harmonised metadata and analysed data format standards also makes it difficult to integrate data. This is technically easy to solve but it’s difficult to reach community consensus. Hopefully a chatty, inclusive forum can get this work done.

Integrating datasets of the same modality from different labs is one of the challenges we have, but eventually the community will also need to integrate across multiple modalities for multi-omic datasets. So following best metadata practice now would make this much easier to do in the future.

I’m more than happy to help. Just let me know what you need.

So interesting!
I have to handle a wide variety of user data (size, scale, etc.) and help users analyse it with different open-source and licensed software, coming from different industrial microscopes. It would be a pleasure to help at my level as a microscopy platform engineer ;). As I’m not a bio-image analyst, maybe I could help as a tester; let me know if you need one.