Data archiving: what format to use, which files to keep?

Hi, I am a data manager for a neuroscience group and we are producing .nd2 files. I wonder how to archive this data.
As you probably know, long-term preservation comes with costs, and we cannot keep the raw data in multiple formats. We want to be sure we do not lose important information, and we want to be sure the files we archive will still be readable in 10 years (i.e. it would be better to use an open format).

Should we archive the .nd2 files, or should we export them to .ome.tiff and archive the exported files? (Or is yet another format an even better option?)


Hi @jcolomb,
welcome to the forum!
It seems to me to be a file compression problem (is the raw data taking up too much space?).
As long as you use lossless compression, you will always be able to find a tool in the future to read it. 7-Zip is a good tool for this. I tried it on my images: I lost fewer than 10 pixels out of 3 × 10^8 pixels.

If you insist on using a particular image format, I would say .ome.tiff is a good option (it is open and supported by open-source tools). Otherwise, there is HDF5, which also has built-in compression options such as gzip.
FYI, NASA uses HDF5 to store satellite images, so it is not likely to disappear in the next few decades…
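For example, writing an image stack to HDF5 with gzip compression via h5py takes only a few lines. This is just a sketch; the dataset name, chunking and the example metadata attribute are made up:

```python
import h5py
import numpy as np

# Stand-in for a real image stack already loaded as a NumPy array
stack = np.random.randint(0, 65535, size=(50, 1024, 1024), dtype=np.uint16)

with h5py.File("archive.h5", "w") as f:
    # gzip is HDF5's built-in lossless compression filter
    dset = f.create_dataset(
        "image_stack",
        data=stack,
        compression="gzip",
        compression_opts=4,        # compression level 0-9
        chunks=(1, 1024, 1024),    # one plane per chunk
    )
    # Metadata can be attached as attributes (this value is made up)
    dset.attrs["pixel_size_um"] = 0.325

# The round trip is lossless: the stored array equals the original
with h5py.File("archive.h5", "r") as f:
    assert np.array_equal(f["image_stack"][...], stack)
```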

How do you explain that? There should be no loss at all!

Well, maybe the method I used was not appropriate.
I used != on the numpy arrays, and you can specify the precision when loading an array with numpy.
It is possible that the compression itself really is lossless, but that my way of loading the arrays was error-prone?
I didn’t go further because, frankly, even if it is 10 pixels, it is nothing, at least for my analysis. I can tolerate this “error”.

The compression should be lossless with 7zip.
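An exact comparison (same dtype, no tolerance) should therefore report zero differing pixels. A minimal sketch, assuming both the original and the restored image can be read with tifffile, and using placeholder file names:

```python
import numpy as np
import tifffile

# Placeholder names: the original image and the one recovered from the archive
original = tifffile.imread("original.tif")
restored = tifffile.imread("restored.tif")

# Compare without any type conversion or tolerance
same_dtype = original.dtype == restored.dtype
n_diff = int(np.count_nonzero(original != restored))

print(f"dtype identical: {same_dtype}")
print(f"differing pixels: {n_diff} out of {original.size}")

# For a truly lossless round trip both checks must pass
assert same_dtype and np.array_equal(original, restored)
```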

Hi, thanks for the welcome and the quick response.

It is not so much a question of compression as of which data format to archive. The raw data is in .nd2 format, and we could export it to .ome.tiff (or another open format), but we would like to avoid archiving the same information twice (whatever the compression, it will amount to petabytes of data per year for a large university).

I am afraid the .nd2 format will not be readable in 10 years and is therefore not well suited for archiving. On the other hand, I do not know whether all the information is carried over to the other formats.

So, isn’t this more a question of which data management system you are using than of which format?

I am not sure what you mean by “data management system”. I am trying to figure out which files will enter the long-term archive: the .nd2 files, their open-format counterparts, or whether we need both…

(The researchers are analysing the images with FIJI. If I understood correctly, they are therefore already creating, even if not saving, .ome.tiff files in the process.)

From what you said, I think you may need to elaborate a bit more:
What is the “same information”? Metadata?

If you are just looking for a data format that will keep being supported by the community, your best bet is .ome.tiff, especially if you want to save the metadata. However, make sure that all metadata is actually transferred during the conversion from .nd2 to .ome.tiff, because that is often not the case…
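One way to get a first impression of what survives the conversion is to compare the OME-XML that Bio-Formats extracts from both files. A rough sketch, assuming the Bio-Formats command-line tools (bfconvert, showinf) are installed and on the PATH, with placeholder file names:

```python
import difflib
import subprocess

src = "experiment.nd2"        # placeholder original file
dst = "experiment.ome.tiff"   # placeholder converted file

# Convert with Bio-Formats' bfconvert
subprocess.run(["bfconvert", src, dst], check=True)

def ome_xml(path):
    # showinf -omexml -nopix prints the metadata without reading pixel data
    result = subprocess.run(
        ["showinf", "-omexml", "-nopix", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Crude check: diff the two metadata dumps and inspect what changed
for line in difflib.unified_diff(
        ome_xml(src).splitlines(), ome_xml(dst).splitlines(), lineterm=""):
    print(line)
```

Keep in mind this only compares what Bio-Formats itself reads from the .nd2; metadata that Bio-Formats does not parse in the first place will not show up in either dump.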

FYI, I think what you are looking for is something the community is currently working on. Let’s hope @Virginie can find someone appropriate for this job :wink:
https://www.embl.de/jobs/searchjobs/index.php?ref=EBI01462&newlang=1&srch_trm=data

That is exactly the point. I suppose the pixel data is transferred easily; I am most worried about losing metadata, and since I have no experience with this topic, I wonder how I can make sure all the metadata will be transferred.
This probably also means I will not be able to leave this task to the scientists and will need to do the conversion/check myself?

If all the metadata matters, I would recommend 7-zipping the raw .nd2 files you have, rather than doing any conversion.
Or if you want to DIY, you may want to check this topic
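If you go that way, a sketch of the archive-and-verify workflow could look like this, assuming the 7z command-line tool is installed (archive and file names are placeholders); the checksum comparison confirms the round trip is byte-for-byte identical:

```python
import hashlib
import subprocess
from pathlib import Path

def sha256(path):
    # Hash in chunks so large .nd2 files do not need to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

src = Path("experiment.nd2")          # placeholder original file
archive = Path("experiment.nd2.7z")   # placeholder archive name

# Create the archive and run 7z's own integrity test
subprocess.run(["7z", "a", str(archive), str(src)], check=True)
subprocess.run(["7z", "t", str(archive)], check=True)

# Extract to a scratch folder and compare checksums with the original
subprocess.run(["7z", "x", str(archive), "-oscratch", "-y"], check=True)
assert sha256(src) == sha256(Path("scratch") / src.name), "round trip is not identical"
```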


Hi all,

:+1: for this conversation in general. Even if there’s not yet an ideal solution, it’s great to hear what everyone needs/wants/etc.

A few points:

  • When it comes to capturing all the data of the .nd2, I’m comfortable saying that OME-TIFF can safely archive all the pixels for 10 years (a quick pixel-equality check is sketched after this list).
  • The issue with archiving the metadata is that we don’t know if Bio-Formats is even capturing all the metadata that you know to be in the files. This is a wider issue that we will need vendor support to address long-term.
  • We certainly will try to maintain support for all current files in Bio-Formats over the next decade, but that’s obviously a daunting proposition. If this is the route you choose to take, other than helping us to maintain/fund Bio-Formats, probably the most important thing you can do is provide samples of this data to our regression suite.
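For the pixel side of that first point, a rough sanity check might look like the following. This is only a sketch, assuming the `nd2` Python package and `tifffile` can read your files, and using placeholder file names:

```python
import numpy as np
import nd2        # reader for Nikon .nd2 files
import tifffile

# Placeholder names: the original file and its OME-TIFF export
original = np.squeeze(np.asarray(nd2.imread("experiment.nd2")))
converted = np.squeeze(np.asarray(tifffile.imread("experiment.ome.tiff")))

print("original:", original.shape, original.dtype)
print("converted:", converted.shape, converted.dtype)

# Both checks must pass for the pixel data to have survived unchanged
assert original.shape == converted.shape, "shapes differ; check the axis order"
assert np.array_equal(original, converted), "pixel values differ after conversion"
```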

I agree that HDF5, like TIFF, is a stable container format, but there currently isn’t a specification for bioimages in HDF5 that will necessarily be interpretable over this time span. I’d suggest some discussion/planning (and/or involvement in upcoming specification work!) before going this route.

This is exactly the problem. It would be great if we could work together to determine if everything is being extracted from the ND2s. If not, that’s something that can (and should) be accomplished before you begin a transformation process.

It’s indeed not easy. One option is to make use of the existing vendor SDK for a file format and see if metadata is accessible/exportable that Bio-Formats is not aware of. Another option is to compare the vendor GUI with what you see in Fiji. Longer-term we need a mechanism for all vendors to export an open, interpretable record of stored metadata.

~Josh


Hi again,
I have been looking into BIDS, normally used for neuroimaging data.
They mostly keep both the original data file (in a “sourcedata” folder) and the open-format one (NIfTI format, in a “raw data” folder).

I still think that is an expensive solution, and will try to look for a better one.

Have you ever tried to reach out to the vendors? Do you have a strategy there?

For what it’s worth, OMERO only stores the original data (with some exceptions) but that is a situation that has its own downsides. The most sustainable solution will be a common format that meets everyone’s needs.

There have been discussions (and in most cases a sizable amount of good will) for years. When on the market for an acquisition system, purchasers should ask for an open format. If and when there are specifics for why that’s not possible, those will need to be addressed in updated and if necessary new open formats.

~Josh

Hi,

Just a couple of thoughts.

Losing immediate functionality
If you’re not storing the proprietary format, you are losing important functionality.

As annoying as proprietary formats are, they store the metadata in a way that is compatible with the acquisition equipment.

This allows users to return to the microscope and easily replicate the setup from the file. This is not possible if the file has been converted.

There is no universal format
Each file format organises the metadata in a way that is convenient for that instrument. This is the schema, and it reflects how the manufacturers choose to conceptualize the imaging process.

For example, think about describing the filters on a fluorescence microscope. One maker might use filter cubes, so it makes sense to group the beam splitter and the emission and excitation filters into a single unit. Another may use filter wheels, and yet another may use multiple beam splitters and emission filters going to multiple cameras.

A common file format is going to have to employ a mega-schema that can describe all these situations, whilst the manufacturers use a minimal one that suits their equipment.

Working with the mega-schema is cumbersome, as it will be full of details that are irrelevant most of the time. And as pointed out above, maintaining the mappings to the mega-schema is a Sisyphean task; there will always be parts missed.
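To make that concrete, here is a purely hypothetical sketch (the field names are invented, not taken from any real vendor format or from OME) of how two vendors might describe the same optics and what a common schema has to absorb:

```python
# Hypothetical vendor A: a cube-based microscope groups everything per filter cube
vendor_a = {
    "cube_1": {"excitation": "470/40", "dichroic": "495", "emission": "525/50"},
}

# Hypothetical vendor B: filter wheels described as independent components
vendor_b = {
    "excitation_wheel": ["470/40", "560/40"],
    "dichroic": "495",
    "emission_wheel": ["525/50", "620/60"],
}

def to_common_schema(vendor, meta):
    """Map either vendor layout onto a single 'mega-schema' of light paths.

    Every field has to be optional because no single vendor uses them all,
    and a new mapping has to be written and maintained for every vendor."""
    if vendor == "A":
        return [
            {"excitation": c["excitation"], "dichroic": c["dichroic"],
             "emission": c["emission"], "cube_id": name, "wheel_position": None}
            for name, c in meta.items()
        ]
    if vendor == "B":
        return [
            {"excitation": ex, "dichroic": meta["dichroic"],
             "emission": em, "cube_id": None, "wheel_position": i}
            for i, (ex, em) in enumerate(
                zip(meta["excitation_wheel"], meta["emission_wheel"]))
        ]
    raise ValueError(f"no mapping defined for vendor {vendor!r}")

print(to_common_schema("A", vendor_a))
print(to_common_schema("B", vendor_b))
```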

Maintain a computer environment to translate the file
Rather than keeping the original and a converted one, you could think about keeping the original file and maintaining an environment that can translate it.

I think it should be possible to set up a virtual machine or container that runs a program (Bio-Formats / ImageJ) to convert the raw image. We have virtual machines running Windows XP software that is well over 10 years old.

The other advantage is that, as Bio-Formats improves over time, less data is lost at the conversion stage. This is pretty much how OMERO works.

Cheers,

Chris
