Mirroring IDR .zarr datasets

Hello!

I frequently use zarr datasets to test napari. @joshmoore, @will-moore, and others in the OME team have been doing super cool work serving up datasets to browse remotely. For example, if you pip install napari ome-zarr, you can do:

napari https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr

and browse that dataset in napari. You can also do:

ome-zarr download https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr

to download it locally, then napari 6001240.zarr to open it locally (much better performance).

One issue I’ve run into, though, is that the latency from the UK to Australia is a killer. So although I’ve been able to download (some of) these datasets to try things out locally, it’s very hard to use napari with the remote copy, and I’ve started to work with our local research cloud to host these closer to (my) home.
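To make the latency point concrete, here is a rough back-of-the-envelope sketch. It assumes each Zarr chunk is fetched with its own HTTP request and that requests are issued one after another; the array shape, chunk shape, and 0.3 s round-trip time are made-up illustration values, not measurements of any IDR dataset or of napari's actual fetching strategy.

```python
import math


def count_chunks(shape, chunks):
    """Number of chunks covering an array: product of ceil(dim / chunk) per axis."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


def sequential_fetch_seconds(n_chunks, round_trip_s):
    """Time spent purely waiting on round trips if chunks are fetched one by one."""
    return n_chunks * round_trip_s


# Hypothetical 5D array (t, c, z, y, x), chunked into 256x256 tiles per plane.
chunks = (1, 1, 1, 256, 256)
whole_volume = count_chunks((1, 2, 100, 1024, 1024), chunks)
one_plane = count_chunks((1, 1, 1, 1024, 1024), chunks)

# Assumed round-trip time of 0.3 s (illustrative only).
wait_for_one_plane = sequential_fetch_seconds(one_plane, 0.3)
```

Under these assumptions a single plane already costs several seconds of pure waiting, which is why a mirror with a shorter round trip (or concurrent requests) matters more than raw bandwidth.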

In addition to the above (tiny test) data, there are a couple of other datasets that would be useful to test:

https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/9822151.zarr

(The SARS-CoV-2 EM volume from the tweet linked above; see also this forum post.)

and

https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/4007801.zarr

(2TB+ 3D+t volume.)

Again, thanks to the great work of the OME team, downloading these is easy. What I’d like to know is how to host them on local cloud infrastructure, so that I can do something like napari https://datasets.nectar.org.au/idr/zarr/v0.1/9822151.zarr and have reasonable performance.

I’m totally naive when it comes to s3 interfaces and serving up object stores, so the more detailed the instructions, the better!

Thank you!

For hosting them locally we use the open-source MinIO server. You can run it either as a full S3 server or as a NAS gateway onto a local filesystem, which is what we do. If you point it at a local directory such as /data, it’ll serve each subdirectory directly under /data as a bucket and treat everything below that as keys and objects.
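As a small illustration of that directory-to-bucket mapping, here is a sketch of how a file under the served root would translate into a bucket, a key, and a path-style URL. The function names and the datasets.example.org endpoint are hypothetical; this only mimics the naming scheme described above and does not call MinIO.

```python
from pathlib import PurePosixPath


def bucket_and_key(local_path, root="/data"):
    """Split a path under `root` into (bucket, key): first directory is the bucket."""
    rel = PurePosixPath(local_path).relative_to(root)
    bucket, *rest = rel.parts
    if not rest:
        raise ValueError(f"{local_path} is a bucket, not an object")
    return bucket, "/".join(rest)


def object_url(endpoint, bucket, key):
    """Path-style URL for an object; `endpoint` is a hypothetical server address."""
    return f"{endpoint.rstrip('/')}/{bucket}/{key}"


bucket, key = bucket_and_key("/data/idr/zarr/v0.1/9822151.zarr/.zattrs")
url = object_url("https://datasets.example.org", bucket, key)
```

So downloading into /data/idr/zarr/v0.1/ would reproduce the same idr/zarr/v0.1/... layout the original URLs use.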

You’ll probably want to set one or more of your buckets to public: https://docs.min.io/docs/minio-client-complete-guide.html#policy

In practice you might be able to get away with serving the directory with a plain HTTP server like Nginx. Since the S3 files are public, you can access them over plain HTTP if you know the full path, and I think Zarr may “just work”.
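For a quick local test of the plain-HTTP idea, something like the following might be enough. It is a minimal sketch using only Python's standard library rather than Nginx; the serve_directory name is made up here, and this is meant for trying things out, not for production hosting.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=0):
    """Serve `directory` over plain HTTP in a background thread.

    port=0 asks the OS for any free port; read the chosen one from
    server.server_address[1].
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Example (hypothetical path):
#   server = serve_directory("/data/idr")
#   print("serving on port", server.server_address[1])
#   ...
#   server.shutdown()
```

With a directory containing zarr/v0.1/6001240.zarr, files such as .zattrs and the chunk files become reachable at the matching URL paths. A browser-based viewer would additionally need CORS headers, which this sketch does not set.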

@joshmoore is currently generating a full filelist so you can download everything more easily.

Hope this helps!

Starting with the smaller example, download and unzip the attached file and try executing:

parallel -a 9822151.zarr.txt -j 16 -- wget -x
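If parallel and wget aren't available, roughly the same thing can be sketched in Python. This assumes the attached text file holds one URL per line (the attachment's contents aren't shown here) and mimics wget -x by recreating host/path directories locally; local_path_for, fetch, and fetch_all are illustrative names, not part of any existing tool.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def local_path_for(url):
    """Mirror `wget -x`: save under <host>/<path>, creating directories as needed."""
    parts = urlsplit(url)
    return os.path.join(parts.netloc, *parts.path.lstrip("/").split("/"))


def fetch(url):
    """Download one URL to its mirrored local path and return that path."""
    dest = local_path_for(url)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def fetch_all(list_file, jobs=16):
    """Download every URL listed in `list_file` using `jobs` worker threads."""
    with open(list_file) as fh:
        urls = [line.strip() for line in fh if line.strip()]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(fetch, urls))


# Example: fetch_all("9822151.zarr.txt", jobs=16)
```

Threads are a reasonable fit here because the work is dominated by waiting on the network rather than by computation.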

(In my case, both parallel and wget are installed with brew)

(I’m also working to finally make these buckets public, in which case this will become a simple aws s3 cp .... Keep your fingers crossed!)

Cheers,
~Josh

9822151.zarr.txt.zip (46.8 KB)

@jni

This is a great idea and one we definitely support. I hope we can work out the technical issues rapidly.

One other thing to consider is licensing of the original data. For the dataset at

https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/9822151.zarr

the licensing is CC-BY 4.0, and copyright is held by the study authors. Obviously, we need to get all that and much more metadata properly linked with the file; we have proposals pending for that work and hope to begin real work soon. For the moment, please do honor the terms of the original license. The data submitters deserve their fair share of recognition for creating these data and submitting them for publication.

Again, we’re really excited to see these ideas go forward.

Cheers,

Jason

Thanks @jrswedlow, a very good point. Can this go in the .zattrs somehow, @joshmoore?

To clarify, this is useful for speeding up the download, but otherwise equivalent to ome-zarr download ..., right?

It definitely will. We’re just working through the list of things that need adding. Top priority was to open up the binary data, hence the non-machine-readable agreement on copyright. :wink:

Correct.
~J.

Our cloud provider has made the entire idr bucket public, so listing should now work. For example, you can use tools such as rclone if you want.

Or awscli (which can be installed with pip install awscli). E.g. to list all the available Zarrs:

aws --no-sign-request s3 --endpoint-url=https://s3.embassy.ebi.ac.uk ls s3://idr/zarr/v0.1/
                           PRE 179706.zarr/
                           PRE 1884807.zarr/
                           PRE 4007801.zarr/
                           PRE 4495402.zarr/
...

or download one locally:

aws --no-sign-request s3 --endpoint-url=https://s3.embassy.ebi.ac.uk cp s3://idr/zarr/v0.1/6001240.zarr . --recursive
download: s3://idr/zarr/v0.1/6001240.zarr/.zgroup to ./.zgroup
download: s3://idr/zarr/v0.1/6001240.zarr/.zattrs to ./.zattrs
download: s3://idr/zarr/v0.1/6001240.zarr/0/0.0.10.0.0 to 0/0.0.10.0.0
download: s3://idr/zarr/v0.1/6001240.zarr/0/0.0.100.0.0 to 0/0.0.100.0.0
...
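For anyone who would rather not install awscli, the listing can probably also be done with a plain HTTP request now that the bucket is public. The sketch below assumes the endpoint accepts the standard S3 ListObjectsV2 query parameters with path-style addressing and returns XML in the usual S3 namespace; I haven't verified that against this particular server, and pagination of long listings is ignored.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(endpoint, bucket, prefix):
    """Build a path-style ListObjectsV2 URL that groups keys by '/'."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": "/"}
    )
    return f"{endpoint.rstrip('/')}/{bucket}?{query}"


def parse_prefixes(xml_text):
    """Extract the 'directory' entries (CommonPrefixes) from a list response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]


def list_zarrs(endpoint, bucket, prefix):
    """Fetch and parse one page of the listing (needs network access)."""
    with urllib.request.urlopen(list_url(endpoint, bucket, prefix)) as resp:
        return parse_prefixes(resp.read())


# Example: list_zarrs("https://s3.embassy.ebi.ac.uk", "idr", "zarr/v0.1/")
```

The entries returned correspond to the PRE lines in the aws output above.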

~Josh

Are any of these Zarr files hosted somewhere with CORS headers set to allow access from other hosts?

(Context: I was looking to construct a neuroglancer example for reading the data, on par with the napari ones.)

Hey @perlman,

Sorry, we’re still waiting to hear back from EBI about activating CORS on the bucket. Assuming that turns out not to be possible, we will need to put our own proxy in place. We’ll keep you posted.

~Josh

@perlman et al. Can you give https://idr-s3.openmicroscopy.org/idr a try as a replacement for https://s3.embassy.ebi.ac.uk/idr?

Current list to try with:

mc ls ebi/idr/zarr/v0.1
[2020-09-14 17:27:45 CEST]      0B 179706.zarr/
[2020-09-14 17:27:45 CEST]      0B 1884807.zarr/
[2020-09-14 17:27:45 CEST]      0B 4007801.zarr/
[2020-09-14 17:27:45 CEST]      0B 4495402.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001237.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001238.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001239.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001240.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001241.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001242.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001243.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001244.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001245.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001246.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001247.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001248.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001249.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001250.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001251.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001252.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001253.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001254.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001255.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001256.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001257.zarr/
[2020-09-14 17:27:45 CEST]      0B 6001258.zarr/
[2020-09-14 17:27:45 CEST]      0B 9798462.zarr/
[2020-09-14 17:27:45 CEST]      0B 9822151.zarr/
[2020-09-14 17:27:45 CEST]      0B 9822152.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836831.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836832.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836833.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836834.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836835.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836836.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836837.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836838.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836839.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836840.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836841.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836842.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836843.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836844.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836845.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836846.zarr/
[2020-09-14 17:27:45 CEST]      0B 9836950.zarr/

Yep, it works! I get the magic access-control-allow-origin: *

Here’s a neuroglancer demo viewing image 55204.

Nice, @perlman! Can a pyramid (i.e. .zgroup rather than a .zarray) also be read?

Not yet. Neuroglancer supports N5 pyramids using the n5-viewer naming convention, which is the format and naming convention I’ve been using for work with @mellertd.

It should be straightforward to add an ome-zarr datasource with pyramid support. I was waiting for the format to stabilize before tackling it myself (or creating a GitHub issue on neuroglancer and seeing if someone else would do it). :wink:

Alternatively, if I can be of help having the n5 stack support the multiscales attribute, let me know.
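For reference, finding the pyramid levels from the multiscales attribute could look roughly like this. It assumes the v0.1 layout as I understand it, where the group's .zattrs holds a "multiscales" list whose first entry has a "datasets" list of {"path": ...} items; pyramid_paths is just an illustrative helper, not part of any existing reader.

```python
import json


def pyramid_paths(zattrs_text):
    """Return the array paths of each resolution level, highest resolution first.

    Returns an empty list when no multiscales metadata is present,
    e.g. for a plain .zarray without a surrounding group.
    """
    attrs = json.loads(zattrs_text)
    multiscales = attrs.get("multiscales", [])
    if not multiscales:
        return []
    return [dataset["path"] for dataset in multiscales[0]["datasets"]]


# Illustrative .zattrs content, not copied from a real dataset.
example = '{"multiscales": [{"version": "0.1", "datasets": [{"path": "0"}, {"path": "1"}, {"path": "2"}]}]}'
levels = pyramid_paths(example)
```

Each returned path names a sub-array inside the group, so a reader would then load <path>/.zarray for whichever level it needs.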

Something changed with idr-s3.openmicroscopy.org in the past week: it now sends two access-control-allow-origin: * headers back to the client.

This is causing both Firefox and Chrome to deny Neuroglancer access to the data:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://idr-s3.openmicroscopy.org/idr/outreach/55204.zarr/.zarray. (Reason: CORS header ‘Access-Control-Allow-Origin’ does not match ‘*, *’).

This seems to occur only when the request contains an ‘origin’ field, e.g.:

curl -v -s https://idr-s3.openmicroscopy.org/idr/outreach/55204.zarr/.zattrs --header 'origin: https://neuroglancer-demo.appspot.com'
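The same check as the curl command can be sketched in Python. The cors_allows function below is a simplified model of what browsers do (it ignores credentials and other CORS headers): access is granted only if exactly one Access-Control-Allow-Origin value comes back and it is either * or the requesting origin. Both function names are made up for this illustration.

```python
import urllib.request


def cors_allows(origin, allow_origin_values):
    """Simplified browser rule: exactly one header value, equal to '*' or the origin."""
    if len(allow_origin_values) != 1:
        return False
    value = allow_origin_values[0].strip()
    return value == "*" or value == origin


def allow_origin_values(url, origin):
    """Request `url` with an Origin header and return every
    Access-Control-Allow-Origin value in the response (needs network access)."""
    request = urllib.request.Request(url, headers={"Origin": origin})
    with urllib.request.urlopen(request) as resp:
        return resp.headers.get_all("Access-Control-Allow-Origin") or []


# Example:
#   origin = "https://neuroglancer-demo.appspot.com"
#   url = "https://idr-s3.openmicroscopy.org/idr/outreach/55204.zarr/.zattrs"
#   print(cors_allows(origin, allow_origin_values(url, origin)))
```

A duplicated header fails this check whether it arrives as two separate values or as one combined "*, *" value, which matches the error message above.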

@joshmoore Do you have any idea what changed?