Publication of idr0083: SARS-CoV-2 up close in a human organoid

Data for Lamers et al. showing SARS-CoV-2 in intestinal organoids is now available in the IDR.

The dataset is available as idr0083. For close-up views of the viruses, see:

along with the other ROIs.

The two 30 GB images have also been converted to Zarr with bioformats2raw and are available via public S3, thanks to EMBL-EBI, at:

with the layout as specified in https://github.com/ome/omero-ms-zarr/blob/08f9c8a07194cc0c3da16c95682fee9f0ba46bf5/spec.md:

    ├── .zgroup               # Each image is a Zarr group with multiscale metadata.
    └── 0                     # Each multiscale level is stored as a separate Zarr array.
        ├── .zarray           #
        ├── 0.0.0.0.0         # Chunks are stored with the flat directory layout.
        └── t.c.z.y.x         # All image arrays are 5-dimensional
                              # with dimension order (t, c, z, y, x).
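
Under this layout, the flat chunk keys are just the 5D chunk indices joined with dots. A minimal sketch (the helper below is purely illustrative, not part of the spec):

# Illustrative helper: the flat key for the chunk at index
# (t, c, z, y, x) is simply the indices joined with '.'.
def chunk_key(t, c, z, y, x):
    return f"{t}.{c}.{z}.{y}.{x}"

# With 1024x1024 chunks in (y, x), the chunk covering pixels
# y=1024..2047, x=0..1023 of level 0 is stored at 0/0.0.0.1.0.
print(chunk_key(0, 0, 0, 1, 0))  # -> 0.0.0.1.0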

The @OMETeam


** These are S3 endpoints rather than public webpages. See the Python code below or use the AWS CLI to explore them. For example:

$ aws s3 cp --no-sign-request --endpoint-url https://s3.embassy.ebi.ac.uk s3://idr/zarr/v0.1/9822151.zarr/0/.zarray - | jq
{
  "shape": [
    1,
    1,
    1,
    167424,
    79360
  ],
  ...
12 Likes

Hi @OMETeam,

I’ve tried to download the dataset but saw that idr0083 has only 2 images. When I use the S3 links you provided, I get an Access Forbidden error. Could you please check that the objects are publicly readable?

Thank you!

Beatriz

1 Like

Pretty sure directory listings are forbidden there by design, @BeatrizSerrano. The Zarr dataset for those images comprises many thousands of files and is designed to be used directly from EMBL-EBI rather than “downloaded” per se. For example:

$ ipython
Python 3.5.2 (default, Oct  8 2019, 13:06:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import s3fs

In [2]: import zarr

In [3]: s3 = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': 'https://s3.embassy.ebi.ac.uk/'})

In [4]: store = s3fs.S3Map(root='idr/zarr/v0.1/9822151.zarr', s3=s3, check=False)

In [5]: root = zarr.group(store=store)

In [6]: root.attrs.asdict()
Out[6]:
{'multiscales': [{'datasets': [{'path': '0'},
    {'path': '1'},
    {'path': '2'},
    {'path': '3'},
    {'path': '4'},
    {'path': '5'},
    {'path': '6'},
    {'path': '7'},
    {'path': '8'},
    {'path': '9'},
    {'path': '10'}],
   'version': '0.1'}]}

In [7]: resolution = root['/0']

In [8]: resolution.info
Out[8]:
Name               : /0
Type               : zarr.core.Array
Data type          : >u2
Shape              : (1, 1, 1, 167424, 79360)
Chunk shape        : (1, 1, 1, 1024, 1024)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : fsspec.mapping.FSMap
No. bytes          : 26573537280 (24.7G)
Chunks initialized : 0/12792
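
Slicing the array then downloads only the chunks covered by the request, e.g. (a minimal follow-on sketch; the crop is arbitrary):

# Arbitrary 1024x1024 crop: zarr fetches just the single chunk
# covering this region and returns a NumPy array.
tile = resolution[0, 0, 0, :1024, :1024]
print(tile.shape)  # (1024, 1024)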
1 Like

@chris-allan

I’m getting the following error when I try to access the data as you described (In [5] above).

import s3fs
import zarr

s3 = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': 'https://s3.embassy.ebi.ac.uk/'})
store = s3fs.S3Map(root='idr/zarr/v0.1/9822151.zarr', s3=s3, check=False)
root = zarr.group(store=store)

error:

    root = zarr.group(store=store)
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/zarr/hierarchy.py", line 1054, in group
    path=path)
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/zarr/storage.py", line 432, in init_group
    chunk_store=chunk_store)
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/zarr/storage.py", line 453, in _init_group_metadata
    store[key] = encode_group_metadata(meta)
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/fsspec/mapping.py", line 96, in __setitem__
    f.write(value)
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/fsspec/spec.py", line 1245, in __exit__
    self.close()
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/fsspec/spec.py", line 1213, in close
    self.flush(force=True)
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/fsspec/spec.py", line 1085, in flush
    self._initiate_upload()
  File "/Users/nsofroniew/opt/anaconda3/lib/python3.7/site-packages/s3fs/core.py", line 1004, in _initiate_upload
    raise translate_boto_error(e)
PermissionError: Access Denied

Any help/tips would be appreciated! Thanks :slight_smile:

Also curious if you know the syntax to make this work with dask.array.from_zarr. I’m particularly interested in that as I’d like to leverage some of dask’s caching functionality.

Turns out the following works fine for me; I’m not sure why the other approach failed:

import dask.array as da

path = 'https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/9822151.zarr'
resolutions = [da.from_zarr(path, component=str(i)) for i in range(11)]
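
The resulting dask arrays are lazy, so only the chunks a computation touches are actually fetched, e.g. (a sketch; the level choice is arbitrary):

# Materialize the smallest pyramid level; only its few chunks are
# downloaded when compute() runs.
thumb = resolutions[-1][0, 0, 0].compute()
print(thumb.shape)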
1 Like

Hi @sofroniewn

The following code snippet works fine for me.

import dask.array as da
import s3fs
import zarr

def load_binary_from_s3(id, resolution='4'):
    cache_size_mb = 2048
    cfg = {
        'anon': True,
        'client_kwargs': {
            'endpoint_url': 'https://s3.embassy.ebi.ac.uk',
        },
        'root': 'idr/zarr/v0.1/%s.zarr/%s/' % (id, resolution)
    }
    s3 = s3fs.S3FileSystem(
        anon=cfg['anon'],
        client_kwargs=cfg['client_kwargs'],
    )
    store = s3fs.S3Map(root=cfg['root'], s3=s3, check=False)
    cached_store = zarr.LRUStoreCache(store, max_size=(cache_size_mb * 2**20))
    # data.shape is (t, c, z, y, x) by convention
    return da.from_zarr(cached_store)
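
For example (a minimal usage sketch against the image above):

# Lazily open resolution level 4 of image 9822151; chunk reads are
# cached by the 2 GB LRU store above.
data = load_binary_from_s3('9822151')
print(data.shape)  # (t, c, z, y, x)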

For my purposes, I was checking out only one resolution.
Jmarie

1 Like

Thank you @chris-allan. I guess it’s not possible to download the data using Aspera either, as I don’t see the dataset available under idr0083 in the web client. Is there any other way to get the raw images locally?

I’ll let the IDR team comment on Aspera availability, @BeatrizSerrano.

@sofroniewn: What that traceback is telling you, in a not-so-helpful way, is that Zarr thinks the group metadata is missing. Zarr then tries to initialize a new group, which of course fails because the bucket is read-only.

The line numbers in your stack trace don’t match the latest versions of zarr (2.4.0), s3fs (0.4.2), or fsspec (0.7.3). These packages are all under very active development, so I’d recommend upgrading if you can and trying again. If that’s not easy for you to do, I’d experiment with leading and trailing slashes in your endpoint_url and root specifications.
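
As a side note, opening the group read-only avoids the accidental write attempt entirely and fails with a clearer error if the metadata really is missing (a sketch of the standard zarr API, not a fix for the root cause):

# mode='r' opens the existing group read-only; zarr raises on a
# missing .zgroup instead of trying to create one in the bucket.
root = zarr.open_group(store, mode='r')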

2 Likes

Hi Beatriz,

The request for Aspera access has gone in. We’ll keep you posted.
~Josh

Awesome, thank you both! :slight_smile:

Nice, @sofroniewn, thanks! I hadn’t come across this shortcut. I’d slightly extend it to:

import requests
import dask.array as da

path = 'https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/9822151.zarr'
try:
    datasets = [x['path'] for x in
                requests.get(f'{path}/.zattrs').json()["multiscales"][0]["datasets"]]
except (KeyError, ValueError):
    # ValueError also covers a missing or non-JSON .zattrs response.
    datasets = ["0"]
resolutions = [da.from_zarr(path, component=str(i)) for i in datasets]

which takes the first set of multiscales. We’ve already run into the use case where, depending on usage, the user might want to pick a different multiscale (e.g. 2D vs. 3D). Our proposal would be to do that by name:

multiscales = requests.get(f'{path}/.zattrs').json()["multiscales"]
datasets = []
for named in multiscales:
    if named.get('name') == '3D':
        datasets = [x['path'] for x in named["datasets"]]
        break
if not datasets:
    # Use the first by default. Or perhaps choose based on chunk size?
    datasets = [x['path'] for x in multiscales[0]["datasets"]]
...
2 Likes

@BeatrizSerrano I hope idr0083 will be publicly available via Aspera early next week. As @joshmoore said, we will update this thread as soon as this is the case.

In the interim, you should be able to download the OME-TIFF files submitted to EMPIAR alongside the IDR submission - see EMPIAR-10404 for more details.

Update (2020-05-05): the raw data for idr0083 is now downloadable via Aspera.

2 Likes