Multiscale arrays v0.1

The following text is being transferred from zarr-specs#50 after the last Zarr community meeting. The metadata specification was thought to be more specific to the imaging domain and is in fact in use by a number of image.sc community repositories. Rather than defining a new GitHub home or duplicating it in multiple repositories, it’s being published here with a new “convention” tag.


As a first version of support for the multiscale use case, this issue proposes an intermediate nomenclature for describing groups of hierarchical (formerly “Zarr”) arrays which are scaled-down versions of one another, e.g.:

example/
├── 0    # Full-sized array
├── 1    # Scaled down 0, e.g. 0.5; for images, in the X&Y dimensions
├── 2    # Scaled down 1, ...
├── 3    # Scaled down 2, ...
└── 4    # Etc.

This layout was independently developed in a number of implementations and has since been implemented in others. Using a common metadata representation across implementations:

  1. fosters a common vocabulary between existing implementations
  2. enables other implementations to reliably detect multiscale arrays
  3. permits the upgrade of v0.1 arrays to future versions of this or other extensions
  4. tests this extension for limitations against multiple use cases

A basic example of the metadata that is added to the containing hierarchical group is seen here:

{
  "multiscales": [
    {
      "datasets" : [
          {"path": "0"},
          {"path": "1"},
          {"path": "2"},
          {"path": "3"},
          {"path": "4"}
        ],
      "version" : "0.1"
    }
     // See the detailed example below for optional metadata
  ]
}
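A reader can detect a multiscale group simply by the presence of the multiscales key. A minimal sketch, using only the standard library against a parsed .zattrs document (the variable names here are illustrative, not part of the convention):

```python
import json

# Hypothetical .zattrs content, matching the basic example above.
zattrs = json.loads("""
{
  "multiscales": [
    {
      "datasets": [
        {"path": "0"}, {"path": "1"}, {"path": "2"},
        {"path": "3"}, {"path": "4"}
      ],
      "version": "0.1"
    }
  ]
}
""")

# Detection: a group is a multiscale series iff "multiscales" is present.
multiscales = zattrs.get("multiscales", [])
assert multiscales, "not a multiscale group"

# By convention, take the first series; paths are ordered largest to smallest.
paths = [d["path"] for d in multiscales[0]["datasets"]]
print(paths)  # ['0', '1', '2', '3', '4']
```

Each path can then be opened as an ordinary array relative to the annotated group.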

Detailed example

Color key (according to https://www.ietf.org/rfc/rfc2119.txt):

- MUST     : If these values are not present, the multiscale series will not be detected.
! SHOULD   : Missing values may cause issues in future versions.
+ MAY      : Optional values which can be readily omitted.
# UNPARSED : When updating between versions, no transformation will be performed on these values.

Color-coded example:

-{
-  "multiscales": [
-    {
!      "version": "0.1",
!      "name": "example",
-      "datasets": [
-        {"path": "0"},
-        {"path": "1"},
-        {"path": "2"}
-      ],
!      "type": "gaussian",
!      "metadata": {
+        "method":
#          "skimage.transform.pyramid_gaussian",
+        "version":
#          "0.16.1",
+        "args":
#          [true],
+        "kwargs":
#          {"multichannel": true}
!      }
-    }
-  ]
-}

Explanation

  • The multiscales key of the group metadata contains a list of multiscale series, permitting multiple series to be present in a single group.
  • By convention, the first multiscale should be chosen if all else is equal.
  • Alternatively, a multiscale can be chosen by name or, with slightly more effort, by the .zarray metadata such as chunk size.
  • The paths to the arrays in the datasets series MUST be ordered from largest (i.e. highest resolution) to smallest.
  • These paths could potentially point to datasets in other groups via “…/foo/0” in the future. For now, the identifiers MUST be local to the annotated group.
  • The type values SHOULD come from the enumeration below.
  • The metadata example is taken from https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.pyramid_reduce

Type enumeration:

Sample code for Zarr

#!/usr/bin/env python
import argparse
import zarr
import numpy as np
from skimage import data
from skimage.transform import pyramid_gaussian, pyramid_laplacian

parser = argparse.ArgumentParser()
parser.add_argument("zarr_directory")
ns = parser.parse_args()


# 1. Setup of data and Zarr directory
base = np.tile(data.astronaut(), (2, 2, 1))

gaussian = list(
    pyramid_gaussian(base, downscale=2, max_layer=4, multichannel=True)
)

laplacian = list(
    pyramid_laplacian(base, downscale=2, max_layer=4, multichannel=True)
)

store = zarr.DirectoryStore(ns.zarr_directory)
grp = zarr.group(store)
grp.create_dataset("base", data=base)


# 2. Generate datasets
series_G = []
for g, dataset in enumerate(gaussian):
    if g == 0:
        path = "base"
    else:
        path = "G%s" % g
        grp.create_dataset(path, data=dataset)
    series_G.append({"path": path})

series_L = []
for i, dataset in enumerate(laplacian):
    if i == 0:
        path = "base"
    else:
        path = "L%s" % i
        grp.create_dataset(path, data=dataset)
    series_L.append({"path": path})


# 3. Generate metadata block
multiscales = []
for name, series in (("gaussian", series_G),
                     ("laplacian", series_L)):
    multiscale = {
      "version": "0.1",
      "name": name,
      "datasets": series,
      "type": name,
    }
    multiscales.append(multiscale)
grp.attrs["multiscales"] = multiscales

which results in a .zattrs file of the form:

{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "G1"
                },
                {
                    "path": "G2"
                },
                {
                    "path": "G3"
                },
                {
                    "path": "G4"
                }
            ],
            "name": "gaussian",
            "type": "gaussian",
            "version": "0.1"
        },
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "L1"
                },
                {
                    "path": "L2"
                },
                {
                    "path": "L3"
                },
                {
                    "path": "L4"
                }
            ],
            "name": "laplacian",
            "type": "laplacian",
            "version": "0.1"
        }
    ]
}

and the following on-disk layout:

/var/folders/z5/txc_jj6x5l5cm81r56ck1n9c0000gn/T/tmp77n1ga3r.zarr
├── G1
│   ├── 0.0.0
...
│   └── 3.1.1
├── G2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── G3
│   ├── 0.0.0
│   └── 1.0.0
├── G4
│   └── 0.0.0
├── L1
│   ├── 0.0.0
...
│   └── 3.1.1
├── L2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── L3
│   ├── 0.0.0
│   └── 1.0.0
├── L4
│   └── 0.0.0
└── base
    ├── 0.0.0
...
    └── 1.1.1

9 directories, 54 files
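The MUST ordering rule (largest to smallest) can be checked mechanically against array shapes. A minimal sketch; the helper name validate_ordering is hypothetical, and the shapes assume the example run above (astronaut tiled 2×2, downscale=2 per level):

```python
def validate_ordering(shapes):
    """Check that each dataset is no larger than the one before it,
    dimension by dimension (largest-to-smallest ordering)."""
    return all(
        all(a >= b for a, b in zip(s1, s2))
        for s1, s2 in zip(shapes, shapes[1:])
    )

# Shapes of base, G1..G4 from the example above.
shapes = [(1024, 1024, 3), (512, 512, 3), (256, 256, 3),
          (128, 128, 3), (64, 64, 3)]
print(validate_ordering(shapes))  # True
```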
| Revision | Source | Date | Description |
| --- | --- | --- | --- |
| 0.1.8 | Feedback from @mtbc | 2020.05.18 | Add clarification of fields |
| 0.1.7 | Migration to image.sc | 2020.05.17 | Replace “Zarr” with “hierarchical”; dropped “Process” section |
| 0.1.6 | External feedback on twitter and image.sc | 2020.05.06 | Remove “scale”; clarify ordering and naming |
| 0.1.5 | External bug report from @mtbc | 2020.04.21 | Fixed error in the simple example |
| 0.1.4 | comment-599782137 | 2020.04.08 | Changed “name” to “path” |
| 0.1.3 | Discussions up through comment-59978213 | 2020.04.01 | Updated naming schema |
| 0.1.2 | comment-595505162 | 2020.03.07 | Fixed typo |
| 0.1.1 | @joshmoore | 2020.03.06 | Original text from in person discussions |

Hi @joshmoore,

Thanks for this.

Where+how would the precise downsampling factors be stored? In the metadata of each dataset I presume? I have more questions regarding this point, but will wait on them in case there’s a spec already.

John

ps. we have use cases where the downsample factors differ per dimension.

In zarr-specs#50 we punted on that issue. That would be a next step, which certainly promises to be exciting. :wink:
~J


Ooof, I was even mentioned in that thread, shame on me. Will try to stay more on top of it.

Thanks again for summarizing and pointing to it here :smile:
