Help getting small images out of big zip file from urls

Following this big data release from Recursion Pharmaceuticals (~450 GB of image data from a SARS-CoV-2 virus screen, available for download as a zip at the bottom of the linked page), I thought it might be fun to try browsing the data with napari, but without having to download and unzip the whole thing.

In my ideal world I would be able to point something like dask-image at the zip, tell it where to find which images, and get back a dask array that I could lazily index into.

I found this Python remotezip library (pip installable) that lets me read individual parts of the zip file, but it is still quite slow (>10 s). Here is my example of getting just one 1024x1024, 8-bit PNG:

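from remotezip import RemoteZip

# URL of the full archive and the path of one image inside it
path = 'https://storage.googleapis.com/rxrx/RxRx19a/RxRx19a-images.zip'
file = 'RxRx19a/images/HRCE-1/Plate1/AA02_s2_w2.png'

# Read the raw bytes of that one file; `archive` avoids shadowing the builtin `zip`
with RemoteZip(path) as archive:
    image_data = archive.read(file)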

Note that to convert the raw bytes to an actual image you have to do:

from napari import view_image
from imageio import imread

image = imread(image_data)
view_image(image)

This looks nice, but it is too slow to think about putting everything inside a dask array and scrolling through it.
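
For reference, the kind of dask array I have in mind would look roughly like the sketch below. It assumes every image is 1024x1024 and 8-bit. `lazy_stack` and `read_image` are hypothetical names of mine, not part of dask or remotezip; `read_image(name)` stands for any function that returns one image as a NumPy array, for example the RemoteZip read plus `imread` shown above.

```python
import dask
import dask.array as da
import numpy as np


def lazy_stack(names, read_image, shape=(1024, 1024), dtype=np.uint8):
    """Stack one lazily loaded image per name into a single dask array.

    `read_image` is a hypothetical callable: name -> 2D NumPy array.
    `shape` and `dtype` are assumed to be the same for every image.
    """
    # Wrapping the reader in dask.delayed defers every read until compute time
    lazy_read = dask.delayed(read_image)
    planes = [
        da.from_delayed(lazy_read(name), shape=shape, dtype=dtype)
        for name in names
    ]
    # Resulting shape is (len(names), *shape), with one chunk per image
    return da.stack(planes)
```

Indexing a single plane of the result and computing it should only trigger the read for that one image, so the cost per slider step would be one `read_image` call. With reads taking >10 s each that is still far too slow to scroll through.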

Curious whether anyone has tried something like this before and has any ideas.


@sofroniewn were you able to download the zip file locally and check how long it takes to do a random-access read in a local zip file? (The stdlib zipfile module can do this.) It’d be good to know how much of that 10s is due to having remote data and how much is due to zip random access being hard in general.
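
Something along these lines would give a local baseline. It is only a sketch: `time_zip_read` is a hypothetical helper name, and it assumes the archive (or a smaller zip with a similar layout) is already on local disk.

```python
import time
import zipfile


def time_zip_read(zip_path, member):
    """Return (seconds, n_bytes) for opening a local zip and reading one member."""
    start = time.perf_counter()
    # Opening the archive parses its central directory (the file index)
    with zipfile.ZipFile(zip_path) as archive:
        # read() seeks to the member and returns its decompressed bytes
        data = archive.read(member)
    return time.perf_counter() - start, len(data)
```

Calling it with the local path and the same member name as in the remote example (`'RxRx19a/images/HRCE-1/Plate1/AA02_s2_w2.png'`) and comparing against the remote timing would show how much of the delay comes from the network.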
