Large Image Dataset Workflow Processing

Hey all,

I wanted to announce the release of a package that some of you may be interested in: ACTK (Automated Cell Toolkit).

ACTK is a pipeline that processes field-of-view (FOV) microscopy images and generates features and render-ready products for the cells in each field. Of note, the data produced by this pipeline is used by the Cell Feature Explorer.

You can find both the input data and the produced results here. (About 220,000 single-cell images are available for download!)

Please do feel free to check it out!

I also wanted to use this opportunity to talk about how we designed it and what our goals were, since it is a pretty neat (but weird) system for managing an image processing pipeline.

At a high level, we used Prefect for workflow management and Dask for distributed computing, but it’s how we tailored them that I think is neat.

We wrote a custom task handler for Prefect that allows each individual task in the pipeline to:

  1. run independently from the rest of the pipeline
  2. manage produced data upload
  3. manage produced data checkout
  4. manage upstream data dependency download
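
To make the list above a bit more concrete, here is a minimal sketch of what a per-task wrapper along these lines could look like. To be clear, this is not ACTK's actual code and it does not use Prefect's API at all: `Step`, `run`, `push`, `pull`, and the directory standing in for remote storage are all hypothetical names I made up for illustration, written in plain standard-library Python.

```python
import json
import shutil
import tempfile
from pathlib import Path


class Step:
    """Hypothetical pipeline step that owns its own outputs (illustration only)."""

    def __init__(self, name, func, local_root, remote_root, upstream=()):
        self.name = name
        self.func = func
        self.upstream = list(upstream)
        self.local_dir = Path(local_root) / name
        # Assumption: a plain directory stands in for real remote storage.
        self.remote_dir = Path(remote_root) / name

    @property
    def result_path(self):
        return self.local_dir / "result.json"

    def pull(self):
        """Check out this step's data from the stand-in remote, if present."""
        if not self.remote_dir.exists():
            return False
        shutil.copytree(self.remote_dir, self.local_dir, dirs_exist_ok=True)
        return True

    def push(self):
        """Upload this step's local data to the stand-in remote."""
        shutil.copytree(self.local_dir, self.remote_dir, dirs_exist_ok=True)

    def load(self):
        return json.loads(self.result_path.read_text())

    def run(self):
        """Run this step on its own, fetching upstream results as needed."""
        inputs = []
        for dep in self.upstream:
            # Prefer local data, then the remote copy, then recompute.
            if not dep.result_path.exists() and not dep.pull():
                dep.run()
            inputs.append(dep.load())
        self.local_dir.mkdir(parents=True, exist_ok=True)
        result = self.func(*inputs)
        self.result_path.write_text(json.dumps(result))
        return result


root = Path(tempfile.mkdtemp())
local, remote = root / "local", root / "remote"
raw = Step("raw", lambda: [1, 2, 4], local, remote)
scaled = Step("scaled", lambda xs: [x / max(xs) for x in xs], local, remote,
              upstream=[raw])

first = scaled.run()          # "raw" is computed on demand
raw.push()                    # upload raw's output
shutil.rmtree(raw.local_dir)  # simulate a fresh machine
second = scaled.run()         # "raw" is now pulled instead of recomputed
```

The point of the sketch is that `scaled.run()` can be called by itself: the step works out where its upstream data should come from rather than relying on the whole pipeline having run first.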

Taken together and applied across the entire pipeline, these features mean that all the data the pipeline produces can be tracked at a near git-style level, both locally and whenever we decide to push the data out to the public.
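
As a rough illustration of what "git-style" tracking can mean for data, one common approach is to record a content hash for every produced file in a manifest, then compare manifests to see what changed. This is an assumption on my part for the sake of an example, not a description of how ACTK stores things, and `build_manifest` and `diff_manifests` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(directory):
    """Map each file's relative path to the SHA-256 of its contents."""
    directory = Path(directory)
    return {
        path.relative_to(directory).as_posix():
            hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old, new):
    """Report which paths were added, removed, or changed between manifests."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "cell_001.txt").write_text("features v1")
    before = build_manifest(out)

    (out / "cell_001.txt").write_text("features v2")  # modified output
    (out / "cell_002.txt").write_text("features v1")  # new output
    after = build_manifest(out)

    changes = diff_manifests(before, after)
```

Because the manifest depends only on file contents, the same comparison works for a local working copy and for whatever has been pushed out publicly.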

Happy to answer any questions.