StarDist: structure of Data Science Bowl data does not match the one required for training a StarDist model


In part 3 of the NEUBIAS talk on StarDist it was mentioned that, for training an appropriate classifier, the EPFL team adds the 15 annotated ROIs of the respective project to the dataset of data-science-bowl-2018 training images. Now I have annotated my pictures and wanted to proceed just like that, when I came to realise that the DSB dataset has a different structure: masks and images sit in subfolders which are named after the image file, and the multiple masks do not follow a corresponding nomenclature. How do you resolve this difference? I have tried uploading the data to the Jupyter notebook and also merging everything into one images and one masks folder, but it gives me an assertion error due to the different file names. So does anyone have an adjusted dataset (which they could share?) or an adjusted version of the notebook code?
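For illustration, the check that fails is essentially a file-name pairing assertion between the images and masks folders. Here is a minimal, self-contained sketch of that kind of check (the folder and file names are made up for illustration, not taken from the actual notebook):

```python
# Minimal sketch of a file-name pairing check like the one in the
# StarDist example notebooks (paths and names are illustrative).
import os
import tempfile
from glob import glob

# build a tiny dummy dataset with matching names in images/ and masks/
root = tempfile.mkdtemp()
for sub in ("images", "masks"):
    os.makedirs(os.path.join(root, sub))
    for name in ("img_001.tif", "img_002.tif"):
        open(os.path.join(root, sub, name), "w").close()

X = sorted(glob(os.path.join(root, "images", "*.tif")))
Y = sorted(glob(os.path.join(root, "masks", "*.tif")))

# this is the kind of assertion that fails when image and mask
# file names differ between the two folders
assert all(os.path.basename(x) == os.path.basename(y) for x, y in zip(X, Y))
print(len(X), "matched image/mask pairs")
```

So any combined dataset needs one mask file per image, with identical file names in both folders.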

I have linked a picture of the different folder structures and am happy to give you further information.
Thanks so much!!!

Thiemo Möllenkamp

Hi Thiemo, not sure if you’ve answered your questions already, but I was able to run the Jupyter notebook examples on my own data. It worked great and my home-trained model performs much better than the default ones. The data structure is a bit different than the data-science-bowl one: there, each object is isolated into a single image, whereas in the StarDist examples all masks for one image are combined into a single label image. Here’s a screenshot of my structure:

Here is a script the StarDist folks provide to generate an appropriate mask from your ground-truth annotated image:
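(Not the provided script itself, but for anyone reading along: a hypothetical sketch of the core idea, merging the per-object binary masks of a DSB-style folder into one integer label image, which is the format the StarDist notebooks expect. The function name and toy data are made up.)

```python
# Hypothetical sketch: merge single-object binary masks into one
# label image where each object gets a unique integer id
# (background stays 0) -- the format StarDist training expects.
import numpy as np

def merge_masks(masks):
    """Stack single-object binary masks into one uint16 label image."""
    label_img = np.zeros(masks[0].shape, dtype=np.uint16)
    for i, m in enumerate(masks, start=1):
        label_img[m > 0] = i  # assign object id i to this mask's pixels
    return label_img

# toy example: two 4x4 masks, each containing one "object" pixel
m1 = np.zeros((4, 4), dtype=np.uint8); m1[0, 0] = 255
m2 = np.zeros((4, 4), dtype=np.uint8); m2[3, 3] = 255
merged = merge_masks([m1, m2])
print(merged[0, 0], merged[3, 3])  # -> 1 2
```

In a real conversion you would read each mask file from the per-image subfolder, merge them like this, and save one label image per input image.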

good luck! -John


Hi, that is what I am trying to do as well, only that I wanted to train the model with my own data PLUS the DSB2018 set, in order to provide it with a sufficient basis for the “normal” cases of round nuclei. That is what I want the adjusted dataset for - but I have run into so many issues already: the set includes both fluorescence and H&E images, each mask contains only one nucleus, and so on.

So @johnmc would you mind telling me how many ROIs were the basis of your model? Maybe it indeed works fine with an appropriate number of self-annotated images…

I used 8 images from two different treatment groups, and the test set was another 4 images from a third treatment group. I used what I had, which were 1860 x 1396 pixel images, despite the suggestion that smaller images might be sufficient. There are roughly 1000 cells in each image. If you want to combine training sets then yes, you will have to slog through and make sure everything is compatible. Why do you need a model that can do everything?

Hi @TMM-98,
the authors of StarDist provide a curated DSB dataset for their example notebooks, which fits the required structure and should run out of the box. See the example notebook here,
or directly the assets of the initial release here.


That is exactly what I was searching for! I wasn’t able to access the files from the notebook before, but the initial release works fine - thanks!!

@johnmc - you are right in principle, I don’t need my model to be able to deal with everything. But I want to make sure that the basic nuclei detection is bulletproof and therefore want to train my model with as many “normal” cases as possible, since my main problem was that the standard model refuses to recognise my DAB+ brownish nuclei. So I take the DSB data as a basis and add about 50 images containing DAB+ cells. Thanks for your help!