Training StarDist on a mixture of cell types - class imbalance

Dear forum,

I have a 20 TB dataset containing fluorescence images of cells circulating in the blood of cancer patients. I would like to use StarDist to segment whole cells from immunofluorescence images of DAPI (nucleus), CD45-APC, and epithelial cytokeratin-PE. The segmentation will be used as input to a DL network for classification.

StarDist works amazingly well when I train and test it on culture cell images (thank you to the StarDist team for making it so accessible!). However, in patient samples, StarDist tends to segment only the nucleus; see the image below for some CK+ cells. The vast majority of cells in patient samples are just a nucleus (possibly with a thin shell of CD45-APC), and <1% of cells are a nucleus with cytokeratin-PE. So I think the data doesn't contain enough examples of PE+ cells, and training focuses on segmenting the CD45+DAPI+ or just DAPI+ white blood cells.

[image: CK+ cells where only the nucleus is segmented]

  • Is it possible to overweight the PE+ events during training to compensate for this imbalance? For example, would it make sense to add a mask that indicates the weight of each event, so the loss can be weighted per class? (See the first sketch after this list.)
  • I could train on smaller cutouts of the images containing PE+ cells (and the nearby white blood cells). The larger the cutout, the more PE- cells I include, so how large does the cutout need to be? The train_patch_size is (256, 256); does this mean 256x256 is the minimum? (See the second sketch below.)
  • I could add images of epithelial culture cells to the dataset. They are PE+ and DAPI+, so I would get more training data in the direction I need to improve, but culture cells tend to be much brighter in PE and DAPI and larger than patient tumour cells. Would that help? (See the third sketch below for the normalisation I have in mind.)
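
To make the first point concrete: as far as I understand, model.train() draws random patches from the list of training images, so a crude workaround (if per-class loss weights aren't supported) would be to repeat the images that contain PE+ cells so they are sampled more often. A minimal sketch; `contains_pe_positive` is a hypothetical list of booleans I would derive from my own annotations:

```python
def oversample_rare(X, Y, is_rare, factor=20):
    """Repeat the rare (PE+) images `factor` times so that random
    patch sampling during training sees them more often.
    X: list of images, Y: list of label masks,
    is_rare: booleans marking images with at least one PE+ cell."""
    X_out, Y_out = [], []
    for x, y, rare in zip(X, Y, is_rare):
        reps = factor if rare else 1
        X_out.extend([x] * reps)
        Y_out.extend([y] * reps)
    return X_out, Y_out

# hypothetical usage; X_trn, Y_trn and contains_pe_positive come from my own data loading:
# X_trn_bal, Y_trn_bal = oversample_rare(X_trn, Y_trn, contains_pe_positive, factor=20)
# model.train(X_trn_bal, Y_trn_bal, validation_data=(X_val, Y_val), augmenter=augmenter)
```

This only changes how often PE+ images are sampled, not the loss itself, so I'd still like to know whether an explicit weight mask is possible.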
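For the second point, the kind of pre-cropping I mean would look something like this (just a sketch; `pe_centroids` would be the (y, x) centres of my annotated PE+ cells, and 256 matches my train_patch_size, hence the question whether it can be smaller):

```python
import numpy as np

def crop_around(img, lbl, center, size=256):
    """Crop a size x size window centred on a PE+ cell,
    shifted where necessary so it stays inside the image."""
    cy, cx = center
    h, w = lbl.shape[:2]
    y0 = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
    x0 = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
    return img[y0:y0 + size, x0:x0 + size], lbl[y0:y0 + size, x0:x0 + size]

# hypothetical usage for one image and its PE+ annotations:
# crops = [crop_around(img, lbl, c) for c in pe_centroids]
```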
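And regarding the third point, I assume I should at least normalise each channel per image before mixing culture and patient data, to reduce the brightness gap (standard percentile normalisation from csbdeep, applied per channel):

```python
from csbdeep.utils import normalize

def normalize_channels(img):
    """Channel-last image of shape (H, W, 3): DAPI, CD45-APC, CK-PE.
    axis=(0, 1) normalises each channel independently to its 1-99.8 percentiles."""
    return normalize(img, 1, 99.8, axis=(0, 1))
```

The size difference I can't really fix, which is why I'm not sure whether adding culture cells would help or hurt.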

Some background:
The data is acquired using an FDA IVD-cleared system, and the dataset contains images from thousands of patients, so I can't acquire new data (e.g. with an additional stain). After segmentation, I plan to use a DL classification network that takes the segmented event plus a margin around the event, so some over- or under-segmentation is not a big problem.

The StarDist configuration is pretty close to the default: grid = (2, 2); n_dim = 2; n_rays = 32; n_channel_in = 3; net_conv_after_unet = 128; train_background_reg = 0.0001; train_batch_size = 4; train_learning_rate = 0.0003; train_patch_size = (256, 256).
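
In code that looks roughly like this (the name and basedir below are just placeholders):

```python
from stardist.models import Config2D, StarDist2D

conf = Config2D(
    n_rays               = 32,
    grid                 = (2, 2),
    n_channel_in         = 3,
    net_conv_after_unet  = 128,
    train_background_reg = 1e-4,
    train_batch_size     = 4,
    train_learning_rate  = 3e-4,
    train_patch_size     = (256, 256),
)
model = StarDist2D(conf, name='stardist_patient_cells', basedir='models')  # placeholder name/basedir
```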

My training set contains 57 images of ~1250x1000 pixels with ~6000 events; ~100 are circulating tumour cells (CK-PE+DAPI+), and the rest are CD45-APC+DAPI+ or DAPI+, both probably white blood cells.