What is the best way to validate a model?


Can you give me suggestions about the best way to validate a model created for automated counting?
I am comparing the number of cells predicted by the model with the real number of cells that I previously counted.
The problem is that even if the numbers are very similar, a closer look reveals errors that compensate for each other, for example false positive cells and false negative cells.
I am looking for something objective to apply, like a formula.

Maybe this will help if you want accuracy on a pixel-by-pixel basis.

If you want a pure formula, there are several you can use:

See the figure at the bottom. The choice of metric depends on what is more important to you. For example, false negatives might be very costly if a missed detection results in a death, while a false positive only triggers another test at minor cost. So you choose the metric (or metrics) that matches your specific case.
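For instance, once you have counted true positives (TP), false positives (FP), and false negatives (FN) from matched detections, the standard metrics are just a few lines (a minimal sketch; the function name is mine):

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from matched detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 90 correct detections, 10 false positives, 5 missed cells
p, r, f1 = detection_metrics(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Note that these metrics cannot be fooled by compensating errors the way a raw count can: a false positive and a false negative cancel in the count, but both lower the F1 score.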

Sure, that’s why it’s not a very good metric to look at. Typical approaches:

  • Use overlap-based detection-style evaluation as done in the StarDist Jupyter notebook examples, i.e. by using the fact that your annotated objects have a certain location and area and so do your StarDist predictions.
  • Define predicted locations to be correct if they are within a certain radius of a true object (and enforce that only one prediction matches to a true object).
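The radius-based matching in the second bullet can be sketched as an optimal one-to-one assignment between predicted and true centroids (a sketch, assuming you have centroid coordinates for both; `match_points` is a hypothetical helper, and for the overlap-based variant StarDist itself ships a `matching` function in `stardist.matching`, as used in its example notebooks):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_points(true_pts, pred_pts, radius):
    """One-to-one match predictions to ground-truth points within `radius`.
    Returns (TP, FP, FN)."""
    true_pts, pred_pts = np.asarray(true_pts), np.asarray(pred_pts)
    if len(true_pts) == 0 or len(pred_pts) == 0:
        return 0, len(pred_pts), len(true_pts)
    d = cdist(true_pts, pred_pts)           # pairwise distances
    d[d > radius] = 1e9                     # forbid matches beyond the radius
    rows, cols = linear_sum_assignment(d)   # optimal one-to-one assignment
    tp = int(np.sum(d[rows, cols] <= radius))
    return tp, len(pred_pts) - tp, len(true_pts) - tp

true_pts = [(0, 0), (10, 10), (20, 20)]
pred_pts = [(1, 0), (10, 11), (50, 50)]
print(match_points(true_pts, pred_pts, radius=3))  # (2, 1, 1)
```

The Hungarian assignment enforces that each prediction can match at most one true object, which is exactly the constraint mentioned above.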

I suggest simply googling for object-counting papers and seeing how they do this.

(As far as I understand, you are working with @constantinpape on this project. May I ask why you didn’t ask him first? I’m quite confident that he knows the answer to such questions.)

Thanks for your reply @uschmidt83. We were looking for additional input to this problem and I didn’t have time to write something up, so @Mariya_Timotey_Mitev reached out here.

Regarding your answers:
I agree that in general the count is of course not a good metric, but in this case it is the result we are actually interested in, and we have many more images with only count annotations than fully annotated ones.
(We tried pure counting methods initially, but they never worked out quite as well as segmentation with StarDist.)

Also, indeed, I think the issue here is that there are a lot of out-of-focus / dimly lit nuclei.
And so far I have only trained with rotation/flip augmentations.

So indeed adding some noise and intensity augmentations could help here!
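In case it is useful, such augmentations can be plugged into StarDist training via the `augmenter` callable used in its example notebooks; a minimal sketch (the parameter ranges here are illustrative, not tuned):

```python
import numpy as np

def augmenter(x, y):
    """Random intensity and noise augmentation, in addition to flips/rotations.
    x: normalized input image, y: label mask (labels are left untouched)."""
    x = x * np.random.uniform(0.6, 1.4)          # random intensity scaling
    x = x + np.random.uniform(-0.1, 0.1)         # random additive offset
    x = x + np.random.normal(0, 0.05, x.shape)   # additive Gaussian noise
    return x, y
```

The idea is that scaling intensities down during training exposes the network to dim nuclei it would otherwise rarely see.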


You are probably already thinking about this, but just in case…

Depending on your imaging, dimly lit and out-of-focus nuclei also tend to have dimly lit and out-of-focus markers of interest. It is sometimes best to avoid them, unless you are confident that your downstream classification methods will be able to pick up and correctly label the cells of interest versus either real background or sample background (if there is a low normal expression of the protein in negative cells, that might appear the same as an out-of-focus high expresser).


That’s a good point.

For the current setup all nuclei are relevant though, because we are running the segmentation on the NeuN channel directly.
(We have a DAPI channel too and have tried counting on DAPI as well, but surprisingly the results were worse than on the NeuN channel.)

But thinking about this more, filtering out low intensity nuclei might still make sense:
As far as I understand, the absolute number of nuclei is not so relevant; rather, it is the relative difference in count between different samples. So filtering out the low-intensity objects might help to give a better relative measure.


About the idea that "filtering out the low-intensity objects might help to give a better relative measure": it makes sense, because the majority of the errors are in these low-intensity cells. It is also true that in this specific case the automated counting ultimately has to show a difference between two groups rather than an absolute cell number. For this reason, in theory, it would even be acceptable to have an error, provided that error is fixed and present to the same degree across all the images. But this is not the case: the relative prediction error changes from image to image, 0.8% in one image, 3% in another, and so on.
I am not completely sure that the proportion of these low-intensity cells is the same across all the images; if it is, we could normalize that way. Is it possible to set up the training somehow so that it does not consider cell signals below a certain intensity?

I am not sure that is possible in StarDist itself, but you could certainly remove objects below a certain mean intensity in QuPath or FIJI (check the ROI mean value for a given channel).
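As a sketch of that post-processing step outside QuPath/FIJI, assuming you have the label image from StarDist and the corresponding intensity channel (the function name and threshold are illustrative):

```python
import numpy as np
from skimage.measure import regionprops

def filter_labels_by_intensity(labels, image, min_mean_intensity):
    """Remove labeled objects whose mean intensity is below a threshold."""
    out = labels.copy()
    for r in regionprops(labels, intensity_image=image):
        if r.mean_intensity < min_mean_intensity:
            out[out == r.label] = 0   # erase the dim object
    return out
```

Applied to both groups with the same threshold, this should make the relative counts more comparable even if dim objects are segmented inconsistently.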

Depending on how well defined the nuclei in your image are, you might also consider checking how blurry the area within the ROI is to help decide whether to remove it from analysis.
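One common objective blur measure is the variance of the Laplacian within the ROI (a minimal sketch; the cutoff you would compare the score against is something to calibrate on your own data):

```python
import numpy as np
from scipy.ndimage import laplace

def focus_score(image, mask):
    """Variance of the Laplacian inside a boolean ROI mask.
    Low values suggest the region is blurry / out of focus."""
    lap = laplace(image.astype(float))
    return float(lap[mask].var())
```

A sharp region with strong edges scores high, while a flat or defocused region scores near zero, so sorting ROIs by this score gives an objective way to flag out-of-focus nuclei.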