Approaches to systematically quantify and compare cell segmentation quality

Dear community,

I am currently looking into measures to quantify segmentation quality to systematically compare segmentation approaches.

General approaches I identified from the literature are:

  1. Use fully labeled ground truth masks and match predicted masks to ground truth masks using the intersection-over-union (IoU) criterion. Then calculate the F1 score and count false positives, missed objects, merges, splits, …
    E.g. used here:

  2. Using manual classification of objects as under-, over-, or well-segmented.
    -> Can be combined with supervised learning to build a classifier that automates this step

  3. Other criteria such as size distribution, expected number of cell objects, …
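For the matching in approach 1, a minimal sketch of what I mean (function names are my own; it assumes integer label images with 0 as background, and an IoU threshold of at least 0.5 so each prediction can match at most one ground-truth object):

```python
import numpy as np

def iou_matrix(gt, pred):
    """Pairwise intersection-over-union between ground-truth and predicted objects.

    gt, pred: integer label images (0 = background, 1..N = objects).
    Returns an (n_gt, n_pred) array of IoU values.
    """
    n_gt, n_pred = gt.max(), pred.max()
    ious = np.zeros((n_gt, n_pred))
    for g in range(1, n_gt + 1):
        g_mask = gt == g
        for p in range(1, n_pred + 1):
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            if union:
                ious[g - 1, p - 1] = inter / union
    return ious

def f1_at_iou(gt, pred, thresh=0.5):
    """F1 score after matching objects at a given IoU threshold.

    For thresh >= 0.5 the matching is automatically one-to-one, since two
    objects can each overlap a third with IoU >= 0.5 only if they coincide.
    """
    ious = iou_matrix(gt, pred)
    tp = (ious >= thresh).any(axis=1).sum()                  # matched ground-truth objects
    fn = ious.shape[0] - tp                                  # missed ground-truth objects
    fp = ious.shape[1] - (ious >= thresh).any(axis=0).sum()  # spurious predictions
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

The double loop is quadratic in the number of objects; for large images one would accumulate the overlap counts in a single pass instead, but the idea is the same.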

Are there some other important measures I missed? Could you recommend some good papers either reviewing/establishing or simply using other ways to evaluate segmentation quality?

I would be particularly interested in measures that do not require fully labeled ground truth, as this seems rather difficult/unreliable to get in the images I am working with.

Thanks already for any hints!


Hey @votti,

Great question! The Jaccard index and the Sørensen-Dice coefficient also fall into your category 1. The two are related, so using one of them might be enough.
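For reference, the relation is D = 2J / (1 + J), so a ranking of segmentations by one index also ranks them by the other:

```python
def dice_from_jaccard(j):
    # Sørensen-Dice D = 2|A∩B| / (|A| + |B|); Jaccard J = |A∩B| / |A∪B|.
    # They are monotonically related via D = 2J / (1 + J).
    return 2 * j / (1 + j)

def jaccard_from_dice(d):
    # Inverse of the relation above.
    return d / (2 - d)
```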

You may also have heard of receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). These go a bit beyond simple (true/false)-(positive/negative) counts and provide a wider picture.
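If you have a per-pixel foreground score (e.g. a probability map) together with binary pixel labels, the AUC can be computed without any plotting via its rank-sum (Mann-Whitney U) equivalence; a small sketch with my own function name:

```python
def pixelwise_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) statistic.

    scores: per-pixel foreground scores; labels: 1 = foreground, 0 = background.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # group tied scores and assign them their average rank
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, lab in zip(ranks, labels) if lab)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 1.0 means the scores separate foreground from background perfectly; 0.5 is chance level.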

Depending on the field you work in, other metrics such as the contour distance and the Hausdorff distance may be important. These measure physical distances between outlines, which is relevant in radiotherapy, for example. Measuring and comparing the signal intensities inside the regions of interest is relevant in nuclear medicine / radiology.
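The (symmetric) Hausdorff distance between two outlines is the largest distance from any point on one contour to the nearest point on the other; a plain-Python sketch on point lists, with my own function names:

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in contour a to its nearest point in contour b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two contours (lists of (x, y) points)."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

In practice one would use an optimized implementation (e.g. from SciPy) rather than this brute-force version, and often the 95th-percentile variant, which is less sensitive to single outlier points.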

I could imagine using other segmentation algorithms for comparison instead of manual ground truth. If multiple published, peer-reviewed algorithms vote for a pixel being positive, that might be a stronger argument than a single expert annotating it as positive. Here some terms come into play: inter-observer variability, intra-observer variability, inter- and intra-algorithm variability, and the comparison of automatic and manual annotations. You could argue that if one algorithm produces segmentation results that are closer to the results of three experts than those three experts are to each other, you should prefer the automatic approach over the manual one. Proving this empirically can be very challenging, by the way.
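A pixelwise majority vote across several segmentations, plus a simple pairwise-Jaccard agreement score for quantifying inter-observer or inter-algorithm variability, could look like this (a sketch with my own function names, assuming equally shaped binary masks):

```python
from itertools import combinations
import numpy as np

def consensus_mask(masks, min_votes=None):
    """Pixelwise majority vote over several binary segmentations.

    masks: list of equally shaped boolean arrays from different algorithms
    (or observers); min_votes defaults to a strict majority.
    """
    stack = np.stack([m.astype(bool) for m in masks])
    if min_votes is None:
        min_votes = len(masks) // 2 + 1
    return stack.sum(axis=0) >= min_votes

def mean_pairwise_jaccard(masks):
    """Agreement between observers/algorithms: average Jaccard over all pairs."""
    vals = []
    for a, b in combinations(masks, 2):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        vals.append(inter / union if union else 1.0)
    return sum(vals) / len(vals)
```

The "algorithm closer to the experts than the experts are to each other" argument then amounts to comparing `mean_pairwise_jaccard` among the expert masks against the average Jaccard between the algorithm's mask and each expert mask.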

Great to have such a thread. I'm looking forward to reading more ideas from others. Thanks for asking @votti :slightly_smiling_face:


If your images are difficult to segment, either manually or automatically, then what's considered state of the art for this type of image could serve as a baseline for comparison.
As is often the case, the best approach depends on what your goal is and the type of images. You may be able to design an experiment to specifically test your segmentation approach, for example by comparing your output using one channel (e.g. transmitted light) to segmentation in another channel where the problem is easier (e.g. a fluorescence channel where only the structure of interest is labelled).


Dear colleague,

You can apply the Jaccard index.

Even better is to apply the **Jaccard distance**, because it is a true metric on all finite sets, so you can argue about proximity etc.

However, if the ground truth is not fully segmented, then instead of taking the area of intersection (which requires complete segmentation) you can simply count a hit whenever A \cap B \neq \emptyset, which allows for incomplete segmentation.
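A sketch of that hit criterion (function name is my own; it assumes a list of boolean ground-truth masks for the annotated objects and an integer label image of predictions):

```python
import numpy as np

def hit_based_recall(gt_masks, pred_labels):
    """Fraction of (possibly sparse) ground-truth objects that are 'hit',
    i.e. overlap any predicted object at all (A ∩ B ≠ ∅), regardless of IoU.

    gt_masks: list of boolean arrays, one per annotated object;
    pred_labels: integer label image (0 = background).
    Suitable when only a subset of the objects in the image is annotated.
    """
    hits = sum(1 for m in gt_masks if (pred_labels[m] > 0).any())
    return hits / len(gt_masks)
```

Note that a plain hit criterion rewards over-segmentation, so it is best paired with a second check (e.g. on object sizes) when comparing approaches.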

Best regards,