DBScan Cluster Analysis in QuPath

Script to implement DBScan in QuPath.
DBScan described here: https://en.wikipedia.org/wiki/DBSCAN

// DBScan 1.2 implementation for QuPath 0.2.0 - Michael Nelson, May 2020
// Based heavily on code from https://bhanuchander210.github.io/Tutorial-Machine-Learning-With-Groovy/
// With major suggestion from Sara Mcardle
// Instigated by Colt Egelston
// Version 2.0 Added cluster size as a measurement

//Probably should read this stuff the second time after it doesn't work quite right the first time.
////////////////////////////////////////////////////////////////////////////////
//micronsBetweenCentroids (which is converted into "eps") and minPts adjust the behavior of the clustering.
//eps and minPts are as described in the DBSCAN wiki.
//Set baseClasses to true if you want to ignore complex classes and use the component base classes from multiplex classifications.

/////////////////////////////////////////////////////////////////////////////////


//distance to search around an object for another centroid.
double micronsBetweenCentroids = 30.0
//Minimum number of objects needed to be considered a cluster
int minPts = 5

boolean baseClasses = false
boolean ifExport = true

double eps = micronsBetweenCentroids/getCurrentServer().getPixelCalibration().getPixelWidthMicrons()
print eps

//Get the classes you want to analyze. Avoids Negative and no class by default.
Set classSet = []
List classList = []
if (!baseClasses){
    
    for (object in getCellObjects()) {
        c = object.getPathClass()
        if (c != getPathClass("Negative")){
            classSet << c
        }
    }
    
    classList.addAll(classSet.findAll{
        //If you only want one class, use it == getPathClass("MyClassHere") instead
       it != null 
    })
    print "Using full class"
}else{
    for (object in getCellObjects()) {
        parts = PathClassTools.splitNames(object.getPathClass())
        parts.each{
            if (it != "Negative"){
                classSet << it
            }
        }
    }
    classList.addAll(classSet.findAll{
        //If you only want one sub-class, use it == getPathClass("MyClassHere") instead
       it != null 
    })

}
toRemove = PathClassifierTools.getAvailableFeatures(getDetectionObjects()).findAll{it.contains("Cluster ")}
toRemove.each{removeMeasurements(qupath.lib.objects.PathCellObject,it)}

print classList
List export = []
Set classes = []
String text
classList.each{ c->
    //Storage for stuff we do later. points will hold the XY coordinates as DoublePoint objects
    List<DoublePoint> points = new ArrayList<DoublePoint>()
    //The Map allows us to use the DoublePoint to match the list of coordinates output by DBScan to the QuPath object
    Map<DoublePoint, Double> pointMap = [:]
    
    //Get the objects of interest for this class or sub-class
    if(baseClasses){
        batch = getDetectionObjects().findAll{it.getPathClass()?.toString()?.contains(c)}
        text = c.toString()
    }else{
        batch = getDetectionObjects().findAll{it.getPathClass() == c}
        text = c.toString()

    }
    
    //print batch.size()
    //Prep each object being analyzed for clustering.
    batch.eachWithIndex{d,x->
        //create the unique identifier, if you want to look back at details
        //d.getMeasurementList().putMeasurement("ID",(double)x)
        
        //Reset previous cluster analyses for the given cell
        
        
        //create the linker between the ID and the centroid
        double [] point = [d.getROI().getCentroidX(), d.getROI().getCentroidY()]
        DoublePoint dpoint = new DoublePoint(point)
        //print dpoint
        points[x] = dpoint
        
        //Key point here, each index (cell in most cases) is tracked and matched to its XY coordinates
        pointMap[dpoint]= (double)x
    }
    
    //print points if you want to see them all
    def showClosure = {detail ->
        //println "Cluster : " + detail.cluster + " Point : " + detail.point + " Label : "+ detail.labels
        //print "labels "+(int)detail.labels
        //print "cluster"+detail.cluster
        
        //this uses the label (the index from the "batch") to access the correct cell, and apply a measurement with the correct cluster number
        batch[detail.labels]?.getMeasurementList()?.putMeasurement("Cluster "+text,detail.cluster )
        batch[detail.labels]?.getMeasurementList()?.putMeasurement("Cluster Size "+text,detail.clusterSize )
        classes << text
    }
    
    //Main run statements
    DBSCANClusterer DBScan = new DBSCANClusterer(eps, minPts)
    
    collectDetails(DBScan.cluster(points), pointMap).each(showClosure)

}

if (ifExport){
//Setup for data export
def header = 'Annotation,Cluster class,Cluster number,Cluster size'
def fileName = buildFilePath(PROJECT_BASE_DIR,getProjectEntry().getImageName() + '.csv')
print "Exporting to " + fileName
print "Wait for it...."

//create list to export
export = []
getAnnotationObjects().each{anno ->
    
    classes.each{ cl ->
        cells = anno.getChildObjects().findAll{it.isCell() && it.getPathClass()?.toString() == cl}
        //for each class within each annotation
        Set clusters =[]
        //gather information about the number of clusters
        cells.each{ cell ->
            //Cells that were never assigned to a cluster (noise points) do not get this measurement, so skip their NaN values to avoid NA lines in the export
            if (!measurement(cell,"Cluster "+cl).isNaN()){
                clusters << [anno.toString(), "Cluster "+cl,measurement(cell,"Cluster "+cl),measurement(cell,"Cluster Size "+cl)]
            }
        }
        clusters.sort()
        clusters.each{ export << it}
    }
}

new File(fileName).withWriter { fileWriter ->
    fileWriter.println(header)
    export.each{
        
        fileWriter.println(it.toString()[1..-2])
    }
}
}
print "Done!"


//Things from the website linked at the top that I messed with very little.
//Used to extract information from the result of DBScan, might be useful if I play with other kinds of clustering in the future.
List<ClusterDetail> collectDetails(def clusters, pointMap)
{
    List<ClusterDetail> ret = []
    clusters.eachWithIndex{ c, ci ->
        c.getPoints().each { pnt ->
            DoublePoint pt = pnt as DoublePoint
            ret.add new ClusterDetail (ci +1 as Integer, pt, pointMap[pnt], c.getPoints().size())
        }
    }
    ret
}

class ClusterDetail
{
    int cluster
    DoublePoint point
    double labels
    int clusterSize
    ClusterDetail(int no, DoublePoint pt, double labs, int size)
    {
        cluster = no; point= pt; labels = labs; clusterSize = size
    }
}

import org.apache.commons.math3.ml.clustering.DBSCANClusterer
import org.apache.commons.math3.ml.clustering.DoublePoint

This is somewhat of a continuation of my Hotspot post. The main reason I wanted to try DBScan was the issue with Delaunay clusters, where a single cell getting in the way could disrupt a cluster, as shown in this post. In some cases, when you only want a bunch of touching objects, this is great! Biology is often messy, though, and I wanted to look at clusters of objects that might have other cells in between them. Thus, DBScan.
The algorithm is best described by the diagram on the Wikipedia page linked above: you need a minimum number of points for a cluster, and a distance. Any point that has at least minPts neighbors within that distance is part of the core of a cluster, and any cell within range of one of those core cells is also part of the cluster, allowing some “stragglers nearby” to be counted as part of the cluster.
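If you want to see that behavior in isolation before pointing it at real cells, here is a minimal sketch using the same Apache Commons Math classes the script imports; the toy coordinates are made up purely for illustration:

```groovy
// Toy DBSCAN run with the same Apache Commons Math classes the main script uses
import org.apache.commons.math3.ml.clustering.DBSCANClusterer
import org.apache.commons.math3.ml.clustering.DoublePoint

// Two tight groups of points plus one isolated point
def points = [
        [0d, 0d], [1d, 0d], [0d, 1d], [1d, 1d],          // group A
        [10d, 10d], [11d, 10d], [10d, 11d], [11d, 11d],  // group B
        [50d, 50d]                                       // isolated point
].collect { new DoublePoint(it as double[]) }

// eps = 2 (neighbor search radius), minPts = 3 (neighbors needed to be a core point)
def clusters = new DBSCANClusterer(2d, 3).cluster(points)

clusters.eachWithIndex { c, i ->
    println "Cluster ${i + 1}: ${c.points.size()} points"
}
// The isolated point at (50, 50) has no neighbors, so it is treated as noise
// and never appears in any returned cluster.
```

This is also why, in the main script, cells that do not end up in any cluster simply never receive a “Cluster” measurement.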

The actual work of the script is done mainly through Apache Commons Math, which also has options for k-means clustering and a few other algorithms that wouldn’t be too hard to implement using the same general approach.
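For example, a k-means version could reuse the same centroid-gathering approach; the sketch below is untested and k = 3 is an arbitrary choice, but it shows the main practical difference, which is that you must pick the number of clusters up front instead of eps and minPts:

```groovy
// Hedged sketch: k-means over cell centroids using the same Apache Commons Math package.
// Unlike DBSCAN, k-means needs the number of clusters (k) chosen in advance.
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer
import org.apache.commons.math3.ml.clustering.DoublePoint

def points = getCellObjects().collect {
    new DoublePoint([it.getROI().getCentroidX(), it.getROI().getCentroidY()] as double[])
}

def kmeans = new KMeansPlusPlusClusterer(3)   // k = 3 is purely illustrative
def clusters = kmeans.cluster(points)

clusters.eachWithIndex { c, i ->
    println "Cluster ${i + 1}: ${c.points.size()} cells"
}
```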

To use the script

There are three main variables to edit at the top of the script.
micronsBetweenCentroids is the radius of the search circle around each cell centroid; in other words, the maximum distance between the centroids of two cells for them to be treated as neighbors when building a cluster. I generally adjust this until I am getting results that match what I visually expect for a “cluster” in a given study.

minPts essentially determines the density: it is the number of neighboring cells that need to fall within the search distance of a given cell for that cell to anchor a cluster.
Do not mistake this for the minimum size of your cluster!!
It is effectively that, but that is not how you should be using it. Think of it as density. For example, with micronsBetweenCentroids = 15 and minPts = 5, each cell would have a circle with a radius of 15 microns; to be counted as part of the core of a cluster, there would need to be at least 5 other cells with centroids within that 15 micron radius circle. If I set minPts to 60 because I only wanted clusters of 60 cells or greater, I would need to fit 60 cells within 15 microns of any given cell!
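If you are unsure what a sensible combination of the two looks like for your tissue, a rough sanity check like the sketch below (slow, brute-force, and only meant for small cell counts) reports how many other centroids fall within the search radius of a typical cell, which is essentially the quantity minPts is compared against:

```groovy
// Rough, brute-force density check: how many other cell centroids sit within the
// search radius of each cell? O(n^2), so only run it on a small region.
double micronsBetweenCentroids = 15.0   // same meaning as in the main script
double radiusPx = micronsBetweenCentroids / getCurrentServer().getPixelCalibration().getPixelWidthMicrons()

def cells = getCellObjects()
def neighborCounts = cells.collect { c ->
    double cx = c.getROI().getCentroidX()
    double cy = c.getROI().getCentroidY()
    cells.count { o ->
        if (o.is(c)) return false
        double dx = o.getROI().getCentroidX() - cx
        double dy = o.getROI().getCentroidY() - cy
        Math.sqrt(dx * dx + dy * dy) <= radiusPx
    }
}
if (neighborCounts) {
    neighborCounts.sort()
    println "Median neighbors within " + micronsBetweenCentroids + " microns: " + neighborCounts[neighborCounts.size().intdiv(2)]
}
```

If that median is far below the minPts you had in mind, most cells will never become part of any cluster.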

baseClasses is fairly straightforward. If you are using multiplexing to assign multiple markers/classes to a cell, as described here, you can choose to analyze each individual base class rather than the combined classes.
true: CD8 cells are analyzed, then CD68 cells are analyzed. CD8+CD68 cells are not analyzed.
false: CD8+CD68 cells are analyzed, then CD8 cells are analyzed, then CD68 cells are analyzed.

This can have a major impact depending on the type of question you are asking, and how many combinations of classes might be in your image.
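To make the difference concrete, these are essentially the two selection modes used inside the script (“CD8” is just a placeholder marker name):

```groovy
// The two ways the script gathers cells for a given class, depending on baseClasses.
// "CD8" is a placeholder; substitute one of your own class names.

// baseClasses = true: the script matches by name with contains(), so a cell classed
// "CD8: CD68" is also picked up during the CD8 run
def cd8AnyCombination = getDetectionObjects().findAll { it.getPathClass()?.toString()?.contains("CD8") }

// baseClasses = false: only cells whose full class matches exactly
def cd8Exact = getDetectionObjects().findAll { it.getPathClass() == getPathClass("CD8") }

println "Contains CD8: ${cd8AnyCombination.size()}, exactly CD8: ${cd8Exact.size()}"
```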

One last thing you might want to adjust, on that note, is farther down in the script.


The default is to analyze all non-Negative classes. But rather than keeping everything that is not "Negative", you could instead use something like it == getPathClass("CD8") (as noted in the comment in the script) so that the script would only analyze CD8 positive cells.

The primary problem (so far as testers have mentioned) with this script is speed, as it rapidly slows down when analyzing large numbers of cells. Restricting analysis to only your markers of interest can save you a lot of time!
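As a sketch of that change (“CD8” is again a placeholder), the findAll near the top of the script that currently keeps every non-null class could be narrowed to a single class like this:

```groovy
// Sketch: restrict the analysis to a single class of interest instead of
// "every class except Negative". Swap "CD8" for whichever class you care about.
Set classSet = getCellObjects().collect { it.getPathClass() } as Set
List classList = []
classList.addAll(classSet.findAll {
    it == getPathClass("CD8")
})
print classList
```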

Results

Once you have run the script, there are no visible changes to the viewer. New measurements have been added to each cell’s Measurement list, however, and these can be viewed through Measurement Maps.


Using everyone’s favorite LuCa image, each marker ends up with two entries in the measurement maps. “Cluster markername” labels each cluster with an incrementing number, from 1 to the total number of clusters (21 clusters, in my example).
“Cluster Size markername” contains the number of cells in that particular cluster, and allows you to see the sizes of clusters relative to one another.

Here I used the Svidro2 colormap and turned off all channels so that I could hide all clusters of 20 cells or smaller (they turn black).
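If you prefer working with the objects directly rather than the measurement maps, something along these lines (again with a placeholder class name) selects the cells belonging to the larger clusters:

```groovy
// Sketch: select every cell that belongs to a "CD8" cluster of more than 20 cells,
// assuming the main script has already been run. "CD8" is a placeholder class name.
selectObjects { cell ->
    if (!cell.isCell()) return false
    double size = measurement(cell, "Cluster Size CD8")
    !Double.isNaN(size) && size > 20
}
print "Selected " + getSelectedObjects().size() + " cells in clusters of more than 20 cells"
```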

Since all of these values are measurements in the cells, they can either be exported with the cells, or summarized and stored in the parent annotation.
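The script itself does not do that summarizing, but a rough sketch along these lines (the “CD8” class and measurement names are illustrative, and assume the main script has already been run) could count the distinct clusters per annotation and store the result as an annotation measurement:

```groovy
// Rough sketch: count the distinct CD8 clusters inside each annotation and store the
// count on the annotation itself. Adjust the class name to match your own data.
getAnnotationObjects().each { anno ->
    def cellsInAnno = anno.getChildObjects().findAll { it.isCell() }
    def clusterIDs = cellsInAnno.collect { measurement(it, "Cluster CD8") }.findAll { !Double.isNaN(it) }.toSet()
    anno.getMeasurementList().putMeasurement("Num CD8 clusters", clusterIDs.size())
}
fireHierarchyUpdate()
```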
Please feel free to ask if you have any questions!


Have you considered using HDBSCAN? I think one of the limitations of DBSCAN is that it assumes your clusters have similar densities, which may not be the case most of the time. This will often unintentionally merge two clusters together.

I played with clustering analysis some time ago, but in Python. I do not know how it would be transferable to Groovy and within QuPath.

If you haven’t come across this yet, here’s a nice comparison between clustering algorithms, and at the bottom they compare DBSCAN vs HDBSCAN.
https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html


I didn’t even know about HDBSCAN! I will take a look, thanks!
Leaving myself this link here in case I ever get to looking into it. It seems it would be significantly less easy to implement in Groovy, but there is Java code available.
https://www.programcreek.com/java-api-examples/?code=tgsmith61591%2Fclust4j%2Fclust4j-master%2Fsrc%2Fmain%2Fjava%2Fcom%2Fclust4j%2Falgo%2FHDBSCAN.java#
I am just not really sure how to use it.

If you are mathematically inclined, this may be useful.

If you are like me and cannot understand the complex equations, this should help!

I previously made some slides that summarise the two main parameters to adjust for HDBSCAN and what they do.

TL;DR min_samples adjusts the threshold from the bottom, min_cluster_size adjusts the threshold from the top. Adjust both of these to get a sweet spot capturing your cluster ‘peaks’ as best as possible.



I am using it and it’s working great! Thanks!


I have updated the original script to include a CSV file output listing the clusters by annotation, class, cluster number and size. Note that there is currently no guarantee that your clusters will not cross over between annotations, and I haven’t tested what might happen in that case. If it is working as expected, I imagine you will see the cluster registered twice, once per annotation.

The output defaults to the project folder.
There is a variable at the top of the script, “ifExport”, that can be set to false if you do not want to export the results.

I also fixed a bug related to removing old cluster measurements in the case a user is re-running the script multiple times with different settings.