Tissue microenvironments - KMeans unsupervised clustering!

Goal: Use unsupervised clustering to find regions of similar clusters of cells (microenvironments) within complex tissue. Inspired by an excellent talk from the spatial analysis series here: https://youtu.be/3IaEw_mlS4o

This process assumes you have already followed the documents to generate multiplex classes as shown here: https://qupath.readthedocs.io/en/latest/docs/tutorials/multiplex_analysis.html
You can use any other kind of classification scheme, but you will need to edit the scripts to match the input!

And to give you an idea of what I am working with:

https://downloads.openmicroscopy.org/images/Vectra-QPTIFF/perkinelmer/ (It might look familiar from the multiplex workflow in the docs!)

Once you have complex classes, you can perform an initial cluster analysis using Analyze/Spatial analysis/Delaunay cluster features 2D. The distance you choose will be project- and tissue-dependent; I chose what I felt was a fairly long distance to pick up cells slightly “past” the first line of cells, in case there was a gap.
Compare, for example, the connections generated at 25 vs 35 microns.

What does that get us?
Well, the Num neighbors measurement counts the cells we are interested in, but it doesn’t tell us anything about what they are!

Time for the first script, included here!
// Script uses Delaunay clusters (which must be run first) to add measurements to
// each cell based on the BASE classes of each of that cell's neighbors. The script
// would need to be modified for complex classes.
// QuPath 0.2.1
// Michael Nelson July 2020 - modified from Pete's script here:

//Collect the set of base class names present among the cells
Set classSet = []
for (object in getCellObjects()) {
    def pc = object.getPathClass()
    if (pc == null) continue  //unclassified cells are ignored
    PathClassTools.splitNames(pc).each { classSet << it }
}
//If you only want one sub-class, use it == getPathClass("MyClassHere") instead
List classList = classSet.findAll { it != null }

//Delaunay connections computed by the spatial analysis command
def connections = getCurrentImageData().getProperty("OBJECT_CONNECTIONS")

getCellObjects().each { cell ->
    listOfConnections = connections.getConnections(cell)
    classList.each { cl ->
        i = 0
        listOfConnections.each { con ->
            if (con.getPathClass().toString().contains(cl.toString())) { i++ }
        }
        cell.getMeasurementList().putMeasurement("Nearby " + cl.toString() + " cells", i)
    }
}

As Pete notes in the post linked in the above script, the Delaunay feature may get reworked at some point, but these scripts are good as of 0.2.1.

Running this script will add a measurement for every base class to each cell.

The measurement will be the number of cells in direct connection with the current cell that contain that base class.
The 5 teal cells to the right and below the selected (yellow) cell are all CK positive, and the two cells to the upper left are CD68/PDL1 positive. The cell directly above, in red, has no class, and so is not represented. If you want to include such cells in your analysis, make sure to change unclassified cells to negative!
Unclassified cells are currently ignored!
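To make the bookkeeping concrete, here is a small Python sketch of the same per-class neighbor counting; the cell classes and adjacency list below are made-up toy data standing in for the Delaunay connections:

```python
# Toy stand-ins (hypothetical): cell index -> base class (None = unclassified),
# plus a Delaunay-style adjacency list of direct connections.
cell_classes = {0: "CK", 1: "CK", 2: "CD68", 3: None, 4: "CK"}
connections = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0], 4: [1]}

def nearby_counts(cell, cell_classes, connections):
    """Count, per base class, the cells directly connected to `cell`.
    Unclassified (None) neighbors are skipped, as in the QuPath script."""
    counts = {}
    for neighbor in connections[cell]:
        cls = cell_classes[neighbor]
        if cls is None:
            continue  # unclassified cells are currently ignored
        counts[cls] = counts.get(cls, 0) + 1
    return counts

print(nearby_counts(0, cell_classes, connections))  # {'CK': 1, 'CD68': 1}
```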

Great, so now each cell has a set of measurements that describe its closest neighbors, to some degree. Or at least the types of markers in the environment around it. What can we do with that?

Well, first off, if we knew what kinds of environments we were looking for, we could draw training regions, select the “Nearby” measurements, and train a normal machine learning classifier to find those regions. That is fairly standard and, aside from which measurements you select, is well covered in the docs.

What I am covering next, however, is using unsupervised learning with KMeans to find clusters of cells in similar environments. There is one variable you will need to alter, and that is the K of KMeans. In short, K is the number of classes you want to divide your data into. I strongly recommend playing around with various K values, as the output can look very different. The script is almost exactly the same as the one used in the DBSCAN clustering post, found here.
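Before running it, it may help to see what K actually controls. Here is a minimal pure-Python sketch of plain (Lloyd's) k-means on made-up toy vectors; the actual script uses the Apache Commons Math KMeans++ implementation, so this is only for intuition about the assign/update loop:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of feature vectors (tuples).
    Returns a list assigning each point to a cluster index 0..k-1."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two obvious groups of toy "Nearby"-style feature vectors; k=2 should separate them.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(pts, 2)
```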

Second script, contained within!
// Kmeans++ implementation for QuPath - Michael Nelson, July 2020
// Based heavily on code from https://bhanuchander210.github.io/Tutorial-Machine-Learning-With-Groovy/
// QuPath 0.2.1
// Adjust k to the number of classes you want the cells to be distributed into.

import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer
import org.apache.commons.math3.ml.clustering.DoublePoint

//Number of groups you expect to find.
int k = 4

print("Please wait, this can be very slow!")
//Get the measurements you want to analyze. Avoids Negative and no class by default.
nearbyList = PathClassifierTools.getAvailableFeatures(getDetectionObjects()).findAll{it.contains("Nearby ")}

batch = getCellObjects()

//Storage for stuff we do later. points will hold the measurement vectors as DoublePoint objects
List<DoublePoint> points = new ArrayList<DoublePoint>()
//The Map allows us to match each DoublePoint output by the clusterer back to the QuPath object
Map<Double, DoublePoint> pointMap = [:]

//Reset previous cluster analyses for the cells
toRemove = PathClassifierTools.getAvailableFeatures(getDetectionObjects()).findAll{it.contains("K-Means ")}
toRemove.each{removeMeasurements(qupath.lib.objects.PathCellObject, it)}

//Prep each object being analyzed for clustering
batch.eachWithIndex{ d, x ->
    //collect the "Nearby" measurements for this cell into one vector
    double[] point = new double[nearbyList.size()]
    nearbyList.eachWithIndex{ feature, j ->
        point[j] = measurement(d, feature)
    }
    DoublePoint dpoint = new DoublePoint(point)
    points[x] = dpoint
    //Key point here: each index (cell in most cases) is tracked and matched to its measurement vector
    pointMap[(double)x] = dpoint
}

//Applied to each cluster result: write the cluster number and cluster size back to the matching cells
def showClosure = { detail ->
    //println "Cluster : " + detail.cluster + " Point : " + detail.point
    //find all cells that have this "point"/set of measurements
    pointMap.each{ key, val ->
        if (val.equals(detail.point)) {
            //this uses the key (the index from the "batch") to access the correct cell, and apply a measurement with the correct cluster number
            batch[key as int]?.getMeasurementList()?.putMeasurement("K-Means Clusters", detail.cluster)
            batch[key as int]?.getMeasurementList()?.putMeasurement("K-Means Cluster " + detail.cluster, detail.clusterSize)
        }
    }
}

//Main run statements
KMeansPlusPlusClusterer kMean = new KMeansPlusPlusClusterer(k)
collectDetails(kMean.cluster(points)).each(showClosure)
print "Done!"

//Things from the website linked at the top that I messed with very little.
//Used to extract information from the clustering result, might be useful for other kinds of clustering in the future.
List<ClusterDetail> collectDetails(def clusters) {
    List<ClusterDetail> ret = []
    clusters.eachWithIndex{ c, ci ->
        c.getPoints().each { pnt ->
            DoublePoint pt = pnt as DoublePoint
            ret.add new ClusterDetail(ci + 1 as Integer, pt, c.getPoints().size())
        }
    }
    return ret
}

class ClusterDetail {
    int cluster
    DoublePoint point
    int clusterSize
    ClusterDetail(int no, DoublePoint pt, int size) {
        cluster = no; point = pt; clusterSize = size
    }
}

See the first picture (all the way at the top of the post) for the cell/tissue/class distribution, roughly. The results can be visualized through Measure/Measurement Maps, or exporting the detections and sorting by their K-Means cluster.
K = 2, unsurprisingly, finds the tumor (CK positive) areas and everything else, as two groups.
K=3, now things get more interesting, with high PD-1 and PDL1 areas separating out from the rest of the stroma.
K=4 gets a little more nuanced as the rest of the stroma is split into two groups. I did not pick this up by eye at first, but when I looked at the split in Excel, it became clear that a highly PD1 environment was the primary difference. The previous purple stroma is now split into purple and… blue-ish? I guess?

I will leave it at that, but I am certain I could pull out more data by increasing the value of K!

Please let me know if you have any thoughts, comments, or recommendations on where to go next!


I have added a demo project here in case anyone wants to test it for themselves. QuPath 0.2.0-0.2.2!

Scripts are included, along with the DBSCAN script in case anyone wants to play with that as well.

The KMeans script tends to run slowly the first time, and then subsequent runs are much faster.

If you run this project from your own system, you will probably need to re-point to the image that is contained within the project folder (QuPath should pop up a dialog when you try to open the project).

This is exactly what I am doing in CytoMAP as well. I import the cell objects as a table with x,y,z position and channel MFI, then calculate the number of cells in 30-50 micrometer radius spheres evenly distributed throughout the space. Then I feed this table of numbers of cells per neighborhood into k-means or self-organizing map clustering algorithms. I added a few options for determining K, like Davies-Bouldin or the gap function, but often manually choosing the number of regions is very useful. I have a bunch of built-in tools to look at what makes the regions unique and compare the region composition of groups of samples.


I really wanted to do something more like this, and while I do have scripts that accomplish exactly that… they are very, very slow. I suspect if Pete ever works something like this into QuPath as an official function, it would run much faster. But for now, I would need to find some way around the N^2 problem given the hundreds of thousands to millions of cells that are standard for data sets.

I had not looked into ways of automating K selection, but that is an interesting idea. I also liked what was mentioned in the talk that spurred this burst of coding: they mentioned deliberately choosing a very high K value and then merging similar types of clusters manually to help deal with batch effects (this had to do with classification as well, which my script avoids handling!).

If there is enough interest, my next step might be merging this type of analysis across an entire project, and then distributing the results back out to all of the images.

Adding a link to the CytoMAP post in case anyone wants to check out some more robust tools. Getting the results back into QuPath would be the interesting part, though.

If you build the neighborhoods in an overlapping grid pattern then the field of view and not the number of cells is the limiting factor. I can analyze volumes with a few million cells broken into only a few thousand neighborhoods in a few minutes this way. If you build neighbors around each cell it takes forever to run, but sometimes is more informative depending on the question.
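A rough pure-Python sketch of the grid idea (cell positions and classes below are hypothetical, and a real implementation such as CytoMAP's uses overlapping neighborhoods rather than this simplified non-overlapping binning): one pass over the cells builds all neighborhood composition vectors, so the runtime scales with cell count while the number of neighborhoods is set by the field of view:

```python
from collections import Counter, defaultdict

def grid_neighborhoods(cells, spacing):
    """Bin cells into square neighborhoods of side `spacing`.
    Each neighborhood accumulates a count of cell classes; this is the
    table that would then be fed to k-means or a self-organizing map."""
    neighborhoods = defaultdict(Counter)
    for x, y, cls in cells:
        key = (int(x // spacing), int(y // spacing))
        neighborhoods[key][cls] += 1
    return neighborhoods

# Hypothetical cells as (x, y, class) tuples
cells = [(5, 5, "CK"), (8, 9, "CK"), (12, 4, "CD68"), (55, 60, "CK")]
hoods = grid_neighborhoods(cells, spacing=10)
```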

I really liked this approach in the talk as well and have a built-in way to do this in CytoMAP. One thing this is tangentially really nice for is forcing you to really think about what the regions are. You can’t stop at the pretty colorful plot of regions if you have to annotate what those regions mean first. It also helps get a better grasp of how the clustering algorithm is splitting things.

One other thing I have found useful in clustering tissues is having a bunch of options for ways to normalize data before clustering. Sometimes I want to define regions based on changes in local composition but other times it is useful to use total number of cells. Along those same lines weighting cell types individually can be helpful, especially when looking for changes in the environments of rare cell types.

I wish I had discovered QuPath way sooner. I’m not familiar with what format you need to import things, but this is on my list of things to do. Currently you can export data as CSV tables: either the neighborhood position and region number, or the cell data table with an added region number column.


From a CSV it is really pretty easy to get the data back into QuPath; just realize that new measurements have to be some sort of number (can’t be strings). As long as you prep by numbering the cells in QuPath first, you can use that number/index to add measurements from external CSV/TSV files. You need to do this all in a script at the moment, though, using Groovy to read in the CSV file into an array or map.
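As a sketch of what the external half of that round trip might look like in Python (the column names and cluster values here are hypothetical), the essentials are a numeric ID column to match cells on, and numeric values for any column that will be re-imported as a measurement:

```python
import csv, io

# Hypothetical exported measurements: one row per cell, keyed by a numeric "ID"
exported = io.StringIO("ID,Nearby CK cells\n0,5\n1,0\n2,3\n")
rows = list(csv.DictReader(exported))

# ...external clustering would happen here; fake a result keyed by the same ID...
cluster_of = {"0": 1, "1": 2, "2": 1}
for row in rows:
    # Must be a number, not a string, to go back in as a QuPath measurement
    row["Cluster"] = cluster_of[row["ID"]]

# Write the augmented table back out for a Groovy script to read in
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ID", "Nearby CK cells", "Cluster"])
writer.writeheader()
writer.writerows(rows)
```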

An example of how this might look was linked to in my DBSCAN post:

They created the index as an “ID” variable, then exported the data as a CSV, then did something with it, then added new columns from the altered CSV back into QuPath’s measurements per object.


Thanks for posting this Mike_Nelson, I’m really interested in this approach!

I’m trying to figure out a similar method in QuPath for creating cell IDs using a KMeans clustering approach (Phenograph). We’re working with high-dimensional image data (40-60 markers), so right now I’m using the method you just linked to; clustering outside of QuPath, creating cell IDs, and transferring the annotated CSV back into QuPath.

It would be really interesting to do both analyses on the same tissue (ideally within QuPath); classify cells with phenograph clustering, and then classify their local environment with this neighborhood clustering. I’m going to try out your scripts now, and I look forward to any updates!


Have you considered implementing an auto-estimate feature for determining K?

Here’s an example of this implementation in Python.

As far as I understand it, it uses the elbow method to determine the K with the lowest error, i.e. the K beyond which adding more clusters no longer meaningfully shortens the distance from each data point to its cluster centre. It uses Python packages, so I don’t know the underlying algorithms/maths used to calculate this, but I’m sure if you dig deeper you’ll find it.
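A minimal pure-Python sketch of the quantity the elbow method plots: run k-means for several K values and record the within-cluster sum of squares (inertia). The toy data below are hypothetical; with two well-separated groups, the inertia curve "elbows" at K = 2:

```python
import random

def kmeans_inertia(points, k, iters=30, seed=0):
    """Run plain Lloyd's k-means and return the within-cluster sum of
    squared distances (inertia), the quantity the elbow method plots against K."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids
        labels = [min(range(k), key=lambda c: d2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return sum(d2(p, centroids[l]) for p, l in zip(points, labels))

def d2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two well-separated toy groups: inertia drops sharply at k=2, then levels off.
pts = [(0, 0), (1, 0), (0, 1), (20, 20), (21, 20), (20, 21)]
inertias = {k: kmeans_inertia(pts, k) for k in (1, 2, 3)}
```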


I had not originally thought about this, but given the greater interest in this script vs the others I have posted, it is certainly something I can look into. I suspect it would be a good idea to gain an understanding of various methods to determine this number automatically (and what their flaws might be)!


I have had a lot of questions about how to determine the number of regions in tissues. I am still not convinced this has a good answer; it ultimately depends on the question you are asking.

Despite the vagueness of this question, I still think it is important enough to have some options. Currently I have Davies-Bouldin, Calinski-Harabasz, and the gap function in my UI. It would be nice to eventually add some form of comparison tool to look at the tissue with a few different options for the numbers of regions.
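For reference, the Davies-Bouldin index is simple enough to sketch in a few lines of pure Python (the toy points and labelings here are hypothetical); lower values indicate tighter, better-separated clusters:

```python
def davies_bouldin(points, labels):
    """Davies-Bouldin index: for each cluster, scatter = mean distance of
    members to the centroid; the index averages, over clusters, the worst
    (scatter_i + scatter_j) / centroid_distance ratio. Lower is better."""
    clusters = sorted(set(labels))
    cents, scatters = [], []
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        cent = tuple(sum(v) / len(members) for v in zip(*members))
        cents.append(cent)
        scatters.append(sum(dist(p, cent) for p in members) / len(members))
    total = 0.0
    for i in range(len(clusters)):
        total += max((scatters[i] + scatters[j]) / dist(cents[i], cents[j])
                     for j in range(len(clusters)) if j != i)
    return total / len(clusters)

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Same toy points, two candidate labelings: the well-separated one scores lower.
pts = [(0, 0), (1, 0), (10, 10), (11, 10)]
good = davies_bouldin(pts, [0, 0, 1, 1])
bad = davies_bouldin(pts, [0, 1, 0, 1])
```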

I think ultimately the number of regions you go with will depend on how interpretable they are, i.e. if you define 100 regions that will be difficult to understand. I think it is also important to have a way to combine and annotate regions.

Another thing to consider is how much of the local environment you sample. You can end up with fairly different regions: you can miss things if you make your neighborhoods too big, or end up with way too much information if you make them too small.

I talk about some of this on the CytoMAP Wiki: https://gitlab.com/gernerlab/cytomap/-/wikis/Classifying-Neighborhoods-into-Regions


Hi @Mike_Nelson, thanks for sharing the script and demo, amazing analysis indeed. I do have a question about the Spatial analysis/Delaunay cluster. After you run the spatial analysis, it seems that your detected cells change their annotation color accordingly; I, however, couldn’t see any change in the result after pressing “Run”. Can you show me how you did that?



@minhtran1309, to visualize the clusters, you need to use the Measurement Maps that are shown in most of the images. You can get to them here:

Also, the clusters will only be as meaningful as the classifications that are fed into the analysis!


Thanks for this - I also had a play with the zip file you provided (great idea - thanks). I have a very basic question - I’d like to show greater numbers of clusters. I used the ‘measurement maps’ tool to display them, which is great but only has a limited number of distinguishable colours. Can I modify the LUT/output in QuPath to be able to display a larger number of clusters as distinct colours?


Hi @kap

As far as I know, the best way to handle that would be making a custom lookup table.

I have made a few of these, and it is fairly easy to interpolate between two or three colors, though making one where all of the colors were different could be very time consuming. If you find a collection of colormaps that can be easily converted into a QuPath friendly format, please let me know!

Come to think of it, if you really wanted to, you could duplicate the image and data (right click on the image in the Project tab), and then reclassify the cells based on only their Cluster value. You could then control the color of each cluster manually through the Annotations tab (where you can adjust the classification colors).
This script should work, but realize that it will OVERWRITE your classifications. So best to do this on a duplicate image.

getCellObjects().each {
    cluster = measurement(it, "K-Means Clusters")
    it.setPathClass(getPathClass("Cluster " + (int)cluster))
}
print "Cells reclassified to cluster value"

Second step: the context menu in the Annotations tab.


Thanks - that looks great. I’ll give it a try!