'Cluster My Data' plugin (including HDBSCAN, DBSCAN, and more)

Hi all,

I’ve spent a few days creating a data clustering plugin for ImageJ, which some of you might find useful. It uses an awesome Java library, ‘clust4j’ (GitHub repo), made by Taylor G Smith (and distributed under an Apache licence!). Even if my plugin doesn’t interest you, his library might find a place in your own projects.

My plugin ‘Cluster My Data’ (I’m not the best at naming these) takes data from the results table generated by the default scatter plot (the table produced with the ‘List’ button on the bottom frame of the plot) and applies a specified clustering algorithm to it, before generating a new plot that is coloured by cluster.
For example, take this UMAP plot of MNIST digits 0 to 3, which is already clustered by ground truth:

The adjacent table, entitled ‘Plot Values’, can be generated using the ‘List’ command.

With this table open, the ‘Cluster My Data’ plugin can be called. If the data is already clustered, as above, the plugin first concatenates the data groups into a single dataset. In case it isn’t obvious, the plugin will work on any table entitled ‘Plot Values’, so long as the data it contains is 2-dimensional (e.g. X, Y values). Clustering is especially useful for multi-dimensional data that has been reduced to two dimensions (for example with PCA, t-SNE, or UMAP). NOTE: I’ve also recently created a UMAP plugin for ImageJ, but it isn’t quite ready to share.
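As a rough illustration of the concatenation step, here is a minimal Java sketch. It is not the plugin’s actual code. It assumes (my assumption, not something stated above) that the table columns arrive as X/Y pairs, one pair per group, and that shorter groups are padded with NaN. The class name `ConcatGroupsSketch` and the method `concat` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ConcatGroupsSketch {
    /**
     * Merges grouped columns into one list of 2D points.
     * Assumed layout: columns[0] = X of group 0, columns[1] = Y of group 0,
     * columns[2] = X of group 1, and so on. NaN entries are treated as padding.
     */
    static double[][] concat(double[][] columns) {
        if (columns.length % 2 != 0) {
            throw new IllegalArgumentException("expected X/Y column pairs");
        }
        List<double[]> points = new ArrayList<>();
        for (int c = 0; c < columns.length; c += 2) {
            double[] xs = columns[c];
            double[] ys = columns[c + 1];
            int rows = Math.min(xs.length, ys.length);
            for (int r = 0; r < rows; r++) {
                // Skip padding rows so that only real data points are kept.
                if (Double.isNaN(xs[r]) || Double.isNaN(ys[r])) continue;
                points.add(new double[] {xs[r], ys[r]});
            }
        }
        return points.toArray(new double[0][]);
    }
}
```

The result is a single un-grouped array of (x, y) points, which is the form a clustering routine would typically expect.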

‘Cluster My Data’ can be called from a macro with:

//without specifying a clustering algorithm, the plotted data will be un-clustered
run("Cluster My Data");

which, in the above case, will actually remove the pre-existing clusters:

However, with:

//HDBSCAN assigning optional parameters 'min cluster size' and 'min points'
run("Cluster My Data", "hdbscan min_clus_size=40 min_points=20");

the following plot is generated, from which all clustered datapoints can be retrieved:

NOTE: the plugin works on both pre-clustered and un-clustered data.

The plugin can also be called from the drop-down menus in the GUI via Plugins>Cluster My Data>…
A dialogue box will then appear asking for the optional clustering parameters. For instance, the following is presented when ‘K-Means’ is selected from the drop-down:

Currently, I have only included the following clustering algorithms, and their optional parameters, in the plugin:

  • K-Means run("Cluster My Data", "k_means=4");
  • DBSCAN run("Cluster My Data", "dbscan epsilon=0.4 min_points=10");
  • HDBSCAN run("Cluster My Data", "hdbscan min_clus_size=40 min_points=20");
    (the parameter values shown above are arbitrary examples)
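
To make the DBSCAN parameters above a little more concrete, here is a minimal, self-contained Java sketch of the textbook DBSCAN procedure on 2D points. It is not clust4j’s implementation and not the plugin’s code; the class name `DbscanSketch` and its methods are hypothetical. In this sketch `epsilon` is the neighbourhood radius, and `minPoints` is the number of neighbours (counting the point itself, which is an assumption of this sketch) needed for a point to be treated as a core point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class DbscanSketch {
    static final int UNVISITED = -2;
    static final int NOISE = -1;

    /** Returns one label per point: 0, 1, 2, ... for clusters, NOISE (-1) for outliers. */
    static int[] cluster(double[][] pts, double epsilon, int minPoints) {
        int[] labels = new int[pts.length];
        Arrays.fill(labels, UNVISITED);
        int nextCluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> seeds = neighbours(pts, i, epsilon);
            if (seeds.size() < minPoints) {
                labels[i] = NOISE; // may still be claimed later as a border point
                continue;
            }
            // Point i is a core point: start a new cluster and grow it outwards.
            labels[i] = nextCluster;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) labels[j] = nextCluster; // border point, not expanded
                if (labels[j] != UNVISITED) continue;            // already assigned
                labels[j] = nextCluster;
                List<Integer> more = neighbours(pts, j, epsilon);
                if (more.size() >= minPoints) queue.addAll(more); // j is also a core point
            }
            nextCluster++;
        }
        return labels;
    }

    /** Indices of all points within epsilon of point i (including i itself). */
    static List<Integer> neighbours(double[][] pts, int i, double epsilon) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            double dx = pts[i][0] - pts[j][0];
            double dy = pts[i][1] - pts[j][1];
            if (dx * dx + dy * dy <= epsilon * epsilon) out.add(j);
        }
        return out;
    }
}
```

Raising `epsilon` merges nearby groups into fewer, larger clusters, while raising `minPoints` pushes more sparse points into the noise label. The neighbour search here is brute force (quadratic in the number of points), which is fine for a sketch but not for large plots.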

But Taylor’s library additionally offers these:

  • k-medoids
  • affinity propagation
  • hierarchical agglomerative
  • mean shift
  • Nearest Neighbours
  • Radius Neighbours
  • Nearest Centroid

All of these could easily be added to the plugin with a little more time.

My plugin can also optionally output the results to a .csv file, or colour-label the datapoints using a user-supplied .csv file of ordered labels. However, I didn’t spend much time optimising these two functions, as they are secondary to the main function of the plugin.
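
For anyone curious what these two secondary functions might look like, here is a minimal Java sketch. The column layout (`x,y,cluster`), the class name `ClusterCsvSketch`, and both method names are my own assumptions for illustration; the plugin’s real file format may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ClusterCsvSketch {
    /** Builds CSV text: a header row, then one "x,y,cluster" row per point. */
    static String toCsv(double[][] pts, int[] labels) {
        if (pts.length != labels.length) {
            throw new IllegalArgumentException("need exactly one label per point");
        }
        StringBuilder sb = new StringBuilder("x,y,cluster\n");
        for (int i = 0; i < pts.length; i++) {
            // Locale.ROOT keeps '.' as the decimal separator on every system.
            sb.append(String.format(Locale.ROOT, "%.4f,%.4f,%d\n",
                    pts[i][0], pts[i][1], labels[i]));
        }
        return sb.toString();
    }

    /**
     * Maps ordered text labels (one per datapoint, in plot order) to integer
     * ids, numbered by first appearance, so each id can be given a colour.
     */
    static int[] labelsToIds(List<String> orderedLabels) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int[] out = new int[orderedLabels.size()];
        for (int i = 0; i < out.length; i++) {
            String key = orderedLabels.get(i).trim();
            Integer id = ids.get(key);
            if (id == null) {
                id = ids.size();
                ids.put(key, id);
            }
            out[i] = id;
        }
        return out;
    }
}
```

The important constraint for the label file is ordering: the n-th label has to correspond to the n-th datapoint in the table, otherwise the colours end up on the wrong points.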

As this is still in development, I haven’t requested an update site for the plugin, so feel free to access it via this Google Drive link:
Google Drive link
I have also added two test scatter plots to that folder to play with.
To install, just copy ‘Cluster_My_Data-0.0.1.jar’ into the ImageJ plugins folder.

FINAL NOTE: I noticed that HDBSCAN will sometimes fail. I think this happens with smaller datasets, and it may be related to which sub-implementation the library chooses. Taylor does state that his library may not be 100% ready for production environments, and the problem wasn’t frequent enough for me to want to remove HDBSCAN from the plugin.

Apologies for the long post. I’m also sorry if another awesome implementation of this kind of clustering already exists for ImageJ or Fiji that I just wasn’t aware of.

Kind regards.

*EDIT: multiple edits for spelling, and I can’t get those pics to line up.