Cell detection from command line / script slower than when using GUI

[ This was originally thought to be a vCPU issue but is actually a command line / headless issue… leaving original text in place]

I am experimenting with using QuPath on a virtualized PC instance in the cloud, rather than using local hardware, and I am running into a specific performance issue.

Running QuPath on an Ubuntu vPC works…

I can run QuPath 0.2.1 in the cloud using the standard Linux distribution. In this specific case I’m running a script to:

  • use watershed cell detection
  • classify images using a legacy classifier

but it is very slow.

This works on a range of EC2 instances in AWS. It just doesn’t run fast - in fact it’s about 10x slower than I would expect.

The hardware I run on locally (a 2016 MacBook Pro with an NVMe SSD, 16GB RAM, and a 4-core processor) is in many ways inferior to the EC2 instances I am trying, but it still takes only about 20s per tile when detecting cells.

The EC2 instances (I’ve tried 8/16/32 vCPUs and 32 to 64GB of RAM) take 2 minutes, and sometimes 3 minutes, per tile. This means my Mac can process a 1GB image in about 2 minutes, while the cloud PC takes 25-30 minutes.

Some data about timing

The only clue I have that something is off is the elapsed time per tile. On my Mac these values are consistent, usually falling between 20-40s. On AWS, however, the tiles appear to run in groups: the reported time per tile keeps climbing until the whole current batch of tiles has been analyzed, and then the timing numbers drop back to a reasonable amount:

(in reality this was run on a 16-vCPU machine but I’m trimming some of them to show the pattern)

21:29:19.569 [plugin-runner-0-6] [INFO ] ...  1296 nuclei detected (processing time: 34.07 seconds)
21:29:35.026 [plugin-runner-0-4] [INFO ]  ...   1326 nuclei detected (processing time: 49.53 seconds)
21:29:50.001 [plugin-runner-0-10] [INFO ]  ...   6 nuclei detected (processing time: 64.50 seconds)
21:30:09.411 [plugin-runner-0-15] [INFO ] ...   1148 nuclei detected (processing time: 83.91 seconds)
--- deleted some lines here ---
21:33:21.787 [plugin-runner-0-3] [INFO ] ...   1616 nuclei detected (processing time: 276.29 seconds)
21:33:38.596 [plugin-runner-0-16] [INFO ]  ...   840 nuclei detected (processing time: 293.10 seconds)
21:33:56.506 [plugin-runner-0-3] [INFO ]  ...  1585 nuclei detected (processing time: 34.72 seconds)
21:34:13.951 [plugin-runner-0-11] [INFO ] ...   1618 nuclei detected (processing time: 72.54 seconds)
21:34:31.075 [plugin-runner-0-2] [INFO ] ...  2610 nuclei detected (processing time: 103.80 seconds)
21:34:47.837 [plugin-runner-0-8] [INFO ]  ...   1817 nuclei detected (processing time: 139.09 seconds)
--- deleted more ---
21:37:37.295 [plugin-runner-0-4] [INFO ]  ...   874 nuclei detected (processing time: 482.27 seconds)
21:37:53.916 [plugin-runner-0-6] [INFO ]  ...   971 nuclei detected (processing time: 514.35 seconds)
21:38:12.347 [plugin-runner-0-4] [INFO ]  ...   1947 nuclei detected (processing time: 35.05 seconds)
21:38:28.143 [plugin-runner-0-10] [INFO ]  ...   1102 nuclei detected (processing time: 67.85 seconds)

This would imply that I’ve either misconfigured multi-processing that should otherwise be working, or that the lack of a GUI is preventing something from being set?

Since cell detection is one of the things I know is parallelizable in QuPath (or appears to be, based on the GUI output), has anyone figured out how to get this working well in the cloud?


There are many ways things can be slower in the cloud. A common one is data access: if you don’t use directly attached storage, reads and writes go over the network, and although latency should be low, this will typically be slower than locally attached storage. So you might want to have a look at how you’re accessing your data. Also check CPU usage; there may be another slow process running due to how your system is configured. And although it’s less likely with large instances, you might have run out of ECU, in which case the hypervisor diverts some CPU resources to other instances (because this is shared hardware).
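
If it helps, a crude way to sanity-check storage throughput from a Groovy script might look something like this (just a sketch; the path is a placeholder for one of your own image files):

// Time a straight sequential read of a large file to estimate raw storage throughput
def f = new File('/path/to/image.tif')   // placeholder: point this at one of your images
def buf = new byte[8 * 1024 * 1024]
long total = 0
def t0 = System.nanoTime()
f.withInputStream { input ->
    int n
    while ((n = input.read(buf)) > 0)
        total += n
}
def secs = (System.nanoTime() - t0) / 1e9
println String.format('Read %.1f MB in %.2f s (%.1f MB/s)', total / 1e6, secs, total / 1e6 / secs)

If that number is dramatically lower on the EC2 instance than on your MacBook, slow storage is a likely suspect.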


Thanks @jkh1 for the insights!

I’ve no experience of using QuPath in the cloud, and not much cloud experience at all, so I can only say how things are implemented in QuPath currently… basically, parallelization is via standard Java Executors, and the number of parallel threads is initialized based upon whatever Runtime.getRuntime().availableProcessors() returns here (unless the user has requested something else through the UI).
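
For reference, here’s a minimal sketch (not QuPath’s actual initialization code) of how that kind of default is typically derived and used with a standard Java executor:

import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Default worker count based on the JVM's view of the available processors
int nThreads = Runtime.getRuntime().availableProcessors()
def pool = Executors.newFixedThreadPool(nThreads)
println "Using ${nThreads} worker threads"

// Submit a few dummy 'tile' tasks, then wait for them all to finish
(1..nThreads).each { i ->
    pool.execute { println "Task ${i} on ${Thread.currentThread().name}" }
}
pool.shutdown()
pool.awaitTermination(1, TimeUnit.MINUTES)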

Image tiles are also cached locally in a Java map, and all requested image regions are broken up into requests for individual tiles… so processing a single region can require multiple reads from the original image source (to get all the tiles necessary for that region). But those same tiles can overlap with neighbouring regions that are being processed, so processing the neighbouring regions can appear faster because they require fewer ‘new’ tile requests (the relevant tiles are already cached). Very slow tile reading might therefore explain some of the weird timing you are seeing.
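
Purely to illustrate that caching pattern (the names here are hypothetical, not QuPath’s actual classes):

import java.util.concurrent.ConcurrentHashMap

def tileCache = new ConcurrentHashMap<String, byte[]>()

// Stand-in for a (possibly slow) read from the original image source
def readTileFromSource = { String key ->
    println "Reading ${key} from the image source..."
    new byte[256 * 256]
}

// A region request becomes a set of per-tile requests; tiles already fetched for a
// neighbouring region come straight from the cache, so that region appears faster
def readRegion = { List<String> tileKeys ->
    tileKeys.collect { key -> tileCache.computeIfAbsent(key, readTileFromSource) }
}

readRegion(['tile_0_0', 'tile_0_1'])   // both tiles read from the source
readRegion(['tile_0_1', 'tile_0_2'])   // 'tile_0_1' is now served from the cache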

At this point the main focus is on QuPath as a desktop application, but I’d be interested in discussing if there are any ways to make it more cloud-friendly without introducing a huge amount more complexity.

Something else that occurred to me: try reducing the number of threads to see if performance improves. What can happen is that the overhead of context switches and system calls becomes too high, or the program and/or the system gets ‘confused’ by too many of them in a short loop.


Thanks both @jkh1 and @petebankhead for the insights. I’m not much of a Java programmer but these are very helpful starting points as I poke around. I’ll report back on this thread if I can make any significant improvements or garner more info!

Just as info for future readers of this thread:

From Amazon.com:
How can I improve the CPU performance of my Amazon EC2 Linux instances?

Still investigating this but it turns out this is actually a headless / non-GUI issue instead of a VPC issue.

After profiling more extensively on my MacBook, I noticed exactly the same behavior when running a script from the command line: what logged as 128s of work in the script editor in the GUI turned into 540s of work when I ran the same script from the command line, and I see the same ‘stair-stepping’ behavior where each tile in a batch takes longer than the last, and then the timings ‘reset’ back to an amount comparable to the GUI.

I’m going to confirm this by running an Ubuntu instance with a GUI in AWS, running the script and seeing how the performance is. Because the larger instances can talk directly to hardware, I’m expecting to see improved behavior in the cloud when run that way.

That still means I can’t automate the cell detection yet, but I’m sure we can track down the necessary change. It seems like something is waiting for a timeout until the whole group of Executor tasks has completed.

Ah, interesting… are you able to give more details on what you’re running from the command line, and ideally a minimal script that shows this behavior?

Sure, no problem.

First I verified that when I run the GUI version of QuPath in Ubuntu on an AWS EC2 instance, I see performance comparable to running it locally on a desktop or laptop… detecting cells in a 1GB image took about 3.5 minutes with 16 processors.

The script I used is essentially the following:


// Note: I didn't import both of these in the same script, but tried both and it did not
// seem to have an impact. I'm assuming that for command-line work I'd import just the
// lib.scripting.QP one.

import qupath.lib.scripting.QP
//import static qupath.lib.gui.scripting.QP.*
 
setPixelSizeMicrons(0.25,0.25)

setImageType('BRIGHTFIELD_H_E');

setColorDeconvolutionStains('{"Name" : "H&E default", "Stain 1" : "Hematoxylin", "Values 1" : "0.65111 0.70119 0.29049 ", "Stain 2" : "Eosin", "Values 2" : "0.2159 0.8012 0.5581 ", "Background" : " 255 255 255 "}');

createSelectAllObject(true);

runPlugin('qupath.imagej.detect.cells.WatershedCellDetection', '{"detectionImageBrightfield": "Optical density sum",  "requestedPixelSizeMicrons": 0.25,  "backgroundRadiusMicrons": 6.0,  "medianRadiusMicrons": 0.75,  "sigmaMicrons": 0.625,  "minAreaMicrons": 10.0,  "maxAreaMicrons": 150.0,  "threshold": 0.2,  "maxBackground": 0.0,  "watershedPostProcess": true,  "cellExpansionMicrons": 1.5,  "includeNuclei": true,  "smoothBoundaries": true,  "makeMeasurements": true}');
 

Thanks, I might have been able to reproduce this, but I’m not certain… I wasn’t seeing such huge differences in performance, but I did see some.

In any case, I noticed two things that weren’t good. I’ve addressed them and given links to more info through this pull request: https://github.com/qupath/qupath/pull/563

If you’re able to build QuPath from source (using the link under the pull request, not the master branch of the repo) I’d be really keen to know if it helps at all. I plan to include the changes in v0.2.2 – probably released in the next day or two, unless some problem emerges – so any feedback you can give would be very helpful.

Those changes should help if image tile reading is slow and that is the source of the performance problem. That’s my best guess.

However, one mysterious thing I noticed is that performance seems to be better if I run QuPath without using the main launcher.

What I mean is that, on my Mac, this seems the more ‘normal’ way to run from the command line:

/Applications/QuPath-0.2.1.app/Contents/MacOS/QuPath-0.2.1 script /path/to/script.groovy -p /path/to/project.qpproj

However, I find that this runs about 25% faster:

cd /Applications/QuPath-0.2.1.app/Contents/app/

java -jar /Applications/QuPath-0.2.1.app/Contents/app/qupath-0.2.1.jar script /path/to/script.groovy -p /path/to/project.qpproj

(Note that the current directory is important in the second case so that the native libraries are found.)

My only guess for the discrepancy is that there is some horrific overhead involved in passing log messages back in the first case, but I don’t know if that’s actually what’s happening.

Thanks for looking into this, I will try your pull request today, and then I can run some images of different sizes to see if I can notice a difference.

Good to know about the launcher vs. launching via the java -jar command directly. I had been doing that already since I was running another patch of yours, so I’m glad I already had that optimization!

C.

This PR improves things substantially - there is still an issue somewhere, but now it appears to be off by < 2x instead of 10x.

Here’s the command I used to run the script:

java -jar -XX:MaxRAMPercentage=80 qupath-0.2.2-SNAPSHOT.jar script /SCRIPT_FILENAME.groovy --image /IMAGE_FILENAME.tif

This gave me about 12.5GB of RAM to work with on my MacBook. Tile cache percentage seems to be auto-set at 25%. There were 88 tiles in the ~930MB image.
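
(As a quick sanity check, nothing QuPath-specific, this prints the heap the JVM actually received from that flag:)

// Report the maximum heap available to the JVM (plain Java API)
def maxBytes = Runtime.getRuntime().maxMemory()
println String.format('Max heap: %.1f GB', maxBytes / (1024d * 1024d * 1024d))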

Script result (0.2.2 snapshot)

Mean: 34.91 s
Median: 21.35 s
Outliers: 131.69, 183.64, 241.28, 309.13, 369.65 s
Total: 387.40 s

GUI result (0.2.1)

Mean: 20.55 s
Median: 20.82 s
Outliers: none above 25 s
Total: 229.89 s

Not bad! I didn’t look at how the Executors are set up and how they call back to the GUI when the batch finishes, but the fact that each outlier is about 60 seconds longer than the previous one makes me wonder if there’s a pattern there. They didn’t fall on exact boundaries (e.g. 16, 32, 48, etc.) but were definitely spaced out as if there were one outlier per ‘set’ of tiles. Without those outliers, performance would be near identical.

Related to this, something that would be worth trying is tweaking the tile cache percentage. When I run from the command line it does not seem to respect the Java preferences at ~/Library/Preferences/com.apple.java.util.prefs.plist. I’m searching online to see if I can set that preference from the command line, but if you know how to ensure I’m setting that property, I can verify an apples-to-apples comparison.
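
For what it’s worth, here’s a rough sketch of reading/writing a value through the Java Preferences API from a Groovy script; the node path and key name below are just placeholders, since I’d still need to confirm the ones QuPath actually uses from its source:

import java.util.prefs.Preferences

// Node path and key are hypothetical placeholders, not confirmed QuPath names
def prefs = Preferences.userRoot().node('io/github/qupath')
println 'tileCachePercentage = ' + prefs.getDouble('tileCachePercentage', 25.0d)
prefs.putDouble('tileCachePercentage', 25.0d)
prefs.flush()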

Even as it is, though, this gets scripting on AWS closer to the price/performance I would expect… this is a huge help, thanks!


Hmmm, is it an RGB TIFF image, and do you know whether it is being read by Bio-Formats, OpenSlide, or something else?

If you can profile it with VisualVM that may help a lot.

It is in fact an RGB TIFF, and according to the GUI it is being read by Bio-Formats. I believe the pyramid is already in the file.

I’m new to VisualVM but I’ve downloaded it and I’m going to get up to speed on it as best I can. If I can send you some useful info from there I’ll do so asap.

Thanks again-

Here are a couple of quick graphs from VisualVM. I added red dots where the ‘spikes’ appeared in the log. If there are specific things I can do from the VisualVM app to profile for you, let me know!


Under Sampler you can try CPU sampling, which can reveal where most of the time is spent. There are various views that can help identify hotspots and then trace back to the methods where they are called.

OK, I found a couple of interesting points using VisualVM:

  1. Running the same image with the same memory settings but different processor counts shows differing results (times in seconds):

   CPUs  |    CLI   |   GUI   
----------------------------------
     8   |   343    |   199
     7   |          |   208
     6   |   311    |   212
     4   |   278    |   259
     3   |   295    |   289
     2   |   368    |      
     1   |   641    |      

  2. When I look at VisualVM, this is what I see:

On the CLI runs, it appears as if the first plugin-runner hogs the work and prevents the other runners from catching up and contributing equally.

On the GUI runs, all the runners start in parallel (give or take a few milliseconds) and keep pace with one another.

I wouldn’t even begin to know why this differs between a GUI run and a CLI run, but that looks like where the imbalance is occurring. I’m happy to dive into the threads if you’re looking for a specific method timing.
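
For example, here’s a quick way I can list the plugin-runner threads and their states from within a running script (standard Java API only, nothing QuPath-specific):

// Dump the plugin-runner threads and their current states
def runners = Thread.getAllStackTraces().keySet().findAll { it.name.contains('plugin-runner') }
runners.each { println "${it.name}: ${it.state}" }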

Thanks,
C.

Intriguing… just to check: running from the GUI is via the script editor with Run → Run, right?
If so, does it differ much if you use Run → Run for project via the GUI instead (choosing the same image)?

Also, are you doing all the VisualVM tests locally…? At this point, I’d like to try to make sure any headless / non-GUI issue is resolved – separately from any potential VPC complications.

This is the view I used in VisualVM yesterday to explore why resolving the hierarchy was slow.

It looks like the biggest difference I can spot is the time the CLI takes in qupath.imagej.tools.IJTools.convertToImagePlus().

Here are a couple of screenshot collages:

CLI

GUI

Hope this helps,
C.


It looks like this is calling ImageJ under the hood. This reminds me of the pain of trying to run ImageJ in headless mode: even with the headless option, there still seemed to be dependencies on GUI elements. I am wondering if the problem could be caused by ImageJ waiting for a GUI component somewhere under the hood. Just my 2c.

@jkh1 this is my current assumption as well. When I took a quick look here https://imagej.net/Scripting_Headless, it states that it runs each call as a one-off script; I wonder if it can’t take advantage of something that is shared in the GUI.
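
One quick, generic check I can run from the script, using only the standard AWT API, is whether the JVM actually considers itself headless:

// Report whether the JVM is running in headless mode
println 'Headless: ' + java.awt.GraphicsEnvironment.isHeadless()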