Performance bottlenecks?

Hello all. Related to some previous posts about garbage collection behavior in Micro-Manager, I’ve been trying to narrow down what the performance-limiting factor might be in high-framerate acquisitions.

I’ve run tests using recent nightlies of both 1.4 and 2.0g and see a similar limitation: with Hamamatsu Fusion cameras, any exposure shorter than 8.5 ms (fast readout, 0 ms interval, continuous acquisition with the internal trigger source) results in frames piling up in the sequence buffer.

These tests were run on two different workstations: one with dual Xeon Gold 5217s and one with a brand-new AMD Threadripper Pro 3955WX. In both cases I tested acquiring purely to memory and acquiring to fast NVMe SSD RAID arrays, with no difference.

This limitation persists with the default parallel GC (although frames pile up in bigger bursts, presumably when GC is triggered) and with G1GC tuned to favor more frequent garbage collection and shorter stop-the-world pauses.

This also seems to be only weakly dependent on the size of each frame, as the ~8.5 ms threshold holds for both the full chip (2304 x 2304) and images cropped to 400 rows. Acquiring from two cameras simultaneously also doesn’t affect this behavior.

What seems odd is that at full chip, shorter exposure times shouldn’t push the camera’s framerate beyond 100 FPS, yet frames still pile up in the sequence buffer with 5 ms exposures.

Has anyone else with similar cameras (I imagine the Flash 4.0 is the most common) run into similar issues trying to operate at high-ish framerates? Where lies the bottleneck? Is it (as was suggested a while ago by either @nicost or @jondaniels) metadata handling while emptying the sequence buffer? It seems to be something that isn’t addressed by a significantly faster CPU (the Threadripper Pro) with faster memory.
Edit: The fact that this threshold seems to be the same whether running one camera or two also argues against that explanation, at least naively, since I’d imagine there are twice as many metadata-related operations when running an acquisition with two cameras.

I also imagine that there are folks out there running significantly faster cameras; how do those setups behave? Do you just balloon the sequence buffer to absorb the pile-up, or is there some configuration where you can continuously acquire without hitting the bottleneck that I seem to be encountering?
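For anyone finding this later: ballooning the sequence buffer can be done either via the buffer-size setting under Tools > Options or from the script panel. A minimal Beanshell sketch, assuming the standard Core calls and an arbitrary example size of 8192 MB:

mmc.setCircularBufferMemoryFootprint(8192); // example size in MB; adjust to your available RAM
mm.scripter().message("Sequence buffer capacity: " + mmc.getBufferTotalCapacity() + " frames");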


We worked with @henrypinkard a lot on this for Pycro-manager using a Fusion BT and CoaXPress card set up the same way that yours is (see this issue).

We had reached a place where our setup would acquire 256x2304 or 512x2304 (cropping to achieve faster readout) at 1.5 ms exposures using a reasonably fast SSD and not fill the buffer. However, we discovered this past week that with recent updates to Micro-manager 2.0 and Pycro-manager, we can no longer acquire images this fast and now we are starting to fill the buffer again. We haven’t had time to track down which update caused the regression.

In all of our testing, we saw similar behavior: the rate of imaging, not (rate of imaging * size of image), was the relevant parameter. Both we and @ieivanov saw similar behavior with the demo camera and real cameras. Often, everything would be going fine, then there would be a spike in CPU usage and the memory would start to fill up.

Unfortunately, I don’t have any answers - other than to say that we are struggling with this problem as well. Currently, we are using a huge buffer so that we can continuously acquire. It would be awesome to get back to our previous situation where this wasn’t a problem.

Thanks Doug, I’ve been seeing this since at least MM 1.4, going back to 2017/2018, even when acquiring to a 5-disk SSD RAID0, so I’m intrigued to hear that you were able to go a fair bit faster more recently. The spike in CPU usage / memory filling up sounds a lot like the default garbage collection behavior when running Micro-Manager with a default Java environment. One thing you might try is to launch Micro-Manager with command-line Java parameters; we execute something like this:
java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar ij.jar

There are also flags that allow the JVM to use more threads for concurrent and parallel garbage collection and to initiate garbage collection at lower heap occupancy levels. I find that this causes more frequent blips in CPU usage as garbage collection is initiated more frequently, but the blips are significantly shorter and don’t cause runaway frames to pile up afterwards. I’d be curious to hear if that changes things for you at all or if it’s just snake oil.
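For reference, a fuller launch line along those lines might look like the following; the specific thread counts, pause target, and occupancy threshold are illustrative starting points rather than values we have validated:

java -Xmx32g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 -XX:InitiatingHeapOccupancyPercent=30 -jar ij.jar

InitiatingHeapOccupancyPercent makes G1 start its concurrent cycles at lower heap occupancy, and the two thread-count flags give the parallel and concurrent phases more workers, which is what produces the more frequent but shorter blips described above.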

One thing I’ve noticed is that when running acquisitions with the diSPIM plugin, we can successfully use shorter exposure times than with acquisitions initiated from MDA. We routinely go as low as 2.5 ms, although not continuously, at average framerates of up to 140 FPS. Attempting to go that fast in MDA does result in frames piling up. I haven’t done the digging yet to see what may be substantially different between how @jondaniels’ code sets up acquisitions vs MDA.

Based on your previous posts on the listserv and here, we tried a bunch of different GC options, which didn’t end up making a huge difference for these high-speed, continuous acquisitions. The main thing that made a difference was some internal changes that Henry made to how NDTiff stores data. As long as we stayed under some combination of object-generation rate and acquisition length that we unfortunately couldn’t really pin down in terms of understandable parameters, we could write at >1.0 GB/s sustained. All of the details are in that github thread, as well as some flags that Henry added to Pycro-manager to generate useful diagnostics.

Also - my understanding is that G1GC is now the default garbage collector for MM 2.0.

Thanks for filling me in! I haven’t been following the pycromanager threads as closely as I should have apparently.

You can use the following script to test the effect of metadata on the speed with which the circular buffer is handled:

nrImages = 1000;
mmc.startSequenceAcquisition(nrImages, 0, true);
start = java.lang.System.currentTimeMillis();
while (mmc.isSequenceRunning() || mmc.getRemainingImageCount() > 0) {
   if (mmc.getRemainingImageCount() > 0) {
      // tagged = mmc.popNextTaggedImage();
      // test difference with this one:
      img = mmc.popNextImage();
   }
}
end = java.lang.System.currentTimeMillis();
duration = end - start;
mm.scripter().message("It took: " + duration + " ms.");

To my surprise, I do not see a difference between popNextTaggedImage and popNextImage when using the demo camera with 2 ms exposures. It does become a factor when setting the exposure time to 0.1 ms and using the Fast Image option for the demo camera. Processing of the metadata seems to take ~0.3 ms per image. On larger systems (i.e. with more properties), this may increase.
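In case anyone wants to reproduce that comparison, a rough sketch of the setup is below. It assumes the default demo configuration, where the camera's device label is "Camera", and the DemoCamera property names shown; check them in the Device Property Browser:

// Assumes default demo config: device label "Camera", DemoCamera "FastImage" property.
// Baseline: normally simulated images at 2 ms exposure.
mmc.setProperty("Camera", "FastImage", "0");
mmc.setExposure(2.0);
// Then re-run the timing script with the fast-image path and a very short exposure:
// mmc.setProperty("Camera", "FastImage", "1");
// mmc.setExposure(0.1);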

Hi @nicost! Thanks for chiming in. Interestingly, what Henry and I found with Pycro-manager was that emptying the circular buffer was not causing the issue, but the file saving was (I think I am fairly summarizing the results of the github issue I linked).

I often saw differences in speed and sustained throughput between Micro-manager and Pycro-manager when running a similar script to what you have above, but also trying to save the files (using flags Henry created to turn off metadata, file saving, etc…). Until Henry rewrote NDTiff to create v2.0, Micro-manager was always faster in all cases. After NDTiff 2.0, Pycro-manager started consistently outperforming Micro-manager when saving the data.

I don’t understand enough about the nitty gritty of Java to know why this is occurring. We’ve spent time tuning Python writing and acquisition (in a non-MM/PM solution) and I know that it gets complicated when trying to run super fast!

As always, your knowledge and all of your hard work are appreciated!!!

I’m looking forward to trying this when I get in tomorrow!

Doug, regarding your point about file saving, what happens when you acquire to an in-memory data store only? If I leave “save images while acquiring” unchecked I get the same behavior (short exposures cause frames to pile up in the sequence buffer), which seems to point to an issue with emptying the buffer fast enough to keep up.

Now that I type this I realize that I haven’t actually confirmed whether this also occurs on our Threadripper platform so I should probably also give that a shot to confirm.

From the testing, it seemed like there are two separate, but related issues going on:

  1. Something triggers a hiccup in pulling images out of the circular buffer, and it starts filling with images.
  2. Something different causes a hiccup in writing, which then eventually makes its way back to the circular buffer.

We haven’t tested going straight to RAM in a while. We actually just received a Threadripper Pro machine today (hah!) and will be setting it up to control the instrument with the Fusion BT/CoaXPress card later this week.

So running this script in 2.0g, I see negligible differences between tagged and untagged images. Curiously, the sequence acquisition set to 1000 frames never (N=5) causes frames to pile up in the buffer, even with exposures as short as 1 ms. Running a 1000-frame MDA with interval = 0, ~60-100 frames end up accumulating in the sequence buffer at peak, both when saving to disk (NVMe RAID0) and when acquiring to memory (Save Images unchecked). Using a shorter exposure (5 ms) also doesn’t cause frames to pile up when running the acquisition from the script, but causes ~200 frames to accumulate in the sequence buffer when run via MDA.

@nicost do you expect there to be a substantive difference in how MDA sets up an acquisition where only “Time Points” is checked, interval = 0, vs running mmc.startSequenceAcquisition(1000, 0, true)?

Yes, the test script only tests the part of the code pulling the images out of the C++ layer into the Java layer. When running an MDA, the image and metadata are inserted into a Datastore, either StorageRAM or StorageMultiPageTiff. In addition, images are being displayed in the viewer. There are many steps on that trajectory that could/should be optimized.

Coincidentally, I am currently working on speeding up the viewer, so that it can keep up with data streams of 200 fps and more. There are some tie-ins with storage, so this work may even help in that respect, but I will certainly have a look for obvious bottlenecks next.

Nevertheless, I think that the script shows that data (even with metadata) can be pumped quite efficiently into the Java layer, so it certainly helps identify where to look for problems.
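To time the next step on that trajectory, inserting into a Datastore without saving or display, the script can be extended along the lines below. This is only a sketch; it assumes the 2.0 scripting calls createRAMDatastore, convertTaggedImage, getCoordsBuilder, copyAtCoords, and putImage, so double-check the exact method names against your build:

// Sketch: time popping from the sequence buffer plus inserting into a RAM Datastore,
// without saving or display.
nrImages = 1000;
store = mm.data().createRAMDatastore();
builder = mm.data().getCoordsBuilder();
mmc.startSequenceAcquisition(nrImages, 0, true);
frame = 0;
start = java.lang.System.currentTimeMillis();
while (mmc.isSequenceRunning() || mmc.getRemainingImageCount() > 0) {
   if (mmc.getRemainingImageCount() > 0) {
      tagged = mmc.popNextTaggedImage();
      img = mm.data().convertTaggedImage(tagged);
      // give each frame its own time point so putImage does not collide on coordinates
      store.putImage(img.copyAtCoords(builder.time(frame).build()));
      frame++;
   }
}
end = java.lang.System.currentTimeMillis();
mm.scripter().message("Pop + putImage took: " + (end - start) + " ms for " + frame + " images.");
store.close();

Comparing that number with the pop-only timing above gives an idea of how much the Datastore insertion itself adds, before display or disk I/O enter the picture.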


Another interesting (and potentially useful) script is the following:

as = mm.acquisitions().getAcquisitionSettings();
as.shouldDisplayImages = false;
start = java.lang.System.currentTimeMillis();
store = mm.acquisitions().runAcquisitionWithSettings(as, true);
end = java.lang.System.currentTimeMillis();
duration = end - start;
mm.scripter().message("It took: " + duration + " ms.");
store.close();

This runs the MDA that you set up through the MDA window, but without a display. I tested this with the demo camera set to 2048x2048, 16 bits per pixel, and played with 8 and 5 ms exposures. In general, there is less pile-up of images in the circular buffer without a display. We aim to have the display be completely asynchronous, so there clearly is work to do (although this may turn out to be quite difficult work).

In addition, MultiPageTiffStorage has less pile-up than RAM storage (on my laptop with an NVMe drive). This suggests inefficiencies in the code doing the RAM storage.

Curious if you see benefits from running the code without a display (and if that is useful at all). I could add that option back to the UI (I think we had it there in the past if I remember correctly). I will see if I can reduce the slowdown caused by the display, and look for improvement in the RAM storage speed.

B.t.w., I think that @henrypinkard uses completely different code paths in his acquisition engine and NDTiffStorage, so these considerations are not directly applicable to that code.

A Pycromanager script for acquiring as fast as possible:

from pycromanager import Acquisition, multi_d_acquisition_events

with Acquisition('/Users/henrypinkard/tmp/', 'testacq', show_display=False) as acq:
  acq.acquire(multi_d_acquisition_events(num_time_points=10000))