Actually, after looking into this a bit more, I’m starting to doubt that disk access is the bottleneck in my analysis. I’m wondering if something is going wrong with the parallelization instead.
I’m currently running RegisterVirtualStackSlices in Fiji (ImageJ 1.51) via the ImageJ-MATLAB interface (v0.7.1) on MATLAB R2016b. I’m using an entire cluster node (2 x 12-core Intel Xeon E5-2680 v3 with hyperthreading disabled, 128 GB of memory). Until now I had been running the exact same job on a desktop workstation (1 x 8-core Intel Xeon E5-1660 v3 with hyperthreading enabled, 64 GB of memory). I’m not using headless mode due to some compatibility issues with the ImageJ-MATLAB interface, so I access the GUI on the cluster via Xming from a Windows 7 desktop machine. Fiji on the cluster uses 24 parallel threads, as shown under Options -> Memory & threads, and 16 threads on the desktop (due to hyperthreading).
Much to my surprise, the runtime is almost identical on the desktop and on the cluster node, even though they run 16 vs. 24 threads and have 8 vs. 24 physical cores. RAM is not an issue on either machine; less than 50 % is in use. At first I thought disk access on the cluster was the bottleneck, since total CPU load is under 10 %, with 1-2 cores at 100 % and the rest idling. However, using a RAM disk at /dev/shm instead of the hard disk had no effect at all. I then noticed that the CPU load on the desktop is similar to the cluster node: a bit over 10 %, with 1-2 cores close to 100 % and the rest essentially idle. Moreover, my TIFF images are not that big (around 900x1400 RGB at 1-2 MB, or 1800x2800 at 5-6 MB, with 260 sections/images), so I’m very doubtful that data transfer could be the bottleneck, even with the RAM-resident virtual disk. In addition, MATLAB’s own image registration functions use around 100 % CPU with the same images on both machines, so it can’t be the disk, right?
So, is there some inherent limitation in the parallelization of RegisterVirtualStackSlices, or in the underlying SIFT feature extraction plugin or bUnwarpJ, that prevents the cluster node from taking advantage of the larger number of cores (and even the desktop from using all 8)? Or do I need to tell the plugin explicitly how many threads to use, beyond the general Fiji “parallel threads” setting? This is pairwise registration, where each pair of consecutive sections is registered independently, so shouldn’t it scale well with more CPUs?
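Just to illustrate what I’d naively expect: if each consecutive pair really is an independent task, then submitting the 259 pairs to a fixed-size thread pool should keep all cores busy. A minimal Java sketch of that pattern (registerPair here is a hypothetical stand-in for the real per-pair SIFT + bUnwarpJ work, just burning CPU; this is not the plugin’s actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PairwiseScaling {

    // Hypothetical stand-in for registering one pair of consecutive
    // sections (the real work would be SIFT extraction + bUnwarpJ).
    static double registerPair(int i) {
        double s = 0;
        for (int k = 0; k < 2_000_000; k++) s += Math.sqrt(k + i);
        return s;
    }

    public static void main(String[] args) throws Exception {
        // How many processors the JVM actually sees on this machine.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("JVM sees " + cores + " processors");

        int nPairs = 259; // 260 sections -> 259 consecutive pairs
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<Double>> results = new ArrayList<>();
        for (int i = 0; i < nPairs; i++) {
            final int idx = i;
            results.add(pool.submit(() -> registerPair(idx)));
        }
        for (Future<Double> f : results) f.get(); // wait for all pairs
        pool.shutdown();
        System.out.println("done: " + results.size() + " pairs");
    }
}
```

With a pattern like this, all cores sit near 100 % until the queue drains, which is roughly the CPU profile I expected to see, rather than 1-2 busy cores. If the plugin instead serializes the pairs (or only parallelizes within one pair), that would explain what I’m observing.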
Many thanks for any ideas,