High-Performance FFT-based algorithms (Deconvolution, ...)

Other relevant tools for this task, not yet mentioned, to be aware of are:

  • The Insight Toolkit (ITK) - designed for 3D processing and segmentation of large datasets (disclaimer: I work on this project)
  • Dask - Distributed computing for Python
  • Zarr - chunked, compressed, N-D arrays: solves the parallel processing and efficient storage problems with these datasets
  • itkwidgets - 3D, interactive volume visualization in the Jupyter notebook for large, remote datasets
  • napari - Desktop visualization tool for 3D, multi-channel datasets
  • pytorch_connectomics - deep learning segmentation for connectomics

Here is a recent blog post on using ITK, Dask, and Zarr with lightsheet microscopy datasets. We plan to write more posts on this topic.

These are all open source solutions. The real advantage of open source solutions, which work on open standards, is that they can be combined and extended by you, in any way that you can imagine, to address new and challenging problems. Optionally, companies like Kitware can be hired to improve and extend the system if you do not have the desire or the time to extend the tools yourself.

Another advantage of open source tools is that you can learn how they work, which aligns with the knowledge-based mission of microscopy efforts.


Hi Matt

That’s a really cool example. Do you know where I can find the PSF? I looked through both blog posts and couldn’t see where to get it. I didn’t see it in the Google Drive folder with the original data either (though it’s hard to see all the files at once in the Google Drive web interface).


Yes! It is in the same Google Drive dataset – but a different folder: psfs_z0p1.


Hi Matt

Thanks again for sharing the example. I have a couple of quick questions. First off, did you use only 1 iteration of RL deconvolution? It looks like the default setting for iterations in richardson_lucy_deconvolution is 1, and it doesn’t look like that was overridden anywhere. Or am I missing something?

Did you do any experiments on smaller sets of data to verify that the deconvolution was producing a good result? I did a few experiments on your image using Richardson-Lucy deconvolution with 100 iterations. I found a couple of things.

  1. A GPU implementation of Richardson-Lucy was approximately 100 times faster than the ITK implementation.
  2. Even using 100 iterations of the GPU implementation I couldn’t get a very good result.

I wasn’t surprised that the result still isn’t good, as the PSF was noisy and would need to be cleaned up. I’ll probably give it another try sometime in the next couple of weeks, and work on cleaning the PSF and perhaps adding some more intelligent edge handling.

Let me know if you or your collaborators have already gone down that path, and if so, what you did to get better results.


@bnorthan thanks for the follow-up!

The blog post was intended to demonstrate how to deconvolve a very large image in Python. I did not explore the ideal number of iterations, GPU implementations, or refinement of the PSF, but work could certainly be done in this area. Perhaps you would like to co-author? We could also show a CUDA implementation, or even an OpenCL implementation via @haesleinhuepf's clij exposed via ITK Python packages.

I observed that deconvolution occurred and resolution improved – did you not observe this? How is a “good result” being defined?


Hi Matt

I’d be happy to co-author an article. Let me know how you think we should approach this. In the meantime, I’ve done some testing on phantom images to try to understand the behavior of the algorithms better. I’ve tested my version of the YacuDecu library (to which I’ve made a few enhancements) and ITK Richardson-Lucy here

The main issue right now with the ITK implementation is speed. I’ve deconvolved the ‘bars’ test image with 100 iterations of RL. YacuDecu takes 0.7 seconds for 100 iterations (see cell 9), while ITK takes 1200 seconds (see cell 12). Let me know if I’ve done anything incorrectly when calling the ITK version. In my experience, an optimized GPU version is about 5x faster than an optimized CPU version of deconvolution, not 1000x.

That is a good question, and it is difficult to quantify “resolution” improvement on real images, as we have no ground truth. Resolution itself is a nebulous concept. What deconvolution really does is increase the contrast between features, so if you are using a contrast-based criterion to define resolution (like the Rayleigh criterion), you can technically say “resolution” improved because the contrast between emitters improved.
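To make the contrast-based reading of "resolution" concrete, here is a small numpy sketch (the sigma and separation values are arbitrary, chosen purely for illustration): two Gaussian "emitters" a fixed distance apart, where narrowing the effective PSF, as deconvolution does, raises the dip contrast between them.

```python
import numpy as np

x = np.linspace(-5, 5, 1001)

def two_emitters(sigma, separation=2.0):
    """Sum of two Gaussian 'emitters' a fixed distance apart."""
    def g(mu):
        return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return g(-separation / 2) + g(separation / 2)

def dip_contrast(profile):
    """(peak - central dip) / peak: a Rayleigh-style contrast measure."""
    peak = profile.max()
    dip = profile[len(profile) // 2]  # midpoint between the two emitters
    return (peak - dip) / peak

blurred = two_emitters(sigma=1.0)    # wide effective PSF: the dip vanishes
sharpened = two_emitters(sigma=0.6)  # narrower effective PSF: a clear dip
print(dip_contrast(blurred), dip_contrast(sharpened))
```

With the wide PSF the two emitters merge into a single peak (contrast 0), while the narrower one gives a dip contrast of roughly 0.5, which a Rayleigh-style criterion would call "resolved".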

I ran another test here where I compared 1 iteration of RL to 100 iterations. Line profiles generated with ImageJ are below.

After 1 iteration of RL, there is only a subtle improvement (note that the edges of the hollow bar have slightly increased intensity and the background is slightly reduced); after 100 iterations there is much greater contrast enhancement.

I would be surprised if 1 iteration of RL (which is what was applied to the real image you shared) resulted in much contrast improvement. I did try 100 iterations on the real image, but I still didn’t get the contrast improvement I would have expected. In the next few days, I’ll make another attempt at deconvolving the real image and try to convey some of the issues I saw in a more quantitative way.


Just to make sure I give credit, I copied code from here to set up the python wrappers.


Hi @haesleinhuepf, @thewtex

So I’ve spent some more time testing different Richardson-Lucy implementations using the Bars image. I added the CLIJ implementation to my tests. Note that it is pretty slow. This almost certainly has nothing to do with any limitation of CLIJ itself; it is because CLIJ does not yet (as far as I know) have an FFT. Spatial convolution is far slower than FFT convolution for realistically sized PSFs, and the RL algorithm is built from convolutions, so any implementation based on spatial convolution will be slower. Realistically, to do deconvolution in CLIJ you will need a fast OpenCL FFT; a possible candidate is here.
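To sketch the point that RL is built from convolutions, and that FFT-based convolution costs the same regardless of kernel size, here is a minimal 1-D Richardson-Lucy in numpy. This is illustrative only, not the YacuDecu, ITK, or CLIJ implementation; it assumes the PSF is given as a full-length, wrap-around kernel.

```python
import numpy as np

def fft_convolve(a, b):
    """Circular convolution via FFT: O(N log N) whatever the kernel size."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def richardson_lucy(observed, psf, iterations):
    """Minimal 1-D Richardson-Lucy; each iteration is two convolutions.

    `psf` is a full-length, wrap-around kernel (same shape as `observed`).
    """
    psf = psf / psf.sum()
    # Adjoint of circular convolution: convolve with the index-reversed PSF.
    psf_adjoint = np.roll(psf[::-1], 1)
    estimate = np.full_like(observed, observed.mean())
    for _ in range(iterations):
        blurred = fft_convolve(estimate, psf)
        ratio = observed / np.maximum(blurred, 1e-12)
        estimate = estimate * fft_convolve(ratio, psf_adjoint)
    return estimate
```

Because both convolutions per iteration run through the FFT, the per-iteration cost does not grow with the PSF extent, which is exactly what a spatial-domain implementation cannot avoid.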

The ITK implementation also seems to be slow. I haven’t had a chance to look into why. Do you know which FFT library the ITK implementation is using, @thewtex?

My Cuda/MKL/CLIJ/Java test is here.

And my Python Cuda/ITK test is here

Implementation   10 iterations (s)   100 iterations (s)
Ops Java         29                  287
Ops MKL          5                   28
Ops CUDA         1.95                3.184
CLIJ             823                 not tested
Python CUDA      0.3                 0.77
Python ITK       131                 1188

Hey @bnorthan,

thanks for your initiative! I love challenges. CLIJ being this slow contradicts my former comparison between ImageJ-Ops and CLIJ. Would you mind sharing details on the test computer? If it has several GPUs, I could imagine that entering a GPU name in brackets in this line could speed things up. Just guessing…

Regarding the FFT: I was thinking of using jtransforms, as we use it on our microscopes for focus-finding, but apparently it’s not maintained anymore. Do you have experience with it, by chance?


Oh and @bnorthan,

would you mind sharing the output of this line?



Hey @haesleinhuepf, I’m out for a bit, but will get more details about the test and test computer when I get back.

One thing I should mention very quickly is that spatial convolution slows down a great deal with kernel size. So the difference between my test and yours might be because I used a fairly large kernel (64 by 64 by 100). Any chance you could run your test with different kernel sizes? Perhaps compare a 5 by 5 by 5 kernel vs. a 50 by 50 by 50 one? If you don’t get a chance, I’ll do it myself later on when I get back.
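A back-of-envelope multiply count shows why kernel size matters so much (the 256^3 image size here is hypothetical, and constants are ignored): spatial convolution scales with the cube of the kernel edge, while FFT convolution does not depend on the kernel at all.

```python
import math

n = 256 ** 3  # voxels in a hypothetical 256^3 volume

def spatial_ops(k):
    """Multiplies for spatial convolution: one kernel pass per voxel."""
    return n * k ** 3

def fft_ops():
    """Forward FFT + pointwise multiply + inverse FFT (constants dropped)."""
    return 2 * n * math.log2(n) + n

for k in (5, 50):
    print(f"kernel {k}^3: spatial/FFT ratio ~ {spatial_ops(k) / fft_ops():.0f}x")
```

Going from a 5^3 to a 50^3 kernel multiplies the spatial cost by 1000, while the FFT cost is unchanged, which matches the intuition that a spatial-domain CLIJ implementation falls further behind as the PSF grows.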


Yes, that’s likely the reason. Time to get an FFT in CLIJ :wink:


Let me just add a link to gearshiFFT, co-authored by a friend of mine:

Hi @haesleinhuepf

Using your DeconvolveBenchmark as a template, I made DeconvolveBenchMarkGauss

The reason I used a Gaussian kernel is simply that it’s easy to control the size, and to run experiments on deconvolution speed vs. kernel size.
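For anyone following along in Python rather than Java, an equivalent size-parameterized kernel could be generated with numpy along these lines (a sketch, not the actual code from DeconvolveBenchMarkGauss):

```python
import numpy as np

def gaussian_kernel_3d(size, sigma):
    """Normalized 3-D Gaussian kernel; `size` directly controls the extent."""
    ax = np.arange(size) - (size - 1) / 2.0  # centered coordinates
    zz, yy, xx = np.meshgrid(ax, ax, ax, indexing="ij")
    k = np.exp(-(xx**2 + yy**2 + zz**2) / (2.0 * sigma**2))
    return k / k.sum()  # sums to 1, so the blur preserves total intensity

small = gaussian_kernel_3d(5, sigma=1.0)
large = gaussian_kernel_3d(50, sigma=8.0)
```

A single `size` argument makes it trivial to sweep kernel extents in a speed benchmark, which is the property being exploited here.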

(Just to make sure there is no confusion for anyone reading this: ops-experiments-CLIJ started with a mass cut-and-paste of @haesleinhuepf's code; I just put it into my project so I could easily import the CLIJ example into my other tests.)

Anyway, here is the output, including the output from CLIJ.info(). Note that CLIJ decon beats Ops with a small kernel, but Ops is faster with a large kernel. Also note (see the benchmark earlier in this post) that the implementation based on cuFFT is under 1 second for this size of image. I suspect you can get the same performance with clFFT or another GPU FFT.

Available CL backends:
  * net.haesleinhuepf.clij.clearcl.backend.jocl.ClearCLBackendJOCL@3961a41a
    Functional backend:net.haesleinhuepf.clij.clearcl.backend.jocl.ClearCLBackendJOCL@7af1cd63
    Best backend:net.haesleinhuepf.clij.clearcl.backend.jocl.ClearCLBackendJOCL@291120f4
Used CL backend: net.haesleinhuepf.clij.clearcl.backend.jocl.ClearCLBackendJOCL@1e6b9a95
ClearCL: ClearCLBase [mClearCLBackendInterface=net.haesleinhuepf.clij.clearcl.backend.jocl.ClearCLBackendJOCL@1e6b9a95, mPeerPointer=null]
  Number of platforms:1
     Number of devices: 1
     Available devices: 
     [0] GeForce RTX 2060 
        NumberOfComputeUnits: 30 
        Clock frequency: 1200 
        Version: 1.2 
        Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer 
        GlobalMemorySizeInBytes: 6222839808 
        LocalMemorySizeInBytes: 49152 
        MaxMemoryAllocationSizeInBytes: 1555709952 
        MaxWorkGroupSize: 1024 
        Compatible image types: [SignedNormalizedInt8, SignedNormalizedInt16, UnsignedNormalizedInt8, UnsignedNormalizedInt16, SignedInt8, SignedInt16, SignedInt32, UnsignedInt8, UnsignedInt16, UnsignedInt32, HalfFloat, Float]
Best GPU device for images: GeForce RTX 2060
Best largest GPU device: GeForce RTX 2060
Best CPU device: GeForce RTX 2060

PSF size is 19 19 19
deconvolve with ops 4579
deconvolve with clij 1989

PSF size is 31 31 31
deconvolve with ops 7072
deconvolve with clij 8429

PSF size is 43 43 43
deconvolve with ops 7051
deconvolve with clij 22638

PSF size is 55 55 55
deconvolve with ops 10601
deconvolve with clij 49704

Great work @bnorthan! Thanks for putting the time into your investigations.

From the output images, it appears that you have confirmed that all implementations are generating the same deconvolution result :+1:.

I did not know that ImageJ Ops had support for MKL and CUDA – nice!

Yes, the default FFT backend in ITK, the backend currently enabled in the ITK Python binaries, is VNL. VNL’s implementation is known to be slow relative to other implementations. ITK also has FFTW, MKL, and cuFFT optional backends. Here are some benchmark results on another FFT-limited operation:


FFTW is generally much faster than VNL and faster than MKL, but its distribution is limited by the GPL. MKL’s license is not the greatest, but it is redistributable, even if it can be painful. We will move towards enabling MKL in ITK’s conda-forge Python builds. cuFFT and clFFT can be much faster, but their performance improvement was limited in the above results due to an old PCIe bus (which is important to identify).

I gained experience with clFFT when it was initially released :older_man:, but it is worth revisiting. We are generating Python packages for ITK’s OpenCL backend, and it would be cool if we could bring it together with CLIJ.


MKL and CUDA support is experimental. We’ve (@hadim especially) worked out the mechanics of how to write the wrappers, build native and Java code together with Maven (and optionally CMake), and convert from imglib2 to C pointers. However, it’s not really a “back-end”, so to speak: anyone wanting to use MKL or CUDA in an op has to write the MKL/CUDA code and then wrap it manually, as opposed to writing the op in Java and having it intelligently choose a high-performance back-end under the hood.

Also, the imglib2-to-native converters are inefficient and need more work.

My experience is that transferring the data from device (GPU) to host is a performance killer, at least for deconvolution. To write a really fast implementation, all operations have to be on the GPU. A back-end for just the FFT isn’t enough; you need to do everything on the GPU, or you take something like a 10x performance hit.
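A rough back-of-envelope illustrates the transfer penalty (the bandwidth and volume size below are assumptions, chosen only for illustration): keeping the data GPU-resident needs two transfers total, while round-tripping it for every convolution multiplies that by the iteration count.

```python
# Illustrative only: assumed effective PCIe bandwidth and volume size.
volume_bytes = 1024 * 1024 * 512 * 4      # one float32 volume (2 GiB)
bandwidth = 12e9                          # assumed host<->device bytes/s
iterations, convs_per_iter = 100, 2       # RL: two convolutions per iteration

# Keep everything GPU-resident: upload once, download once.
resident = 2 * volume_bytes / bandwidth
# Round-trip the volume over the bus for every single convolution instead.
round_trip = iterations * convs_per_iter * 2 * volume_bytes / bandwidth

print(f"resident: {resident:.2f} s of transfer; "
      f"per-convolution round trips: {round_trip:.1f} s")
```

Under these assumed numbers, per-convolution round trips cost 200x more bus time than a GPU-resident pipeline, which dwarfs the FFT compute itself and matches the "10x overall hit" experience above once compute is included.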


Hey @thewtex,

THIS - 1000 times this. Is it ready for users? If yes, where is it? How can I test it? I may need a minimal working example, as I’m not so much a Python person :wink:




Yes! It will be interesting to see how we manifest this with clFFT / deconvolution / CLIJ / ITK.

Another comparison and reference that we can also wrap into Python via ITK is ArrayFire’s implementation, which has both OpenCL and CUDA backends.

:smiley: !

Only adventurers :wink:

Currently under active development, but a preview:

$ python -m pip install itk-gpucommon

(this is currently only available for Linux :penguin:, but OSX and Windows are coming…)

$ python
>>> import itk
>>> reducer = itk.GPUReduction[itk.SS].New()
>>> reducer.RandomTest()

That’s what I am. Accelerated thanks!


Hi @haesleinhuepf

Were you thinking of porting jtransforms to CLIJ? Or just using the Java version for other things you are working on? I have tested it a bit, but moved on to MKL and cuFFT, since I am comfortable writing native code with those libraries.

I am running tests on ArrayFire FFT right now (a wrapper to cufft or opencl) and will report some results soon.
