High-Performance FFT based algorithms (Deconvolution, ...)

Hi @bnorthan,

I was thinking of using (wrapping) JTransforms or some other FFT library with minimal dependencies in CLIJ. CUDA is of course out of the question, as CLIJ aims to be usable by everyone and not just NVidia customers :wink:

If you think another library would be a better choice than JTransforms, I would love to learn more! I trust your expertise! :smirk:


Hi @haesleinhuepf,

I assume you are aware that @mweigert has some bare-bones OpenCL-based RL deconvolution in gputools? It uses reikna for the OpenCL-based FFT. I believe reikna is Python-specific; I'm not sure whether clFFT can be a drop-in replacement if it is to work with Java as well.
Martin’s code can be sped up quite a bit by saving the FFT plan (which was recomputed for each iteration) and keeping the PSF and intermediate results in GPU memory. I tried such modifications at some point and got speeds comparable to CUDA-based libraries (flowdec). However, I had more artefacts and didn’t have time to look into that (it probably requires better boundary padding strategies).

I think the thread has drifted quite considerably from the original question about BDV and general 3D processing and the title no longer reflects this. Maybe one of the forum admins (@imagejan) can split the thread and change the title so it can be found more easily (if the main contributors to the thread agree)?


I’m fine with splitting the latter part of the thread into a separate thread. Perhaps call it “High performance FFT based algorithms (Deconvolution, etc.)”. That title would indicate that while we are focusing on Richardson Lucy deconvolution in the discussion, the strategies should be applicable to other algorithms based on FFT.

(also fine with just using this thread, if it’s a pain to split).

Yes, to get optimal performance for clfft you need to

  1. Create the FFT plans only once and re-use them (it’s expensive each time you create them)
  2. Run the algorithm completely in GPU memory, as it’s expensive to transfer data to the CPU and back.
  3. Carefully choose the right padding size and strategy (the size needs to be “FFT-friendly”)

So to have an optimal implementation in CLIJ, you’d either need mechanisms to keep track of all this (like Java wrappers to the plans), or implement most of the algorithm in C/C++ and have a higher-level API to call the deconvolution (or other FFT-based algorithm).
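To make the plan-reuse point concrete, here is a minimal CPU-only sketch of the RL iteration in plain NumPy (my own illustration, not CLIJ or gputools code): the OTF is computed once outside the loop, which is the analogue of creating the clfft plan once, and intermediates are reused. The `richardson_lucy` name and the assumption that the PSF is already padded to the image shape with its center at index 0 are mine.

```python
import numpy as np

def richardson_lucy(image, psf, n_iter=10):
    """Minimal FFT-based RL. Assumes psf has image.shape with its center at index 0."""
    # Compute the OTF once, outside the loop -- the analogue of creating
    # the FFT plan once instead of on every iteration.
    otf = np.fft.rfftn(psf)
    est = np.full_like(image, image.mean())
    for _ in range(n_iter):
        # forward model: blur the current estimate with the PSF
        conv = np.fft.irfftn(otf * np.fft.rfftn(est), s=image.shape)
        ratio = image / np.maximum(conv, 1e-12)
        # correlate the ratio with the PSF (conjugate OTF) and update
        est = est * np.fft.irfftn(np.conj(otf) * np.fft.rfftn(ratio), s=image.shape)
    return est
```

On the GPU the same structure applies: keep the OTF, the estimate and the scratch buffers in device memory and only transfer the final result back.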

I’m currently playing around with the OpenCL-based FFT interface. If I want to make an RL implementation, would you recommend an OpenCL-based math library? If so, which one? Or should I just write the kernels myself? (I just need multiplication, division and conjugation.)



To my knowledge, clij is the first “library” of that kind …

Thanks for the heads-up. As I didn’t follow this thread in full, I suggest some of the people involved take care of splitting the posts into one or several new topics. All users with moderator privileges should be able to do that. The procedure is outlined here under Off-topic digression:

@haesleinhuepf is a moderator already. I just granted moderation privileges to @VolkerH as well. Please let me know if this works. Thanks for helping keep the forum organized (and for the above discussion :slightly_smiling_face:)!

@imagejan Thanks for the bump to moderator.

I moved the discussion on FFT GPU away from the original thread (missed a few posts on first attempt, hope it is all good now).


@haesleinhuepf, @VolkerH, @thewtex

I spent some time this week playing with OpenCL (clfft) and ArrayFire and cooked up some rough experimental code to test FFT and deconvolution speed (very rough, only for hackers, debuggers and other early adopters; it still has some bugs and is only tested on power-of-2 sizes, but still informative for early benchmarks).

| Implementation | Time, 100 iterations (sec) |
| --- | --- |
| ITK | 1000+ |
| CUDA | 0.94 |
| OpenCL | 1.03 |
| ArrayFire (OpenCL backend) | 1.5 |

ArrayFire Richardson-Lucy: ~30 lines.

Pure OpenCL Richardson-Lucy: ~200 lines plus kernels.

2D FFT in OpenCL (clfft)

Thoughts and questions

  1. All of this code is in C++. @haesleinhuepf, do you have any examples showing how to combine CLIJ with existing C/C++ code? I assume you would just need a wrapper to pass a ClearCLBuffer to C/C++?
  2. ArrayFire is pretty slick. There is some overhead (about 50% in my experiment), so I’d need to test it more thoroughly (non-power-of-2 sizes, and understanding how the overhead scales with problem size). @thewtex, have you done much testing with ArrayFire? Are there any particular limitations to be aware of?
  3. To support FFT-based algorithms properly, CLIJ would need complex math. Have you implemented any complex kernels yet, @haesleinhuepf or @VolkerH? Part of the work for the OpenCL decon was in creating complex math kernels.
  4. It also seems that clFFT does not support all sizes. This means images would have to be padded to a supported size. This is similar to the situation with the FFT in ops, where we had to write some ops to calculate the supported FFT size.
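For point 4, the size calculation itself is simple to sketch. Below is a hypothetical helper (not the actual ops code); clFFT documents support for transform lengths whose prime factors come from a small set (2/3/5/7 in the baseline, with 11 and 13 added in later releases):

```python
def is_smooth(n, factors=(2, 3, 5, 7)):
    """True if n has no prime factors outside `factors`."""
    for f in factors:
        while n % f == 0:
            n //= f
    return n == 1

def next_smooth(n, factors=(2, 3, 5, 7)):
    """Smallest supported FFT length >= n, i.e. the size to pad up to."""
    while not is_smooth(n, factors):
        n += 1
    return n
```

For example, a 121-pixel axis (121 = 11 x 11) would be padded up to 125 (= 5^3), while a 512-pixel axis stays as-is.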

Hey @bnorthan

thanks again for your efforts.

I have some experience with wrapping DLLs on Windows in Java using JNAerator: example 1 example 2. Not sure if this also works on Linux/Mac.

But why not call OpenCL directly from Java via CLIJ? Would you mind pointing me to the C++ parts of clfft which contain the algorithmic work and would need to be translated to Java? I would be happy to port it to CLIJ's convolution plugin and potentially make it part of clij2.

Another question: what license does the ops-experiments code have? I recently ran into issues with incompatible licenses. CLIJ is BSD-3.

No, not yet.



The C++ part of clfft can be found here. There is a lot of complicated logic in setting up the FFT plan for given dimensions, data types and sizes, which is why I initially thought a C++ wrapper would be the way to go. (As an aside, I could easily make an interface with JavaCPP, with the caveat that it would have to use pointers and an additional conversion step, as I don’t yet know how to get a cl_mem from a ClearCLBuffer and pass it to C++. There is probably an easy way using a void* or something, but I was hoping to follow an existing example.)

If clfft can be converted directly to CLIJ that would be great.

Simplified BSD https://github.com/imagej/ops-experiments/blob/master/pom.xml#L35.


Every ClearCL object has a getPeerPointer() method which gives you the long pointer C++ should accept. The other way around might be a bit more complicated. But we do have some more any-kind-of-C++-array converters for clij in clearcontrol.

I think so, too. I'll take a closer look at the C++ code asap.


Hi @bnorthan,
Good timing, I actually had a play with this last week (partly inspired by the discussion here).
I haven’t actually created any OpenCL kernels myself so far; I have been using some of the kernels that came with @mweigert's gputools. I should look into it. The gputools approach does use complex dtypes and complex maths.

Here is an example of a simple RL deconvolution: https://github.com/VolkerH/Lattice_Lightsheet_Deskew_Deconv/blob/master/lls_dd/deconv_gputools.py (used on the examples here: https://github.com/VolkerH/pyDecon/blob/master/gputools_decon_examples.py). It has basically the same mechanics as gputools, with a few issues addressed that considerably improve speed (the FFT plan is reused, and intermediate data stays in GPU memory rather than being converted back to numpy). The OpenCL kernels are directly from @mweigert’s code, and in particular the in-place multiplications/divisions are a bit confusing, as sometimes the first operand is modified in place and sometimes the second.

I also attempted to add Biggs-Andrews acceleration to this, using David Hoffman’s plain Python code as a template: https://github.com/david-hoffman/pyDecon. I have managed to marry up the OpenCL-based RL core update step with a numpy-based acceleration calculation, and have started trying to port the acceleration calculation to gputools-based OpenCL. The latter does not work properly yet, and my gut feeling is that it is a real-valued vs. complex arithmetic issue. You can find those experiments in my GitHub fork of pyDecon (but this is very hacky/messy, not really intended for sharing at this stage).

Here’s a teaser using the Drosophila dataset from the EPFL sample datasets from the DeconvolutionLab2 paper (Daniel Sage et al.). It shows the original, 14 regular RL iterations, and 8 RL iterations with Biggs-Andrews acceleration (with the accelerations calculated using a mix of CPU numpy and GPU calculations).

So the next step for me is to make the acceleration calculations work correctly with gputools, but I won’t have time to look at it next week.
I have a gut feeling the acceleration calculation would almost be easier to implement directly as a C kernel, but I need to read up on that.
Note that I was basically just hacking around using code snippets from @mweigert and David Hoffman…
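For what it's worth, the vector-extrapolation form of the acceleration (following Biggs & Andrews) is compact enough to sketch in plain NumPy. This is my own rough CPU-only sketch, not the pyDecon or gputools code; the function names and the assumption that the PSF already matches the image shape with its center at index 0 are mine:

```python
import numpy as np

def rl_step(y, image, otf, shape):
    """One multiplicative RL update of estimate y."""
    conv = np.fft.irfftn(otf * np.fft.rfftn(y), s=shape)
    ratio = image / np.maximum(conv, 1e-12)
    return y * np.fft.irfftn(np.conj(otf) * np.fft.rfftn(ratio), s=shape)

def accelerated_rl(image, psf, n_iter=10):
    """RL with Biggs-Andrews-style extrapolation; psf has image.shape, center at index 0."""
    otf = np.fft.rfftn(psf)
    x_prev = np.full_like(image, image.mean())
    g_prev = np.zeros_like(image)
    y = x_prev
    for _ in range(n_iter):
        x = rl_step(y, image, otf, image.shape)
        g = x - y                                  # update direction this iteration
        denom = float(np.vdot(g_prev, g_prev))
        # acceleration factor from successive update directions, clipped to [0, 1)
        alpha = min(max(float(np.vdot(g, g_prev)) / denom, 0.0), 0.99) if denom > 0 else 0.0
        y = np.maximum(x + alpha * (x - x_prev), 0)  # extrapolated prediction point
        x_prev, g_prev = x, g
    return x_prev
```

The whole thing is element-wise apart from the FFTs, which is why it also maps naturally onto a handful of OpenCL kernels.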


That would be great. Having all the parts needed for the decon algorithm as C++-hosted OpenCL kernels would make them easily accessible from both Java and Python, and would be nice and clean.


:+1: In the process of setting up the C++ interface, if we can integrate it into ITK pipelines, that would provide another integration node for C++ and Python that avoids data transfers to/from the GPU.

@bnorthan great work!

The licensing of ArrayFire is fine. For Python packaging, we will have to take extra steps to distribute the ArrayFire SDK. The ArrayFire deconvolution implementations are based on what is available in ITK, so there is more beyond Richardson-Lucy.

One approach is to have the FFT implementations provide the GreatestPrimeFactor they support, and use this to determine default padding in the deconvolution implementation, but allow it to be overridden so the same padding can be used for performance comparisons.


AFAIK gputools uses reikna

Wow, lots of cool stuff in the thread! :slight_smile:

My two cents (python specific)

  • pyopencl is imho a great and actively maintained Python library for basic OpenCL work, with a nice API for real- and complex-valued arrays/buffers that supports basic mathematical functions/operations (mostly elementwise, but also scans and reductions/sums). E.g. the code x *= y will automatically generate in-place kernels, which is pretty neat.

  • I always had a hard time building clFFT, especially on OSX (it never worked on any of my Macs). Has anyone got it working on OSX, and is there a trick you remember? Still, I think distribution (without binaries) might be an issue.

  • reikna provides a pure pyopencl FFT implementation. Imho it only does c2c and is thus slower (factor ~2), but it works cross-platform without binaries/compilation (the main reason it’s used in gputools). Not too useful outside Python, though.

  • @bnorthan Wow, pretty exhaustive comparisons! :slight_smile:
    Here’s a gist with the same example in pyopencl/reikna/gputools that takes roughly 0.9 s for 100 iterations (mind that this uses no real padding/tapering, just a short demo)

  • @VolkerH That looks so cool, even more so with napari! :slight_smile: As for the Biggs acceleration, I remember that @talley already tried out some things too… (And sorry for the confusing in-place kernels :wink:)

  • On a different note (I think @VolkerH knows more), there is flowdec, which does RL deconv with tensorflow FFTs (and all the correct padding etc.). The above example runs in 1 s / 100 iterations. Pretty impressive (but of course tied to tensorflow).




@mweigert, thanks for those comments

Regarding flowdec: I am really happy with the results I am getting from flowdec, and thanks to the effort that Eric Czech put into proper padding and documentation it is a really high-quality library. Apart from tensorflow limiting GPU acceleration to CUDA/NVidia hardware, as you already mentioned, making changes to the computation is not trivial, as one needs to build a tensorflow compute graph that gets executed at the end (i.e. no eager evaluation). At least for me that is quite a different way of programming compared to regular Python or C. Also, it is quite GPU-memory-hungry, as IIRC it allocates the arrays for the whole compute graph, where other implementations might perform some operations in place.

GPU memory is a problem in general for real-world data such as SPIM volumes. In particular with lower-end consumer GPUs, where OpenCL is often the only option, this will limit the practical applicability unless one also writes code that breaks the volumes into overlapping blocks, deconvolves them individually and fuses them back together. Dask provides a lot of the mechanics to do this easily, but at the cost of quite a bit of computational overhead (see here).
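The blocking mechanics themselves (what dask's map_overlap does) can be sketched without dask. Here is a hypothetical 1D version of my own (the name `map_overlap_1d` is made up), assuming the operation only needs `overlap` pixels of context so the halo fully absorbs the boundary effects:

```python
import numpy as np

def map_overlap_1d(func, arr, block, overlap):
    """Apply func block-wise with an overlap halo and stitch the results."""
    # reflect-pad so edge blocks also get a full halo
    padded = np.pad(arr, overlap, mode="reflect")
    out = np.empty_like(arr)
    for start in range(0, len(arr), block):
        stop = min(start + block, len(arr))
        chunk = padded[start:stop + 2 * overlap]   # block plus halo on both sides
        result = func(chunk)
        out[start:stop] = result[overlap:overlap + (stop - start)]  # crop the halo
    return out
```

For deconvolution the halo must be large enough to cover the PSF support (per axis, in 3D), and the seams are only exact for operations whose range of influence fits inside the overlap; RL is not strictly local, which is why block fusion strategies get complicated.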


Do these wrappers go both ways? It would be nice to understand how to call C++ OpenCL code from Java while avoiding memory transfers. I played around with ClearCLBuffer a bit and managed to get the long pointer, but couldn’t do anything useful with it on the C++ side.

That is probably because I was using a different cl_context and a different cl_command_queue on the C++ side, and I don’t think it is safe to use memory from one context with another. Plus, it’s unclear to me what cl_mem represents on the Java and C++ sides. It looks like a structure on the C++ side, and some kind of native pointer on the Java side.

In an ideal world it would be nice to be able to pass the core OpenCL structures (cl_context, cl_command_queue, cl_mem) from Java to C++, to avoid re-initializing the framework and (as you mention) also avoid extraneous GPU-to-host-and-back memory transfers.



A quick update.

I figured out how to get the pointers to the context, queue and cl_mem structures from CLIJ and pass them to C++ OpenCL code (using a JavaCPP wrapper). This is really valuable, at least to me, because theoretically I can now call any existing C++ OpenCL code directly on existing CLIJ cl_mem structures, without having to perform extra transfers.

However, I currently have a terrible hack to get the pointers: nativePointer in NativePointerObject is protected, but I managed to parse it out of cl_mem's toString() output.

Obviously this wouldn’t work in general, so what is the proper way to get the long pointer in CLIJ?

Here is what I have

ClearCLBuffer gpuInput = clij.push(img);
ClearCLPeerPointer clPointer = gpuInput.getPeerPointer();
Object pointer = clPointer.getPointer();

and this outputs

class net.haesleinhuepf.clij.clearcl.ClearCLPeerPointer
class org.jocl.cl_mem

So I have a cl_mem, but that derives from NativePointerObject and does not seem to expose the address of the long pointer… thus my hack…

@haesleinhuepf, am I overlooking the function I need to use to get the long pointer? Or is there a utility that does it?



Hey @bnorthan,

wow, you’re digging really deep. I’m not aware of any utility class. I inspected the NativePointerObject in detail and concluded that it is not possible to read out the nativePointer field. Conclusion: we need to make it public. I just quickly tested on a fork and it looks like it’s working:

    public static void main(String... args) {
        CLIJ clij = CLIJ.getInstance();

        ClearCLBuffer buffer = clij.createCLBuffer(new long[]{10, 10}, NativeTypeEnum.Float);

        cl_mem memPointer = (cl_mem) buffer.getPeerPointer().getPointer();
        System.out.println("Buffer: " + memPointer.getNativePointer());

        cl_context contextPointer = (cl_context) clij.getClearCLContext().getPeerPointer().getPointer();
        System.out.println("Context: " + contextPointer.getNativePointer());
    }
Feel free to try it on your system!

I filed a pull request against the JOCL library. Let’s see what happens.

Merry Christmas btw :christmas_tree: :smiley:



Hey all

Thanks for all the links and information in this thread. Based partly on the info here, I’ve written a proof-of-concept Fiji Jython script that calls 3D FFT-based deconvolution directly on a CLBuffer.

(btw @haesleinhuepf, any further thoughts on wrapping clFFT vs. re-implementing a complete CLIJ FFT library? I personally don’t mind writing low-level stuff in C, because then I can easily wrap it as Matlab MEX interfaces and Python bindings, in addition to Java.)

Anyway, implementing clFFT-based RL required a few parts, some of which are briefly (and very roughly) outlined below.

Complex Kernels

I needed kernels and a storage format for complex math. In this case I used interleaved float arrays, because that is the format returned by clFFT. Example complex kernels are here. Similar kernels are available elsewhere.
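In NumPy terms, the interleaved layout and two of the operations look like this (a plain-NumPy restatement of what such kernels compute, for illustration only; these are not the actual OpenCL kernels, and the function names are mine):

```python
import numpy as np

def complex_multiply_interleaved(a, b):
    """Pointwise complex multiply on [re0, im0, re1, im1, ...] float arrays."""
    ar, ai = a[0::2], a[1::2]
    br, bi = b[0::2], b[1::2]
    out = np.empty_like(a)
    out[0::2] = ar * br - ai * bi  # real part
    out[1::2] = ar * bi + ai * br  # imaginary part
    return out

def complex_conjugate_interleaved(a):
    """Conjugate: negate the imaginary components."""
    out = a.copy()
    out[1::2] *= -1
    return out
```

In an OpenCL kernel, each work item would handle one (re, im) pair, using the same 2*i / 2*i+1 index arithmetic.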

Get and install clFFT

Get the clFFT binaries, then unpack them to /opt/clFFT/ on Linux.

Windows instructions still “TODO”.

Real to Complex 2D FFT using clFFT

It may be feasible to wrap all of clFFT with Java, and perhaps this is the direction CLIJ will go in.
However, in many cases algorithm developers will want to work in native C code anyway (to make it easier to use and wrap in other frameworks). Also, clFFT may be too obtuse for routine use. Thus I worked out a hack to get the long pointer to a CLBuffer so I could pass it to C code. @haesleinhuepf currently has a pull request that exposes the long pointer, so in the future we can avoid the hack.

An example 2D real-to-complex FFT function which works on long pointers to existing GPU memory, context and queue.

3D Richardson Lucy Deconvolution using clFFT

A clFFT-based Richardson-Lucy implementation that works on long pointers to existing GPU memory, context and queue.

JavaCPP Wrappers

Native build scripts here and here.

The JavaCPP Wrapper is here

Native Build and Wrapper Build are invoked from the POM.

How to call from Java

An example showing how to call the 2D real-to-complex FFT.
Note: clFFT does not support all sizes, so the image needs to be extended to a smooth number first.

img = (RandomAccessibleInterval<FloatType>) ops.run(
        DefaultPadInputFFT.class, img, img, false);

An example showing how to call 3D deconvolution directly on GPU memory (CLBuffer)

Note: the PSF is passed in as a Java RAI because additional conditioning needs to be done: the PSF needs to be extended to the image size and shifted so its center is at 0,0.

// extend and shift the PSF
RandomAccessibleInterval<FloatType> extendedPSF = Views.zeroMin(ops.filter().padShiftFFTKernel(psf, new FinalDimensions(gpuImg.getDimensions())));

// transfer PSF to the GPU
ClearCLBuffer gpuPSF = clij.push(extendedPSF);
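The pad-and-shift conditioning is easy to illustrate in NumPy (a sketch of the same idea as ops' padShiftFFTKernel, not its actual implementation; the helper name `pad_shift_psf` is mine):

```python
import numpy as np

def pad_shift_psf(psf, shape):
    """Zero-pad psf to `shape` and roll it so its central pixel sits at index (0, 0, ...)."""
    out = np.zeros(shape, dtype=psf.dtype)
    out[tuple(slice(0, s) for s in psf.shape)] = psf  # place psf in the corner
    # roll the center (at s // 2 along each axis) back to the origin, wrapping
    # the "negative" half of the kernel to the far end of each axis
    return np.roll(out, tuple(-(s // 2) for s in psf.shape), axis=tuple(range(psf.ndim)))
```

With the PSF stored this way, a pointwise multiply in Fourier space yields an unshifted convolution, which is what the FFT-based RL update expects.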