High-Performance FFT-Based Algorithms (Deconvolution, ...)

Hi @bnorthan,

I was thinking of using (wrapping) JTransforms or another FFT library with minimal dependencies in CLIJ. CUDA is of course out of the question, as CLIJ aims to be usable by everyone and not just NVidia customers :wink:

If you think another library is a better choice than JTransforms, I would love to learn more! I trust your expertise! :smirk:

Cheers,
Robert

Hi @haesleinhuepf,

I assume you are aware that @mweigert has some bare-bones OpenCL-based RL deconvolution in gputools? It uses reikna for the OpenCL-based FFT. I believe reikna is Python-specific; I'm not sure whether clFFT can be a drop-in replacement if it is to work with Java as well.
Martin’s code can be sped up quite a bit by saving the FFT plan (which was recomputed for each iteration) and keeping the PSF and intermediate results in GPU memory. I tried such modifications at some point and got speeds comparable to those of CUDA-based libraries (flowdec). However, I had more artefacts and didn’t have time to look into that (it probably requires better boundary-padding strategies).

I think the thread has drifted quite considerably from the original question about BDV and general 3D processing and the title no longer reflects this. Maybe one of the forum admins (@imagejan) can split the thread and change the title so it can be found more easily (if the main contributors to the thread agree)?

2 Likes

I’m fine with splitting the latter part of the thread into a separate thread. Perhaps call it “High performance FFT based algorithms (Deconvolution, etc.)”. That title would indicate that while the discussion focuses on Richardson-Lucy deconvolution, the strategies should be applicable to other FFT-based algorithms.

(also fine with just using this thread, if it’s a pain to split).

Yes, to get optimal performance for clfft you need to

  1. Create the FFT plans only once and re-use them (creating them is expensive each time).
  2. Run the algorithm entirely in GPU memory, as transferring data to the CPU and back is expensive.
  3. Carefully choose the right padding size and strategy (the size needs to be “FFT-friendly”).

So to have an optimal implementation in CLIJ, you’d either need mechanisms to keep track of all this (like Java wrappers for the plans), or implement most of the algorithm in C/C++ and have a higher-level API to call the deconvolution (or other FFT-based algorithm).

I’m currently playing around with the OpenCL-based FFT interface. If I want to make an RL implementation, would you recommend an OpenCL-based math library? If so, which one? Or should I just write the kernels myself? (I just need multiplication, division and conjugation.)

Brian

1 Like

To my knowledge, clij is the first “library” of that kind …

Thanks for the heads up. As I didn’t follow this thread in full, I suggest some of the involved persons take care of splitting the posts into one or several new topics. All users with moderator privileges should be able to do that. The procedure is outlined here under Off-topic digression:

@haesleinhuepf is a moderator already. I just granted moderation privileges to @VolkerH as well. Please let me know if this works. Thanks for helping keep the forum organized (and for the above discussion :slightly_smiling_face:)!

@imagejan Thanks for the bump to moderator.

I moved the discussion on FFT GPU away from the original thread (missed a few posts on first attempt, hope it is all good now).

2 Likes

@haesleinhuepf, @VolkerH, @thewtex

I spent some time this week playing with OpenCL (clfft) and ArrayFire and cooked up some rough experimental code to test FFT and deconvolution speed. (Very rough, only for hackers, debuggers and other early adopters; it still has some bugs and has only been tested on power-of-2 sizes, but it is still informative for early benchmarks.)

| Implementation | Time, 100 iterations (sec) |
| --- | --- |
| ITK | 1000+ |
| CUDA | 0.94 |
| OpenCL | 1.03 |
| ArrayFire (OpenCL backend) | 1.5 |

ArrayFire Richardson-Lucy: ~30 lines.

Pure OpenCL Richardson-Lucy: ~200 lines plus kernels.

2D FFT in OpenCL (clfft)

Thoughts and questions

  1. All of this code is in C++. @haesleinhuepf, do you have any examples showing how to combine CLIJ with existing C/C++ code? I assume you would just need a wrapper to pass a ClearCLBuffer to C/C++?
  2. ArrayFire is pretty slick. There is some overhead (about 50% in my experiment), so I’d need to test it more thoroughly (non-power-of-2 sizes, and how the overhead scales with problem size). @thewtex, have you done much testing with ArrayFire? Are there any particular limitations to be aware of?
  3. To support FFT-based algorithms properly, CLIJ would need complex math. Have you implemented any complex kernels yet, @haesleinhuepf or @VolkerH? Part of the work for the OpenCL decon was creating complex math kernels.
  4. It also seems that clFFT does not support all sizes. This means images would have to be padded to a supported size. This is similar to the situation with the FFT in ops, where we had to write some ops to calculate the supported FFT size.
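For what it’s worth, the elementwise complex operations from point 3 are small. Here is the same arithmetic an OpenCL kernel would perform on float2 (re, im) pairs, written in plain Python purely as an illustration (not the actual ops-experiments kernels) and checked against Python’s built-in complex type:

```python
# Elementwise complex multiply, divide and conjugate on (re, im) pairs,
# exactly as a float2-based OpenCL kernel would compute them.
def cmul(a, b):
    return (a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0])

def cdiv(a, b):
    d = b[0] * b[0] + b[1] * b[1]  # |b|^2; multiply by conjugate of b
    return ((a[0] * b[0] + a[1] * b[1]) / d,
            (a[1] * b[0] - a[0] * b[1]) / d)

def conj(a):
    return (a[0], -a[1])

# Sanity checks against the built-in complex type:
a, b = (1.0, 2.0), (3.0, -4.0)
assert abs(complex(*cmul(a, b)) - (1 + 2j) * (3 - 4j)) < 1e-12
assert abs(complex(*cdiv(a, b)) - (1 + 2j) / (3 - 4j)) < 1e-12
assert complex(*conj(a)) == (1 + 2j).conjugate()
```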
3 Likes

Hey @bnorthan

thanks again for your efforts.

I have some experience with wrapping DLLs on Windows in Java using JNAerator: example 1 example 2. Not sure if this also works on Linux/Mac.

But why not call OpenCL directly from Java via CLIJ? Would you mind pointing me to the C++ parts of clfft that contain the algorithmic work and would need to be translated to Java? I would be happy to port it to CLIJ’s convolution plugin and potentially make it part of CLIJ2.

Another question: what license does the ops-experiments code have? I recently ran into issues with incompatible licenses. CLIJ is BSD-3.

No, not yet.

Thanks!

Cheers,
Robert

The C++ part of clfft can be found here. There is a lot of complicated logic in setting up the FFT plan for given dimensions, data types and sizes, which is why I initially thought a C++ wrapper would be the way to go. (As an aside, I could easily make an interface with JavaCPP, with the caveat that it would have to use pointers and an additional conversion step, as I don’t yet know how to get a cl_mem from a ClearCLBuffer and pass it to C++. There is probably an easy way using a void* or something, but I was hoping to follow an existing example.)

If clfft can be converted directly to CLIJ that would be great.

Simplified BSD https://github.com/imagej/ops-experiments/blob/master/pom.xml#L35.

1 Like

Every ClearCL object has a getPeerPointer() method which gives you the long pointer that C++ should accept. The other way around might be a bit more complicated. But we do have some more Any-Kind-Of-C++-Array converters for clij in clearcontrol.

I think so, too. I’ll take a closer look at the C++ code ASAP.

1 Like

Hi @bnorthan,
Good timing, I actually had a play with this last week (partly inspired by the discussion here).
I haven’t actually created any OpenCL kernels so far; I have been using some of the OpenCL kernels that come with @mweigert’s gputools. I should look into it. The gputools approach does use complex dtypes and complex maths.

Here is an example of a simple RL deconvolution: https://github.com/VolkerH/Lattice_Lightsheet_Deskew_Deconv/blob/master/lls_dd/deconv_gputools.py (used on the examples here: https://github.com/VolkerH/pyDecon/blob/master/gputools_decon_examples.py). It has basically the same mechanics as gputools, with a few issues addressed that considerably improve speed (the FFT plan is reused, and intermediate data stays in GPU memory rather than being converted back to numpy). The OpenCL kernels are taken directly from @mweigert’s code; in particular, the in-place multiplication/division kernels are a bit confusing, as sometimes the first operand is modified in place and sometimes the second.
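For reference, the RL core update step being discussed is compact. Here is a dependency-free Python sketch, with a real-space circular convolution standing in for the FFT-based convolutions the GPU implementations use (the names are illustrative, not the gputools API). One property worth testing is that each update conserves total flux:

```python
# Richardson-Lucy update step, 1D, circular boundary. GPU versions do
# the two convolutions via FFT; real-space convolution is used here
# only to keep the demo dependency-free.
def cconv(x, h):
    n = len(x)
    return [sum(h[j] * x[(i - j) % n] for j in range(n)) for i in range(n)]

def rl_update(estimate, observed, psf, psf_flipped):
    blurred = cconv(estimate, psf)
    ratio = [o / max(b, 1e-12) for o, b in zip(observed, blurred)]
    correction = cconv(ratio, psf_flipped)   # adjoint = conv with flipped PSF
    return [e * c for e, c in zip(estimate, correction)]

psf = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0]       # sums to 1
psf_flipped = [psf[0]] + psf[1:][::-1]       # h[-j] under circular indexing
observed = [0.1, 1.0, 3.0, 1.0, 0.1, 0.1]
estimate = [sum(observed) / len(observed)] * len(observed)  # flat start
for _ in range(10):
    estimate = rl_update(estimate, observed, psf, psf_flipped)

# RL conserves total flux and keeps the estimate non-negative:
assert abs(sum(estimate) - sum(observed)) < 1e-9
assert all(e >= 0 for e in estimate)
```

The multiply/divide steps in this update are exactly the in-place kernels mentioned above; which operand is modified in place is an implementation detail of each kernel.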

I also attempted to add Biggs-Andrews acceleration to this, using David Hoffman’s plain Python code as a template: https://github.com/david-hoffman/pyDecon . I have managed to marry up the OpenCL-based RL core update step with a numpy-based acceleration calculation, and have started trying to port the acceleration calculation to gputools-based OpenCL. The latter does not work properly yet; my gut feeling is that it is a real-valued vs. complex arithmetic issue. You can find those experiments in my GitHub fork of pyDecon (but this is very hacky/messy, not really intended for sharing at this stage).

Here’s a teaser using the Drosophila dataset from the EPFL sample datasets for the DeconvolutionLab2 paper (Daniel Sage et al.). It shows the original, 14 regular RL iterations, and 8 RL iterations with Biggs-Andrews acceleration (with the accelerations calculated using a mix of CPU numpy and GPU calculations).

So next step for me is to make the acceleration calculations work correctly with gputools, but I won’t have time to look at it next week.
I have a gut feeling the acceleration calculation would almost be easier to implement in C (directly as a kernel) but I need to read up on that.
Note that I was basically just hacking around using code snippets from @mweigert and David Hoffman…

2 Likes

That would be great. Having all the parts needed for the decon algorithm as C++/OpenCL kernels would make them easily accessible from both Java and Python, and would be nice and clean.

1 Like

:+1: In the process of setting up the C++ interface, if we can integrate it into ITK pipelines, that would provide another integration node for C++ and Python that avoids data transfers to/from the GPU.

@bnorthan great work!

The licensing of ArrayFire is fine. For Python packaging, we will have to take extra steps to distribute the ArrayFire SDK. The ArrayFire deconvolution implementations are based on what is available in ITK, so there is more beyond Richardson-Lucy.

One approach is to have the FFT implementations provide the GreatestPrimeFactor they support, and use this to determine the default padding in the deconvolution implementation, while allowing it to be overridden so the same padding can be used for performance comparisons.
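A minimal sketch of that size calculation, assuming the backend advertises the greatest prime factor it supports (the function names here are illustrative, not the actual ITK/ops API):

```python
# Find the next length >= n whose prime factors are all <= max_prime.
# Backends differ: a radix-2-only FFT needs max_prime=2, while
# FFTW-style backends handle {2, 3, 5, 7} efficiently.
def greatest_prime_factor(n):
    p, g = 2, 1
    while n > 1:
        while n % p == 0:
            g, n = p, n // p
        p += 1
    return g

def next_supported_size(n, max_prime):
    while greatest_prime_factor(n) > max_prime:
        n += 1
    return n

assert next_supported_size(100, 2) == 128   # next power of 2
assert next_supported_size(100, 5) == 100   # 100 = 2^2 * 5^2 already fits
assert next_supported_size(101, 7) == 105   # 101 is prime; 105 = 3*5*7
```

Padding each image dimension with `next_supported_size` before the forward FFT (and cropping after the inverse) is the same strategy the ops helpers implement.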

1 Like

AFAIK gputools uses reikna

Wow, lots of cool stuff in the thread! :slight_smile:

My two cents (python specific)

  • pyopencl is IMHO a great and actively maintained library for basic OpenCL work in Python, with a nice API for real- and complex-valued arrays/buffers that supports basic mathematical functions/operations (mostly elementwise, but also scans and reductions/sums). E.g. the code x *= y will automatically generate an in-place kernel, which is pretty neat.

  • I always had a hard time building clFFT, especially on OSX (e.g. it never worked on any of my Macs). Has anyone got it working on OSX, and was there a trick you remember? Still, I think distribution (without binaries) might be an issue.

  • reikna provides a pure pyopencl FFT implementation. IMHO it only does c2c and is thus slower (factor ~2), but it works cross-platform without binaries/compilation (the main reason it’s used in gputools). Not too useful outside Python, though.

  • @bnorthan Wow, pretty exhaustive comparisons! :slight_smile:
    Here’s a gist with the same example in pyopencl/reikna/gputools that takes roughly 0.9 s for 100 iterations (mind that this uses no real padding/tapering; it’s just a short demo):
    rl_deco

  • @VolkerH That looks so cool, even more so with napari! :slight_smile: As for the Biggs acceleration, I remember that @talley already tried out some things too… (And sorry for the confusing in-place kernels :wink:)

  • On a different note (I think @VolkerH knows more), there is flowdec, which does RL deconvolution with TensorFlow FFTs (and all the correct padding etc.). The above example runs in 1 s / 100 iterations. Pretty impressive (but of course tied to TensorFlow).
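As an aside on the c2c-vs-r2c factor of ~2 mentioned above: it follows from the Hermitian symmetry of a real input’s spectrum, X[k] = conj(X[N-k]), so an r2c transform only needs to compute and store roughly half the bins. A quick check with a naive DFT:

```python
import cmath

# DFT of a real signal: the spectrum is Hermitian-symmetric,
# X[k] == conj(X[N-k]), which is why r2c transforms can skip roughly
# half the work and storage of a c2c transform.
def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

x = [0.3, 1.0, -0.5, 2.0, 0.1, -1.2, 0.7, 0.4]   # real input
X = dft(x)
n = len(x)
for k in range(1, n):
    assert abs(X[k] - X[n - k].conjugate()) < 1e-9
# An r2c transform would store only bins 0 .. n//2 (here 5 of 8).
```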

Cheers,

Martin

4 Likes

@mweigert, thanks for those comments

Regarding flowdec: I am really happy with the results I am getting from flowdec, and thanks to the effort that Eric Czech put into proper padding and documentation, it is a really high-quality library. Apart from TensorFlow limiting GPU acceleration to CUDA/NVidia hardware, as you already mentioned, making changes to the computation is not trivial, as one needs to build a TensorFlow compute graph that gets executed at the end (i.e. no eager evaluation). At least for me, that is quite a different way of programming compared to regular Python or C. Also, it is quite GPU-memory hungry, as it allocates the arrays for the whole compute graph IIRC, where other implementations might perform some operations in place.

GPU memory is a problem in general for real-world data such as SPIM volumes. In particular with lower-end consumer GPUs, where OpenCL is often the only option, this will limit practical applicability unless one also writes code that breaks the volumes into overlapping blocks, deconvolves them individually, and fuses them back together. Dask provides a lot of the mechanics to do this easily, but at the cost of quite a bit of computational overhead (see here).
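The overlap-and-trim mechanics behind that blockwise approach (what dask’s map_overlap provides) can be sketched in 1D. Assuming the per-block operation has limited spatial support, processing overlapping blocks and trimming the halos reproduces the full-volume result; the 3-point moving average below is just a stand-in for deconvolution:

```python
# Blockwise processing with halos, 1D. Each block is extended by
# `overlap` samples on each side, processed, then the halo is trimmed
# before the pieces are stitched back together (map_overlap style).
def process(x):
    # Stand-in for deconvolution: 3-point moving average, edges replicated.
    padded = [x[0]] + list(x) + [x[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3
            for i in range(len(x))]

def blockwise(x, block, overlap):
    out = []
    for start in range(0, len(x), block):
        lo = max(0, start - overlap)
        hi = min(len(x), start + block + overlap)
        result = process(x[lo:hi])             # process block + halo
        keep = min(block, len(x) - start)      # trim halo, keep core samples
        out.extend(result[start - lo : start - lo + keep])
    return out

x = [float(i % 7) for i in range(50)]
# With enough overlap for the filter's support, blockwise == whole-volume:
assert blockwise(x, block=16, overlap=4) == process(x)
```

For deconvolution the effective support of the PSF (and of the iterations) is larger than one sample, so in practice the overlap must be chosen generously, which is where the computational overhead comes from.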

1 Like

Are these wrappers going both ways? It would be nice to understand how to call C++ OpenCL code from Java while avoiding memory transfers. I played around with ClearCLBuffer a bit and managed to get the long pointer, but couldn’t do anything useful with it on the C++ side.

That is probably because I was using a different cl_context and a different cl_command_queue on the C++ side, and I don’t think it is safe to use memory from one context with another. Plus, it’s unclear to me what cl_mem represents on the Java and C++ sides: it looks like a structure on the C++ side and some kind of native pointer on the Java side.

In an ideal world, it would be nice to be able to pass the core OpenCL structures (cl_context, cl_command_queue, cl_mem) from Java to C++, to avoid re-initializing the framework and (as you mention) also avoid extraneous GPU-to-host-to-GPU memory transfers.

Brian

1 Like