Dear friends of GPU-accelerated image processing and #clij,
in this thread we answer all the questions we collected during our NEUBIAS Academy webinar “GPU Accelerated Image Processing with CLIJ2” (available soon on YouTube). Thanks and credit for answering these questions go to:
- Bram van den Broek, BioImaging Facility / Dept. of Cell Biology, Netherlands Cancer Institute, Netherlands
- Romain Guiet, BIOP-EPFL Lausanne, Switzerland
- Matthias Arzt, Myers lab / Jug Lab, CSBD / MPI CBG, Dresden, Germany
- Marion Louveaux, Institut Pasteur Paris, France
- Robert Haase, Myers lab, CSBD / MPI CBG, Dresden, Germany
Furthermore, the materials provided during the webinar remain available online.
Questions have been lightly edited for typos. If some context was lost through this, please reply below and let’s clarify. Enjoy and post any missing question here!
Q&As Table of contents:
- GPU-accelerated image processing
- Usage in ImageJ Macro
- Graph-based image processing
- Hardware compatibility and cloud computing
- Working with big image data
- Technical details
- Other software and future perspectives
It comes from the Open Computing Language OpenCL. CL = Computing Language
CLIJ is not based on CUDA. It uses the OpenCL standard, which is supported by most graphics cards.
Yes. All pixels you push to the GPU count, independently of how they are grouped into dimensions.
Q4 (15:44) When discussing memory requirements for GPU-processing, does this refer to the size of the entire set of images that must be processed at once (a xyzt hyperstack for volumetric measurements)?
Yes, if you need to process the images at once, they need to fit into GPU memory. But if you can process them frame by frame or slice by slice, then only the single frames or slices need to fit on the GPU.
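To make the difference concrete, here is a back-of-the-envelope sketch in plain Python (not CLIJ code). The image dimensions and the 16-bit pixel type are made-up assumptions, and the memory for result images and temporary buffers is not included:

```python
# Rough memory estimate: whole xyzt hyperstack versus a single time point.
# All sizes below are hypothetical example values.

def stack_bytes(width, height, depth=1, frames=1, bytes_per_pixel=2):
    """Number of bytes needed to hold the pixels of an image stack."""
    return width * height * depth * frames * bytes_per_pixel

GIB = 1024 ** 3

# 1024 x 1024 pixels, 100 slices, 50 time points, 16-bit pixels
whole_bytes = stack_bytes(1024, 1024, depth=100, frames=50)
frame_bytes = stack_bytes(1024, 1024, depth=100)

print("whole hyperstack: %.2f GiB" % (whole_bytes / GIB))
print("single time point: %.2f GiB" % (frame_bytes / GIB))
```

In this example the single time point is 50 times smaller than the whole hyperstack, which is why frame-by-frame processing can make a dataset fit that otherwise would not.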
Technically this is feasible, and the crop and paste tutorial could be a starting point. It will be challenging though for two reasons: 600 GB is far beyond your GPU's memory. Furthermore, registration algorithms are not implemented in CLIJ yet.
GPU-accelerated image processing
CLIJ supports parallel access to multiple GPUs. In ImageJ macro, you cannot exploit multiple GPUs.
To exploit multiple GPUs, you need to use an object-oriented programming language. There are code examples available for Jython and Java.
Depending on your workflow you might be able to process your data faster. You are responsible for dividing tasks between GPUs. The easiest way to do this is to process one image on one GPU and another image (tile, time point, …) on a second GPU. Transferring data from one GPU to another GPU and back means two push and two pull commands. This may kill the speedup.
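Dividing the work could be as simple as the following plain-Python sketch. The device names and the `assign_round_robin` helper are hypothetical and not part of CLIJ; the sketch only shows the bookkeeping, not the GPU calls:

```python
# Assign independent work items (tiles, time points, ...) to GPUs in turn,
# so that no image has to be transferred from one GPU to another.

def assign_round_robin(items, devices):
    """Return a dict mapping each device to the items it should process."""
    plan = {device: [] for device in devices}
    for index, item in enumerate(items):
        plan[devices[index % len(devices)]].append(item)
    return plan

time_points = ["t0", "t1", "t2", "t3", "t4"]
plan = assign_round_robin(time_points, ["GPU1", "GPU2"])
print(plan)
```

Each GPU then pushes, processes and pulls only its own items, which avoids the extra push/pull round trips mentioned above.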
The GPU memory bandwidth is the important parameter. Please note that a modern GPU in an older workstation may suffer from suboptimal motherboard parameters (e.g. PCI bus speed).
Likely yes. You may only notice that by comparing high-end eGPUs with internal GPUs.
It appears reasonable that with GPUs using GDDR5 RAM, push and pull will not be much slower via a Thunderbolt cable compared to an internal GPU.
Every operation in CLIJ is multithreaded, as OpenCL is multithreaded by default.
Usage in ImageJ Macro
Using WEKA in combination with CLIJ appears very reasonable. CLIJx provides functions for that. Please note that these are experimental operations.
Yes, every dialog asks for the GPU to use. In ImageJ macro you identify the GPU when initializing it.
You find details in the basics tutorial.
No, it takes the image/window title. We decided this intentionally as it makes debugging workflows easier: The image name/title is visible on screen. The imageID is not. CLIJ2 ensures internally that generated image names do not appear twice.
Yes. In CLIJ, images are limited to three dimensions. However, you can process image stacks and sequences sequentially, time point by time point and channel by channel. This is recommended anyway, as multi-channel multi-frame datasets are often too large for the GPU memory. Furthermore, blurring between channels, for example, should be avoided. There is an example macro using CLIJ for processing a three-channel RGB image:
Technically that might be feasible. In order to implement that efficiently you may have to learn OpenCL.
CLIJx has a saveAsTif function for that. It’s very likely that this method will become part of CLIJ2 before release in a month.
To get started, take a look at the plugin-template for CLIJ. The upcoming template for CLIJ2 will be very similar. Depending on how deep you want to dive, you may have to learn OpenCL.
Yes, the used objects and classes are very different. ImagePlus in ImageJ corresponds to ClearCLBuffer in CLIJ for example. And the IJ class corresponds to the clij2 object of type CLIJ2.
Example code for virtually all commands is available in the API reference. The object/class hierarchy in CLIJ and CLIJ2 is also intentionally much simpler compared to ImageJ. Take a look at the Java examples for using CLIJ. CLIJ2 examples will follow soon.
You can combine any plugins in workflows that run inside Fiji and CLIJ if they support ImageJ macro.
Technically, this should be feasible, yes.
Unfortunately not. If you replace “Ext.CLIJ2_” with “clij2.” on the ImageJ macro cheat sheets, you basically have it.
You can push and pull imglib2 objects, yes. There is an example in Java/CLIJ available online.
Q39 (16:16) Hi, I had problems max projecting a 4D dataset in CLIJ1: it seems it does not recognize it as a movie and just projects all the images into one, whereas I would like to project frame by frame. Is this possible in CLIJ2?
Yes, you need to process your time-lapse frame by frame. You could for example use the method pushCurrentZStack() in a for-loop to achieve this.
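The logic of such a loop can be sketched in plain Python with nested lists standing in for images (this is not CLIJ code; `max_z_projection` is a hypothetical helper and the pixel values are made up). In a real macro, each iteration would push one Z-stack, project it on the GPU and pull the result:

```python
# Maximum Z-projection applied frame by frame to a tiny X-Y-Z-Time dataset.
# movie[t][z][y][x] holds the pixel values.

def max_z_projection(stack):
    """Project a stack[z][y][x] along Z, returning an image[y][x]."""
    return [[max(values) for values in zip(*rows)] for rows in zip(*stack)]

movie = [
    [[[1, 2], [3, 4]], [[5, 0], [0, 9]]],  # time point 0: two slices of 2x2
    [[[7, 7], [7, 7]], [[0, 8], [8, 0]]],  # time point 1: two slices of 2x2
]

# One projection per time point instead of one projection over everything
projections = [max_z_projection(stack) for stack in movie]
print(projections)
```

The result has one 2D image per time point, which is the behavior asked for in the question.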
All operations in CLIJ are parallelized. You can push an X-Y-Time stack to the GPU memory and process it. So technically yes. Four-dimensional images (X-Y-Z-Time) are not supported. It is recommended to process time lapses frame by frame or in spatially separated blocks.
In macro language this happens sequentially. When using object-oriented programming languages which support multi-threading, you can do this. You may gain an additional speedup by a factor of approximately 2 by doing so.
Brian Northan made FFT in OpenCL possible and it is accessible via CLIJx. Read more in this thread on the image.sc forum.
Graph-based image processing
There might actually be cameras having non-square shaped pixels. However, CLIJ2 supports these data structures to analyse cells.
It only relies on the matrices describing the connectivity in the graph. You find example workflows among the tutorials.
Excellent question! The Tribolium becomes multi-layered at that stage, but 25 neighbors is not realistic.
Obviously the cell segmentation is not precise enough in this region of the embryo. Furthermore, potential connections to yolk nuclei sitting below the surface are considered when counting neighbors.
We didn’t benchmark the mentioned Xeon Gold CPUs. However, CPUs, independent of the number of cores or clock rate, have the drawback that they use DDR4 RAM. GDDR5/6 RAM is faster in general. As image processing is typically memory bound, the memory bandwidth of the RAM is decisive.
You can measure the distance between neighbors using the averageDistanceOfTouchingNeighbors and averageDistanceOfNClosestPoints methods. Measuring distances between borders might be achievable utilizing a distance map.
You can for example measure the distance between any points at two timepoints using the generateDistanceMatrix method. This could be a starting point for implementing cell tracking on GPUs.
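As a rough illustration of that starting point, here is a plain-Python sketch (not CLIJ code). The coordinates are made up, and linking each point to its nearest successor is a deliberately naive assumption, not a full tracking algorithm:

```python
import math

# Distance matrix between point lists from two time points, followed by a
# naive nearest-neighbor link. Helper names are hypothetical.

def distance_matrix(points_a, points_b):
    """matrix[i][j] = Euclidean distance between points_a[i] and points_b[j]."""
    return [[math.dist(p, q) for q in points_b] for p in points_a]

def nearest_indices(matrix):
    """For every row, the column index with the smallest distance."""
    return [min(range(len(row)), key=row.__getitem__) for row in matrix]

points_t0 = [(0, 0), (10, 0)]
points_t1 = [(9, 1), (1, 0)]

distances = distance_matrix(points_t0, points_t1)
links = nearest_indices(distances)
print(links)  # point 0 at t0 links to point 1 at t1, and point 1 to point 0
```

A real tracker would also have to handle dividing, appearing and disappearing cells, and conflicts where two points pick the same successor.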
If you can extract spot positions from your images (in 2D or 3D), the shown workflows should be applicable.
They will be executed on CPU if they have been programmed for that. If you would favor specific plugins to be implemented on the GPU, please get in touch to guide the developers of the clEsperanto project.
All operations in CLIJ are parallelized implementations, as OpenCL processes in parallel by default.
Furthermore, the time domain is also just a dimension. You can for example push an X-Y-Time stack to the GPU and process it as any other 3D stack.
You can get a matrix of neighbors of neighbors using the function neighborsOfNeighbors. Internally, it just squares the touch matrix.
Further levels are possible by iteratively calling https://clij.github.io/clij2-docs/reference_neighborsOfNeighbors. Please note that in graph theory, if you have three cells which span a graph A-B-C, A is a second-order neighbor of A itself, as you can go from A to B and from B back to A: two steps, leading back to the origin. You can avoid that by setting the diagonal in the touch matrix to zero using the setWhereXequalsY method.
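The A-B-C example can be followed by hand with a small plain-Python sketch. This only illustrates the matrix arithmetic and is not CLIJ code; it assumes a symmetric 0/1 touch matrix without a background row, and the helper names are hypothetical:

```python
# Second-order neighbors of the graph A-B-C via squaring the touch matrix.

def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def binarize(matrix):
    """Turn path counts into a 0/1 connectivity matrix."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]

def zero_diagonal(matrix):
    """Remove self-connections by setting all entries with x == y to zero."""
    return [[0 if x == y else value for x, value in enumerate(row)]
            for y, row in enumerate(matrix)]

# Rows/columns: A, B, C. A touches B, B touches C.
touch = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]

second_order = binarize(mat_mul(touch, touch))
without_self = zero_diagonal(second_order)
print(second_order)   # A, B and C each appear as their own second-order neighbor
print(without_self)   # only A <-> C remains
```

Squaring counts the two-step paths between cells, which is why every cell with at least one neighbor shows up on the diagonal until it is cleared.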
Yes, the images you have seen during the presentation were generated like this.
With the replaceIntensities method applied to a labelmap and a given measurement vector, you can generate any kind of parametric image.
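The idea behind such a parametric image can be sketched in plain Python. This mimics the concept only and is not the CLIJ implementation; the function name, label map and measurement values are made up, and label 0 is assumed to be background:

```python
# Build a parametric image: every pixel of label i gets the i-th measurement.

def replace_labels_with_values(label_map, values):
    """Return an image where each label is replaced by values[label]."""
    return [[values[label] for label in row] for row in label_map]

label_map = [[0, 1, 1],
             [2, 2, 1]]

# Index 0 is the background, then one measurement per label
measurements = [0.0, 4.0, 2.5]

parametric_image = replace_labels_with_values(label_map, measurements)
print(parametric_image)
```

Any per-object measurement (size, neighbor count, mean distance, ...) can be visualized this way, as long as the vector has one entry per label.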
Color coding comes by applying lookup tables as usual in ImageJ. You may be able to implement the desired functionality by exploiting distance maps.
CLIJx has a skeletonize function. There is also a deconvolveFFT function programmed in collaboration with Brian Northan @bnorthan (actually he did most of the work): Read more in this thread on the image.sc forum. Please note that these methods are experimental.
In ImageJ macro, you can open images using the open method.
clEsperanto is planned for release in June 2021. Meanwhile, you have the choice between different versions of CLIJ: CLIJ1 and CLIJ2 for Fiji/ImageJ, clicy for Icy, clatlab for Matlab. Internally, all use the same code running on the GPU. Push/pull commands may be slower/faster depending on what images you transfer to/from the GPU.
Awesome. Let us know how it goes!
The presentation slide 12 may serve as such.
Yes, check out the tutorials in chapter “Working with matrices and graphs”.
For now the examples only cover CLIJ. CLIJ2 examples will follow. Furthermore, CLIJ2 supports auto-completion for CLIJ2 commands in Fiji's script editor.
This is unfortunately not possible. However, all plugins that are compatible with ImageJ/Fiji and are accessible via ImageJ macro can be combined with CLIJ in macros.
Hardware compatibility and cloud computing
Yes, it is. Don’t expect a huge speedup though. The integrated Intel HD 5500 graphics came to the market in 2015. Recent integrated GPUs have increased performance. Actually, if your computer has 64 GB of memory, the integrated GPU has access to 32 GB of memory. That’s much more than on dedicated graphics cards. Thus, there are also advantages when using integrated GPUs.
Usage wise there are no differences. Performance differs between models and vendors. When comparing resulting pixels, we observed tiny differences between Intel and NVidia, presumably related to numeric rounding errors. Details are reported in the supplemental material of the CLIJ paper.
Erick Martins Ratamero @erickratamero investigated this in more depth: He was the person using CLIJ in the cloud, and they did it by using a VM instance on Google Cloud. You get full admin access to the VM, so it was just a matter of installing drivers. There is an example implementation using the Zeiss Apeer Cloud and Java to access CLIJ in the cloud.
Integrated GPUs from Intel (HD, Iris) and AMD (Vega, RX) work nicely with CLIJ.
There is an example implementation using the Zeiss Apeer Cloud and Java to access CLIJ.
NVidia graphics cards which support OpenCL (virtually all GPUs do) also support CLIJ.
Check out the Intel website to find out prices of Intel CPUs.
Yes, that GPU was tested in a MacBook Air and works with CLIJ.
Working with big image data
That appears feasible. Please note that HDF5 is a file format. CLIJ doesn’t support any specific file formats. It just processes images. File handling is done by other libraries and plugins.
Q12 (15:47) A way of working around the Memory size limit should be to cut the data into small enough parts (which are in rough accordance with the memory rule of thumb), correct?
Q42 (16:21) Does CLIJ have functions to split datasets into small blocks? It seems it has to be done prior to pushing the data, but this also seems to be an operation that could benefit from GPU acceleration?
Q34 (16:13) Can you process an image in blocks if you don’t have enough memory?
Q32 (16:10) Is there some hints about the memory consumption for the different functions? I mean it’s clear that I always need memory for input and output image. But more complex operation might use temporary memory which I have to take into account.
You can debug memory consumption by calling reportMemory before and after executing an operation to find out how much memory it consumes.
Q35 (16:13) How to handle a 1 TB light sheet data set (xyzt)? Is it possible to break the data set in the time domain with 2-3 overlapping time points, process the parts individually, then find common objects within the overlapping time points, and fuse all the output data together at the very end? What is the best way to do it with TB xyzt data sets?
You can process data time-point by time-point. Depending on size of the image stacks per time point, you can push several of those and combine them during your analysis.
Handling depends on the desired analysis. Thus, there is no general answer. The most easily accessible ways are downsampling and slice-by-slice processing, using pushCurrentSlice for example.
Q52 (16:33) GPUs usually have a limit on image sizes. What happens if my image provides 4k 8k 16k objects?
Q75 (16:56) If my image contains 16k objects it will produce a distance matrix of 16k x 16k pixels. I know there are limits on image dimensions. So does it depend on the graphics card or on OpenCL?
You can process images in the GPU which are larger than images that can be handled by ImageJ.
Processing touch and distance matrices with 16k objects and more is absolutely feasible. However, you cannot push/pull these images. Thus, you need to treat the matrices exclusively in the GPU and only push/pull point lists and meshes. You may alternatively use crop and paste to push/pull such images tile-by-tile.
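To get a feeling for the numbers, here is a plain-Python sketch (not CLIJ code). It assumes 16384 objects, 32-bit float matrix entries and a tile size of 4096 pixels; all three are example values, and `tile_regions` is a hypothetical helper:

```python
# Size of a 16k x 16k matrix and the tile regions needed to transfer it
# piece by piece with crop and paste.

def tile_regions(width, height, tile_size):
    """Yield (x, y, tile_width, tile_height) regions covering the image."""
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            yield (x, y,
                   min(tile_size, width - x),
                   min(tile_size, height - y))

objects = 16384
matrix_bytes = objects * objects * 4          # 32-bit float entries
print(matrix_bytes / 1024 ** 3, "GiB")        # 1.0 GiB

regions = list(tile_regions(objects, objects, 4096))
print(len(regions), "tiles")                  # 16 tiles
```

Each region could then be cropped on the GPU and pulled on its own, or pushed and pasted in the other direction.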
Other software and future perspectives
The recorder in Icy was developed specifically for CLIJ2 and is at the moment limited to Clicy.
We would love to make this happen for C/C++. The ITK developers are also very open to support clEsperanto in this regard. You can read more about the goals and aims in the preliminary clEsperanto roadmap. Get in touch with the clEsperanto team to learn more.
Potentially, yes. CLIJ2 replaces CLIJ with a transition period of at least one year. Furthermore, they are implemented with best-possible backwards-compatibility. CLIJ may stay in its current state after a one-year transition time. After that, it will no longer be maintained but will still be available. As soon as clEsperanto is released, the same will happen with CLIJ2. Read more about the release cycle online.
As you can see in the video, the commands used are identical between the languages. No translation is necessary.