[NEUBIAS Academy@Home] Webinar "GPU-accelerated image processing with CLIJ2" + Questions & Answers

Dear friends of GPU-accelerated image processing and #clij,

in this thread we answer all the questions we collected during our NEUBIAS Academy webinar “GPU Accelerated Image Processing with CLIJ2” (available soon on YouTube). Thanks and credit for answering these questions go to:

  • Bram van den Broek, BioImaging Facility / Dept. of Cell Biology, Netherlands Cancer Institute, Netherlands
  • Romain Guiet, BIOP-EPFL Lausanne, Switzerland
  • Matthias Arzt, Myers lab / Jug Lab, CSBD / MPI CBG, Dresden, Germany
  • Marion Louveaux, Institut Pasteur Paris, France
  • Robert Haase, Myers lab, CSBD / MPI CBG, Dresden, Germany

Furthermore, the materials provided during the webinar remain available online.

Questions have been lightly edited for typos. If some context was lost through this, please reply below and let’s clarify. Enjoy and post any missing question here!

Q&As Table of contents:

Introduction

Q1 (04:12 PM) What does CL in CLIJ stand for?

It comes from the Open Computing Language OpenCL. CL = Computing Language

Q2 (15:38) I assume it is CUDA based?

CLIJ is not based on CUDA. It uses the OpenCL standard, which is supported by most graphics cards.

Q3 (15:44) Is a stack considered as one image? (Related to the rule of thumb)

Yes. All pixels you push to the GPU count, independent of how they are grouped into dimensions.

Q4 (15:44) When discussing memory requirements for GPU-processing, does this refer to the size of the entire set of images that must be processed at once (a xyzt hyperstack for volumetric measurements)?

Yes, if you need to process the images at once, they need to fit into GPU memory. But if you can process them frame by frame or slice by slice, then only the single frames or slices need to fit on the GPU.
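To make that sizing question concrete, here is a minimal sketch in plain Python (not CLIJ code). The function names, the example dimensions, the 8 GiB budget and the assumption that you need room for two buffers (input plus output) are all made up for illustration.

```python
GIB = 1024 ** 3

def stack_bytes(width, height, depth=1, frames=1, bits_per_pixel=16):
    """Size in bytes of an uncompressed image stack."""
    return width * height * depth * frames * (bits_per_pixel // 8)

def fits(num_bytes, gpu_memory_bytes, copies=2):
    """True if `copies` buffers of that size fit (assumption: input + output)."""
    return num_bytes * copies <= gpu_memory_bytes

# Hypothetical 16-bit dataset: 2048 x 2048 x 100 voxels, 50 time points
whole = stack_bytes(2048, 2048, 100, frames=50)  # entire xyzt hyperstack
frame = stack_bytes(2048, 2048, 100)             # a single time point

print(whole / GIB, fits(whole, 8 * GIB))
print(frame / GIB, fits(frame, 8 * GIB))
```

With these example numbers the whole hyperstack does not fit, while a single time point does, which is the frame-by-frame case described above.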

Q5 (15:45) I wrote a script with Fiji to stitch huge light sheet images (~600Gb), could I use CLIJ to accelerate it? I have an RTX8000 GPU

Technically this is feasible, and the crop and paste tutorial could be a starting point. It will be challenging though, for two reasons: 600 GB is far beyond your GPU’s memory. Furthermore, registration algorithms are not implemented in CLIJ yet.

GPU-accelerated image processing

Q9 (15:46) Does it work with multiple GPUs?

CLIJ supports parallel access to multiple GPUs. However, in ImageJ macro you cannot exploit multiple GPUs.
For that you need to use an object-oriented programming language. There are code examples available for Jython and Java.

Q17 (15:51) In these graphs did you account for GPU loading time?

The graphs originating from the paper and also available on the website only account for pure processing time. That’s why we benchmarked a workflow separately which includes push and pull times.

Q18 (15:53) Will two GPUs enable larger files to be processed, make the processing faster, or both, compared to one?

Depending on your workflow you might be able to process your data faster. You are responsible for dividing tasks between GPUs. The easiest way to do this is processing one image on GPU 1 and another image (tile, time point, …) on GPU 2. Transferring data from one GPU to another GPU and back means two push and two pull commands. This may kill the speedup.
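Dividing independent work items between devices can be as simple as a round-robin plan. The following plain-Python sketch only builds such a plan; it does not talk to any GPU, and the function name and the item labels are hypothetical.

```python
def assign_round_robin(items, num_devices):
    """Map each device index to the list of work items it should process."""
    plan = {device: [] for device in range(num_devices)}
    for i, item in enumerate(items):
        plan[i % num_devices].append(item)
    return plan

# Five hypothetical time points distributed over two GPUs
plan = assign_round_robin(["t0", "t1", "t2", "t3", "t4"], 2)
print(plan)  # {0: ['t0', 't2', 't4'], 1: ['t1', 't3']}
```

Each item stays on one device from push to pull, so no GPU-to-GPU transfer is needed.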

Q19 (15:53) Which parameter of the card determines the push/pull speed?

The GPU memory bandwidth is the important parameter. Please note that a modern GPU in an older workstation may suffer from suboptimal motherboard parameters (e.g. PCI bus speed).

Q20 (15:54) Will the push and pull times be much different if using an eGPU with a USB-C?

Likely yes. You may only notice that by comparing high-end eGPUs with internal GPUs.
It appears reasonable that, for GPUs with GDDR5 RAM, push and pull via a Thunderbolt cable will not be much slower than with an internal GPU.

Q21 (15:54) Is Gaussian blur filter multi-threaded?

Every operation in CLIJ is multithreaded, as OpenCL is multithreaded by default.

Usage in ImageJ Macro

Q22 (15:54) Is it possible to use GPUs in a machine learning workflow, like doing segmentation with the WEKA plugin in ImageJ?

Using WEKA in combination with CLIJ appears very reasonable. CLIJx provides functions for that. Please note that these are experimental operations.

Q23 (16:01) Is there a command to choose between a discrete or integrated graphics card on a laptop?

Yes, every dialog asks for the GPU to use. In ImageJ macro you identify the GPU when initializing it.
You find details in the basics tutorial.

Q24 (16:02) Does the CLIJ2_push() function take an imageID as an argument?

No, it takes the image/window title. We decided this intentionally, as it makes debugging workflows easier: the image name/title is visible on screen, the imageID is not. CLIJ2 ensures internally that generated image names do not appear twice.

Q25 (16:02) Is there a limit in the dimensions? CZT…?

Yes. In CLIJ, images are limited to three dimensions. However, you can process image stacks and sequences time point by time point and channel by channel. This is recommended anyway, as multi-channel multi-frame datasets are often too large for the GPU memory. Furthermore, blurring between channels, for example, should be avoided. There is an example macro using CLIJ for processing a three-channel RGB image.

Q26 (16:02) Can we integrate Neural nets with clij to execute the processing on GPU?

Technically that might be feasible. In order to implement that efficiently you may have to learn OpenCL.

Q27 (16:05) Can you write images directly to the hard drive, without pulling the result to an ImageJ window?

CLIJx has a saveAsTif function for that. It’s very likely that this method will become part of CLIJ2 before release in a month.

Q28 (16:08) Is there an integration for Python?

There are two functional prototypes to call CLIJ(2) operations from Python 3 which will eventually be integrated with clEsperanto: pyClEsperanto and clijpy.

Q29 (16:09) How can we develop a plugin using CLIJ2?

To get started, take a look at the plugin-template for CLIJ. The upcoming template for CLIJ2 will be very similar. Depending on how deep you want to dive, you may have to learn OpenCL.

Q30 (16:10) Or more specific, is there much difference between ij1 programming and clij2 programming

Yes, the used objects and classes are very different. ImagePlus in ImageJ corresponds to ClearCLBuffer in CLIJ for example. And the IJ class corresponds to the clij2 object of type CLIJ2.
Example code for virtually all commands is available in the API reference. The object/class hierarchy in CLIJ and CLIJ2 is also intentionally much simpler compared to ImageJ. Take a look at the Java examples for using CLIJ. CLIJ2 examples will follow soon.

Q31 (16:10) So, it’s possible to run any plugins? Even the ones not present in the basic Fiji?

You can combine any plugins in workflows that run inside Fiji and CLIJ if they support ImageJ macro.

Q33 (16:12) Can this be implemented to CellProfiler as well?

Technically, this should be feasible, yes.

Q36 (16:13) Is there a cheat sheet for object-oriented languages?

Unfortunately not. If you replace “Ext.CLIJ2_” with “clij2.” on the ImageJ macro cheat sheets, you basically have it.

Q38 (16:15) Does clij2 work also with imglib2 objects/arrays?

You can push and pull imglib2 objects, yes. There is an example in Java/CLIJ available online.

Q39 (16:16) Hi, I had problems max-projecting a 4D dataset in CLIJ1: it seems it does not recognize it as a movie and just projects all the images into one, whereas I would like to project frame by frame. Is this possible in CLIJ2?

Yes, you need to process your time-lapse frame by frame. You could for example use the method pushCurrentZStack() in a for-loop to achieve this.

Q40 (16:20) Can you process limited number of time points in parallel?

All operations in CLIJ are parallelized. You can push an X-Y-Time stack to the GPU memory and process it. So technically yes. Four-dimensional images (X-Y-Z-Time) are not supported. It is recommended to process time lapses frame by frame or in spatially separated blocks.

Q41 (16:21) Is CLIJ really making stitching faster ? Isn’t it an I/O limitation in stitching ?

CLIJ doesn’t have any methods specifically for stitching. However, affine transforms and image warping are commonly applied during stitching and are supported by CLIJ.

Q44 (16:23) Does CLIJ2 have commands to align images over long timecourses? (this can take a lot of time on a CPU)

CLIJx has some methods available such as translationRegistration, translationTimelapseRegistration and deformableRegistration2D. Take care, these are experimental functions, not fully tested.

Q45 (16:23) Is it possible to push a second image to the GPU memory while processing an already pushed first one?

In macro language this happens sequentially. When using object oriented programming languages which support multi-threading you can do this. You may gain additional speedup of approximately factor 2 by doing so.

Q46 (16:24) Does CLIJ implement a GPU-accelerated version of the fast Fourier transform (FFT)? Do I have to learn OpenCL kernels to develop my own Fiji plugin? Could CLIJ2 speed up this process?

Brian Northan made FFT in OpenCL possible and it is accessible via CLIJx. Read more in this thread on the image.sc forum.

Graph-based image processing

Q47 (16:24) From which instrument would you get pixels, that don’t have square shape?

There may actually be cameras with non-square pixels. However, CLIJ2 supports these data structures in order to analyse cells.

Q48 (16:27) In graph-based filtering, do you take into account “pixel” area, or does it rely only on the connectivity?

It only relies on the matrices describing the connectivity in the graph. You find example workflows among the tutorials.

Q49 (16:30) Is 25 neighbors per cell (slide 41) real or an artifact? Is the Tribolium a multi-layered epithelium?

Excellent question! The Tribolium becomes multi-layered at that stage, but 25 neighbors is not realistic.
Obviously the cell segmentation is not precise enough in this region of the embryo. Furthermore, potential connections to yolk nuclei sitting below the surface are considered when counting neighbors.

Q50 (16:31) Are there any comparative data between entry level Xeon Silver CPUs and high end Xeon Gold CPUs, or running it on a server with 500 CPU nodes (@TU Dresden, I guess)?

We didn’t benchmark the mentioned Xeon Gold CPUs. However, CPUs, independent of the number of cores or clock rate, have the drawback that they use DDR4 RAM. GDDR5/6 RAM is faster in general. As image processing is typically memory bound, the memory bandwidth of the RAM is decisive.

Q51 (16:31) Is there border-to-border distance - for non touching objects ? and also is there a way to get the distance to neighbor of neighbor ?

You can measure the distance between neighbors using the methods averageDistanceOfTouchingNeighbors and averageDistanceOfNClosestPoints. Measuring distances between borders might be achievable utilizing a distance map.

Q53 (16:33) Is it possible to link objects through time, and have a vector like “displacement” for example? How do you do that if you analyze each time point separately, as you propose for big data?

You can for example measure the distance between any points at two timepoints using the generateDistanceMatrix method. This could be a starting point for implementing cell tracking on GPUs.
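To illustrate the idea of a distance matrix between the points of two time points, here is a plain-Python sketch. It is not CLIJ code and says nothing about the layout of the matrix CLIJ produces; the function names, the coordinates and the naive nearest-point linking are assumptions for illustration only.

```python
import math

def distance_matrix(points_a, points_b):
    """Euclidean distance from every point in points_a to every point in points_b."""
    return [[math.dist(a, b) for b in points_b] for a in points_a]

def link_nearest(points_a, points_b):
    """For each point in points_a: (index of nearest point in points_b, displacement)."""
    links = []
    for a, row in zip(points_a, distance_matrix(points_a, points_b)):
        j = min(range(len(row)), key=row.__getitem__)
        displacement = tuple(b_i - a_i for a_i, b_i in zip(a, points_b[j]))
        links.append((j, displacement))
    return links

# Hypothetical spot positions at two consecutive time points
points_t0 = [(0.0, 0.0), (10.0, 0.0)]
points_t1 = [(11.0, 1.0), (1.0, 0.0)]

links = link_nearest(points_t0, points_t1)
print(links)  # [(1, (1.0, 0.0)), (0, (1.0, 1.0))]
```

Nearest-point linking breaks down when objects move further than their typical spacing, so a real tracker would need additional constraints.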

Q54 (16:35) Are there minimal specifications for your images in order to perform exactly the same graph-theory analysis you just showed?

If you can extract spot positions from your images (in 2D or 3D), the shown workflows should be applicable.

Q55 (16:36) I follow up on the question by Helene. Can we run other FIJI plugins (not included in CLIJ/CLIJ2) directly on the GPU? Or will those plugins run on the CPU? Thanks!

They will be executed on CPU if they have been programmed for that. If you would favor specific plugins to be implemented on the GPU, please get in touch to guide the developers of the clEsperanto project.

Q56 (16:36) Is parallel processing in the time domain hard to achieve?

All operations in CLIJ are parallelized implementations, as OpenCL processes in parallel by default.
Furthermore, the time domain is also just a dimension. You can for example push an X-Y-Time stack to the GPU and process it as any other 3D stack.

Q57 (16:36) and also to get the number of neighbor-of neighbors ?

You can get a neighbor matrix of neighbors using the function neighborsOfNeighbors. Internally, it just squares the touch matrix.

Q58 (16:36) and further level ?

Further levels are possible by iteratively calling https://clij.github.io/clij2-docs/reference_neighborsOfNeighbors. Please note that in graph theory, if you have three cells which span a graph A-B-C, A is a second-order neighbor of A itself, as you can go from A to B and from B back to A: two steps, leading back to the origin. You can avoid that by setting the diagonal in the touch matrix to zero using the setWhereXequalsY method.
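The A-B-C example can be checked by hand with a small plain-Python sketch. It only illustrates squaring an adjacency matrix and zeroing its diagonal; it is not CLIJ code, the helper names are hypothetical, and the matrix layout CLIJ uses internally may differ.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def zero_diagonal(m):
    """Return a copy of m with all entries where x == y set to zero."""
    return [[0 if i == j else v for j, v in enumerate(row)]
            for i, row in enumerate(m)]

# Touch matrix of the chain A-B-C (rows/columns in the order A, B, C)
touch = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]

second_order = mat_mul(touch, touch)
print(second_order)                 # [[1, 0, 1], [0, 2, 0], [1, 0, 1]]
print(zero_diagonal(second_order))  # [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
```

The non-zero diagonal of the squared matrix is the "two steps, leading back to the origin" effect; after zeroing it, only A and C remain as second-order neighbors of each other.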

Q59 (16:37) Can we reassign measurements to the label map, where the measurement value represents pixel intensities, and make a 3D heatmap-like output demonstrating that measurement in CLIJ2?

Yes, the images you have seen during the presentation were generated like this.
With the replaceIntensities method applied to a labelmap and a given measurement vector, you can generate any kind of parametric image.
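Conceptually, a parametric image is a lookup: every pixel's label is replaced by the measurement stored for that label. The plain-Python sketch below mimics only that idea, not CLIJ's implementation; the function name, the tiny label map and the measurement values are hypothetical, with index 0 of the vector assumed to belong to the background.

```python
def replace_intensities(label_map, values):
    """Replace each label in a 2D label map by values[label]."""
    return [[values[label] for label in row] for row in label_map]

label_map = [[0, 1, 1],
             [2, 2, 0]]
measurements = [0.0, 5.0, 2.5]  # background, label 1, label 2

parametric = replace_intensities(label_map, measurements)
print(parametric)  # [[0.0, 5.0, 5.0], [2.5, 2.5, 0.0]]
```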

Q60 (16:38) Can you use a specific region of an image (i.e. nucleus) as a bin center and generate color-coded distance maps to other regions of the cell?

Color coding comes by applying lookup tables as usual in ImageJ. You may be able to implement the desired functionality by exploiting distance maps.

Technical details

Q66 (16:44) Hi - two questions 1) are there any skeleton 3D functions? 2) is there FFT 3D support? Thanks!

CLIJx has a skeletonize function. There is also a deconvolveFFT function programmed in collaboration with Brian Northan @bnorthan (actually he did most of the work): Read more in this thread on the image.sc forum. Please note that these methods are experimental.

Q67 (16:46) Question for when there is enough time, i have tried the tiling, how would you import a local file instead of how the data is imported in the demo?

In ImageJ macro, you can open images using the open method.

Q68 (16:46) What is better, Icy or Matlab, or will it eventually all become clEsperanto?

clEsperanto is planned for release in June 2021. Meanwhile, you have the choice between different versions of CLIJ: CLIJ1 and CLIJ2 for Fiji/ImageJ, clicy for Icy, clatlab for Matlab. Internally, all use the same code running on the GPU. Push/pull commands may be slower/faster depending on what images you transfer to/from the GPU.

Q69 (16:48) Do you have any plans to implement Clij or clesperanto to process single molecule localization microscopy images/maps and do cluster analysis?

At the moment not. The NanoJ-Library (paper) supports processing data of that kind and is GPU-accelerated using OpenCL already.

Q70 (16:49) Thank you. Very inspiring talk. We are interested in handling of large data sets (tilescans, timelapses and alike) and parallel batch image processing via GPUs/CPUs

Awesome. Let us know how it goes!

Q71 (16:51) Is there a cheat sheet to assess if the workflow would benefit from GPU acceleration?

The presentation slide 12 may serve as such.

Q72 (16:52) Can CLIJ2 take imglib2 Views as an input?

Yes, you can push/pull imglib2 RandomAccessibleIntervals. There is an example in Java/CLIJ available online

Q73 (16:55) Any suggested homework for the graph-based analysis?

Yes, check out the tutorials in chapter “Working with matrices and graphs”.

Q74 (16:56) Just to make sure I got it right - can I use clij with jython scripting in fiji ?

Yes. For now the examples only cover CLIJ. CLIJ2 examples will follow. Furthermore, CLIJ2 supports auto-completion for CLIJ2 commands in Fiji’s script editor.

Q76 (16:59) Is there an implementation of image FFT available in CLIJ or CLIJ2? If not, do you plan to have it implemented?

See this thread on the image.sc forum. Brian Northan @bnorthan is making the FFT / deconvolution using CLIJ and ImageJ-ops possible.

Q77 (16:59) Is there a clij2 function for image registration? say to counter x,y drift over time?

CLIJx has some methods available such as translationRegistration, translationTimelapseRegistration and deformableRegistration2D. Take care, these are experimental functions, not fully tested.

Q78 (17:02) How do you run another FIJI plugin in CLIJ (i.e. TANGO)

This is unfortunately not possible. However, all plugins that are compatible with ImageJ/Fiji and are accessible via ImageJ macro can be combined with CLIJ in macros.

Hardware compatibility and cloud computing

Q6 (15:45) Hi, is it also useful to run CLIJ if you only have Intel HD graphics 5500 in a (high-end) laptop? Is the speed gain enough to warrant the effort?

Yes it is. Don’t expect a huge speedup though. The integrated Intel HD 5500 graphics came to market in 2015. Recent integrated GPUs have increased performance. Actually, if your computer has 64 GB of memory, the integrated GPU has access to 32 GB of memory. That’s much more than on dedicated graphics cards. Thus, there are also advantages when using integrated GPUs.

Q7 (15:46) AMD or Nvidia ? Is there a difference with CLIJ?

Usage-wise there are no differences. Performance differs between models and vendors. When comparing resulting pixels, we observed tiny differences between Intel and NVidia, presumably related to numeric rounding errors. Details are reported in the supplemental material of the CLIJ paper.

Q8 (15:46) Can we use GPU from cloud ?

Erick Martins Ratamero @erickratamero investigated this in more depth: he ran CLIJ in the cloud using a VM instance on Google Cloud. You get full admin access to the VM, so it was just a matter of installing drivers. There is also an example implementation using the Zeiss Apeer Cloud and Java to access CLIJ in the cloud.

Q11 (15:46) So you cannot use any integrated GPU and only GPGPUs

Integrated GPUs from Intel (HD, Iris) and AMD (Vega, RX) work nicely with CLIJ.

Q13 (15:47) What about sending workflows to cloud based GPU resources? Is this possible with CLIJ?

There is an example implementation using the Zeiss Apeer Cloud and Java to access CLIJ.

Q14 (15:47) How well is NVidia supported?

NVidia graphics cards which support OpenCL (virtually all GPUs do) also support CLIJ.

Q15 (15:48) How will dual Xeon Gold 6254 CPUs perform? The Xeon Silver 4110 is 4C and on the budget side.

Check out the Intel website to find out prices of Intel CPUs.

Q16 (15:51) Sorry but on a similar note: would this be a decent card: Intel HD Graphics 530 1536 MB?

Yes, that GPU was tested in a MacBook Air and works with CLIJ.

Working with big image data

Q10 (15:46) Why can’t one break the image into pieces (hdf5) and do it piece by piece?

That appears feasible. Please note that HDF5 is a file format. CLIJ doesn’t support any specific file formats. It just processes images. File handling is done by other libraries and plugins.

Q12 (15:47) A way of working around the Memory size limit should be to cut the data into small enough parts (which are in rough accordance with the memory rule of thumb), correct?
Q42 (16:21) Does CLIJ have functions to split datasets into small blocks? It seems it has to be done prior to pushing the data, but this also seems an operation that could benefit from GPU acceleration?
Q34 (16:13) Can you process an image in blocks if you don’t have enough memory?

Yes, that’s the way to go. Check out the crop and paste example. Furthermore, CLIJx supports pushing and pulling tiles. Take care, these are experimental functions, not fully tested.
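For planning such a block-wise run, the tile coordinates can be computed up front. The following plain-Python sketch only generates tile positions and sizes; the actual cropping and pasting would be done with CLIJ. The function names and the image and tile sizes are hypothetical, and the tiles do not overlap, so filters that look at neighboring pixels would need an extra margin that is not handled here.

```python
def tiles_1d(length, tile_size):
    """Split [0, length) into consecutive (start, end) ranges of at most tile_size."""
    return [(start, min(start + tile_size, length))
            for start in range(0, length, tile_size)]

def tiles_2d(width, height, tile_size):
    """Return (x, y, tile_width, tile_height) for every tile, row by row."""
    return [(x0, y0, x1 - x0, y1 - y0)
            for (y0, y1) in tiles_1d(height, tile_size)
            for (x0, x1) in tiles_1d(width, tile_size)]

tiles = tiles_2d(5000, 3000, 2048)
print(len(tiles))  # 6
print(tiles[-1])   # (4096, 2048, 904, 952)
```

Tiles at the right and bottom border are smaller than the nominal tile size, so the processing code must not assume a fixed tile shape.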

Q32 (16:10) Is there some hints about the memory consumption for the different functions? I mean it’s clear that I always need memory for input and output image. But more complex operation might use temporary memory which I have to take into account.

You can debug memory consumption by calling reportMemory before and after executing an operation to find out how much memory it consumes.

Q35 (16:13) How to handle 1 TB light sheet data set (xyzt)? Is it possible to break data set in the time domain with overlapping 2-3 time points, process individually, and then find common objects within the overlapping time points, and fuse all the output data together at the very end. What is the best way to do it with TB xyzt data sets?

You can process data time-point by time-point. Depending on size of the image stacks per time point, you can push several of those and combine them during your analysis.

Q37 (16:14) How would you handle large 3D tilescan, larger than 5000x5000x1000 in xyz?

Handling depends on the desired analysis. Thus, there is no general answer. The easiest accessible ways are downsampling and slice-by-slice processing.

Q43 (16:22) So, the partial push and pull should make it possible to process a dataset larger than the GPU memory (or 1/4th of it)?

Yes. Try pushCurrentZStack or pushCurrentSlice for example.

Q52 (16:33) GPUs usually have a limit on image sizes. What happens if my image contains 4k, 8k or 16k objects?
Q75 (16:56) If my image contains 16k objects, it will produce a distance matrix of 16k x 16k pixels. I know there are limits on image dimensions. So does it depend on the graphics card or on OpenCL?

You can process images in the GPU which are larger than images that can be handled by ImageJ.
Processing touch and distance matrices with 16k objects and more is absolutely feasible. However, you cannot push/pull these images. Thus, you need to treat the matrices exclusively in the GPU and only push/pull point lists and meshes. You may alternatively use crop and paste to push/pull such images tile-by-tile.
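A quick back-of-the-envelope calculation shows how fast such matrices grow. This plain-Python sketch assumes 32-bit values (4 bytes per matrix entry), which is an assumption made here for illustration; the function name is hypothetical.

```python
def square_matrix_bytes(num_objects, bytes_per_value=4):
    """Bytes needed for a num_objects x num_objects matrix (assumed 32-bit values)."""
    return num_objects * num_objects * bytes_per_value

MIB = 1024 ** 2
for n in (4096, 8192, 16384):
    print(n, square_matrix_bytes(n) / MIB, "MiB")
```

Under that assumption, doubling the number of objects quadruples the memory: 64 MiB, 256 MiB and 1024 MiB for the three object counts above.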

Other software and future perspectives

Q61 (16:37) pls pls pls explanation on macro recording for clicy!!!

You can find the video showing it online. It is supposed to work as in ImageJ but records Icy Javascript instead of ImageJ Macro. Feel free to ask specific questions in the thread below.

Q62 (16:39) Does the ICY recorder also work for the “normal” ICY?

The recorder in Icy was developed specifically for CLIJ2 and is at the moment limited to Clicy.

Q63 (16:43) What about clEsperanto in C or C++? Do you plan to make such connections?

We would love to make this happen for C/C++. The ITK developers are also very open to support clEsperanto in this regard. You can read more about the goals and aims in the preliminary clEsperanto roadmap. Get in touch with the clEsperanto team to learn more.

Q64 (16:43) So, will clEsperanto replace CLIJ in the future ?

Potentially, yes. CLIJ2 replaces CLIJ with a transition period of at least one year. Furthermore, they are implemented with the best possible backwards compatibility. CLIJ may stay in its current state after the one-year transition time. After that, it will no longer be maintained but will still be available. As soon as clEsperanto is released, the same will happen with CLIJ2. Read more about the release cycle online.

Q65 (16:44) Does ClEsperanto convert a language when it copy pastes or is it a new language?

As you can see in the video the used commands are identical between the languages. No translation is necessary.

Esperanto :wink:


Great webinar Robert !

Is there a way in clij to relate two label maps (eg one for cells and one for vesicles) to produce the relation between them - eg in order to assign each cell with the number of vesicles it includes, or to assign each vesicle with the label of the parent cell ?
or anything similar to 3D MereoTopology

thanks
Ofra


Hey @Ofra_Golani,

not yet. But it sounds very interesting and potentially an easy to add feature. What would be the desired output of that operation? Unfortunately the link above leads to a 404 page when it comes to the interesting details.

I could for example make it possible to generate a touch-matrix from two labelmaps… (easy to incorporate). But what would you like to do next with it? Counting how many labels in A overlap with a specific label in B? Or more involved stuff?

Thanks for the inspiration! It sounds really interesting!

Cheers,
Robert


Hi @haesleinhuepf

In one current 2D project I only need to count the number of B-Objects, within each A Object, and then threshold to keep only the A objects with more than N, B objects.
In Fiji macro I do it by running UltimatePoint on the B Objects label map, binarize it (and divide by 255). From RoiManager which include only the A objects, I measure the intensity of the Ultimate point image, and then use this for the keeping only A-objects with value > N.

In previous 3D project I used 3D ImageJ suite for counting B-Objects that are included or partially included in A-Objects.

I think that touch matrix can work for this. I suppose from there it is like counting neighbors, and you already have example for this, right ?
In general, maybe a non-binary touch matrix can work, with values that indicate the relation between A and B objects (as in MereoTopology; see the Spatial Reasoning paper).

Ofra


Hey @Ofra_Golani,

You are right. If you have the labelled spots image in slice1 and the labels in slice 2, one could achieve that. However, I think there is an easier way which also works in 3D:

Define test data: binary spot image and label map

run("Close All");

// Define test data
binary_array = newArray(
	0, 0, 0, 0, 0,
	1, 0, 1, 0, 1,
	0, 1, 0, 0, 0,
	0, 0, 0, 0, 1,
	0, 0, 0, 1, 0);

label_array = newArray(
	1, 1, 1, 2, 2,
	1, 1, 1, 2, 2,
	1, 1, 0, 2, 2,
	4, 4, 4, 3, 3,
	4, 4, 4, 3, 3);

width = 5;
height = 5;
depth = 1;

Initialize GPU and push example images to GPU memory

run("CLIJ2 Macro Extensions", "cl_device=");
Ext.CLIJ2_clear();

Ext.CLIJ2_pushArray(binary_image, binary_array, width, height, depth);
Ext.CLIJ2_pushArray(label_image, label_array, width, height, depth);

// Show example images
// (zoom() is a user-defined helper function that is not shown in this post)
Ext.CLIJ2_pull(binary_image);
setMinAndMax(0, 1);
zoom(100);

Ext.CLIJ2_pull(label_image);
setMinAndMax(0,4);
run("glasbey_on_dark");
zoom(100);


Perform statistics to count the number of spots per label

We now measure the mean intensity in the binary image per label and multiply it with the area to find out how many spots there are for each label.

run("Clear Results");
Ext.CLIJ2_statisticsOfBackgroundAndLabelledPixels(binary_image, label_image);
// push two columns to GPU
Ext.CLIJ2_pushResultsTableColumn(mean_intensity, "MEAN_INTENSITY");
Ext.CLIJ2_pushResultsTableColumn(pixel_count, "PIXEL_COUNT");
run("Clear Results");

// do math with two vectors
// spot_count = mean_intensity * pixel_count
Ext.CLIJ2_multiplyImages(mean_intensity, pixel_count, spot_count);
print("\Clear");

// print out intermediate result
print("Number of spots per label:");
Ext.CLIJ2_print(spot_count);

> Number of spots per label:
> 0.0 3.0 1.0 2.0 0.0
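The arithmetic behind this result can be checked without a GPU. The plain-Python sketch below uses the same two 5x5 arrays as the macro and computes, per label, the mean intensity of the binary image and the pixel count, then multiplies them. It is not CLIJ code; the function name is hypothetical and the dictionaries merely stand in for the two results table columns.

```python
binary = [0, 0, 0, 0, 0,
          1, 0, 1, 0, 1,
          0, 1, 0, 0, 0,
          0, 0, 0, 0, 1,
          0, 0, 0, 1, 0]

labels = [1, 1, 1, 2, 2,
          1, 1, 1, 2, 2,
          1, 1, 0, 2, 2,
          4, 4, 4, 3, 3,
          4, 4, 4, 3, 3]

def spots_per_label(binary, labels):
    """spot_count[label] = mean_intensity[label] * pixel_count[label]."""
    num_labels = max(labels) + 1
    pixel_count = [0] * num_labels
    intensity_sum = [0] * num_labels
    for value, label in zip(binary, labels):
        pixel_count[label] += 1
        intensity_sum[label] += value
    mean_intensity = [s / n for s, n in zip(intensity_sum, pixel_count)]
    return [m * n for m, n in zip(mean_intensity, pixel_count)]

spot_count = spots_per_label(binary, labels)
print([round(c) for c in spot_count])  # [0, 3, 1, 2, 0]
```

Index 0 is the background; labels 1 to 4 contain 3, 1, 2 and 0 spots, matching the output printed by the macro.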

Threshold a vector and make a binary vector

You can then use the methods greaterOrEqualConstant and excludeLabels to make a new filtered label map which only contains selected labels. You asked during the webinar for an exercise, so here you go. You can find the solution online, and the result might look like this: :wink:


Let me know how it goes! You’d be one of the first users of that vector math stuff. I :heart: feedback!

I will have a close look at this in the meantime :wink:

Thanks for the link!

Cheers,
Robert


FWIW, the source code for the 3D ImageJ Suite is here:

Nice talk, I will certainly test the macros for splitting my tomography data to smaller tiles/chunks. But on the other side more memory is always better :grimacing:

Has anyone used the Nvidia Tesla K80 cards with 24GB memory? They have 2 GPUs, but is the memory split between the GPUs? Can CLIJ2 utilize the full 24GB for a single operation (e.g. on a tomography data set of 6-8 GB), or is there a maximum of 12GB for a CLIJ operation?


Hi @Jurgen_Gluch,

thanks for the flowers! I haven’t tested the Tesla K80 yet. Does the cl_device pulldown in any dialog list two of them?

Furthermore, you could try to allocate more and more images and see when it crashes. :wink: For that, you may want to use the method reportMemory as demonstrated in the Basics tutorial. In general, every GPU only has access to its own memory. Furthermore, from ImageJ Macro you cannot access multiple GPUs. In order to work in parallel with several GPUs, you need to use an object-oriented programming language such as Jython, Groovy, JavaScript or Java. There are code examples available for Jython and Java.

If you want to transfer data from one GPU to another, there is a shortcut which is a bit faster than push/pull. You can call (Jython example):

from net.haesleinhuepf.clij2 import CLIJ2;

source_clij2 = CLIJ2(0); # first GPU device
target_clij2 = CLIJ2(1); # second GPU device

image1_on_GPU1 = source_clij2.create(1024, 1024, 100);

image2_on_GPU2 = target_clij2.transfer(image1_on_GPU1);

or (theoretically even faster):

image1_on_GPU1 = source_clij2.create(1024, 1024, 100);
image2_on_GPU2 = target_clij2.create(1024, 1024, 100);

target_clij2.transferTo(image1_on_GPU1, image2_on_GPU2);

If you try that out, I’d be happy to learn how the performance is on your system!

Let me know how it goes :slight_smile:

Cheers,
Robert

Hi Robert,

I have another question. It seems that the onboard GPU shares the main memory with the CPU, thus I can address more memory for images. But it’s somehow limited in size per single image.

On my machine (32GB RAM) I have 3 choices for “cl_device”:

  1. Intel® HD Graphics 630
  2. Intel® Core™ i7-7700 CPU @ 3.60GHz
  3. GeForce GTX 1060 3GB

this macro gives weird results:

run("CLIJ2 Macro Extensions");
gbyte = 3;
input = "image";
Ext.CLIJ2_clear();
Ext.CLIJ2_create3D(input, 1024, 1024, gbyte*1024, 8);
Ext.CLIJ2_set(input, 127);
Ext.CLIJ2_pull(input);
Ext.CLIJ2_reportMemory();
Ext.CLIJ2_clear();

Results per device:

  1. returns the 3GB stack, but from slice 2049 to the end the value is “0”, no error
    for “gbyte = 4;” I get an error message (CL_OUT_OF_HOST_MEMORY)
    for up to “gbyte = 2;” it works as expected
  2. crashes Fiji (silently closing all windows)
    for up to “gbyte = 2;” it works as expected
  3. throws the expected error “CL_MEM_OBJECT_ALLOCATION_FAILURE”
    for up to “gbyte = 2;” it works as expected

On the other side I can load more images in the “Intel® HD Graphics 630” and process them as long as the images are max. 2GB and their sum is smaller than the shared GPU memory (15.9GB):

run("CLIJ2 Macro Extensions");
gbyte = 2;
input1 = "image1";
newImage(input1, "8-bit black", 1024, 1024, gbyte*1024);
input2 = "image2";
newImage(input2, "8-bit white", 1024, 1024, gbyte*1024);
input3 = "image3";
newImage(input3, "8-bit white", 1024, 1024, gbyte*1024);
input4 = "image4";
newImage(input4, "8-bit white", 1024, 1024, gbyte*1024);
Ext.CLIJ2_clear();
Ext.CLIJ2_push(input1);
Ext.CLIJ2_push(input2);
Ext.CLIJ2_push(input3);
Ext.CLIJ2_push(input4);
Ext.CLIJ2_multiplyImageAndScalar(input2, input3, 0.5);
Ext.CLIJ2_addImages(input1, input3, input4);
close("*");
Ext.CLIJ2_pull(input4);
Ext.CLIJ2_clear();

For 3GB per image I get “255” value from slice 2049, and an error message from 4GB.
Could this be an internal buffer of the GPU that limits a single image to 2GB ?
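As a side note on the numbers: with 8-bit slices of 1024 x 1024 pixels, slice 2049 is exactly where the byte offset reaches 2^31, one past the largest value a signed 32-bit integer can hold. The plain-Python sketch below only does this arithmetic; it does not establish what actually causes the behavior on this hardware, and the function name is hypothetical.

```python
SLICE_BYTES = 1024 * 1024  # one 8-bit slice of 1024 x 1024 pixels
INT32_MAX = 2 ** 31 - 1    # largest signed 32-bit integer

def first_slice_at_or_beyond(limit_bytes, slice_bytes=SLICE_BYTES):
    """1-based index of the first slice whose data starts at or beyond limit_bytes."""
    return limit_bytes // slice_bytes + 1

stack_3gb = 3 * 1024 * SLICE_BYTES  # the "gbyte = 3" stack from the macro

print(first_slice_at_or_beyond(2 ** 31))  # 2049
print(stack_3gb > INT32_MAX)              # True
```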

PS: I am still looking for an opportunity to test CLIJ on a Tesla K80 card …


Hey @Jurgen_Gluch,

interesting observation. I never hit the issue that this is hardware dependent. But obviously it is. Actually, you can ask CLInfo how large images are supposed to be:

run("CLIJ2 Macro Extensions", "cl_device=");
Ext.CLIJ2_clInfo();


I would call this a hardware limitation.

Can you partition your use case in smaller blocks?

On which hardware did this happen? I just tested on my Intel HD 620 and cannot reproduce it. All pixels have value 127.

Cheers,
Robert


Good morning.

I ran

run("CLIJ2 Macro Extensions", "cl_device=");
Ext.CLIJ2_clInfo();

This is the shortened output (I added value in GB for easier reading):

  [0] Intel(R) OpenCL
     [0] Intel(R) HD Graphics 630 
        GlobalMemorySizeInBytes: 13693231104 = 12.75 GB
        LocalMemorySizeInBytes: 65536 
        MaxMemoryAllocationSizeInBytes: 4294959104 ~ 4 GB
     [1] Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 
        GlobalMemorySizeInBytes: 34233081856 
        LocalMemorySizeInBytes: 32768 
        MaxMemoryAllocationSizeInBytes: 8558270464 ~ 8 GB
        MaxWorkGroupSize: 8192 
  [1] NVIDIA CUDA
     [0] GeForce GTX 1060 3GB 
        GlobalMemorySizeInBytes: 3221225472 = 3 GB
        LocalMemorySizeInBytes: 49152 
        MaxMemoryAllocationSizeInBytes: 805306368 = 768 MB
        MaxWorkGroupSize: 1024 
Best GPU device for images: GeForce GTX 1060 3GB
Best largest GPU device: GeForce GTX 1060 3GB
Best CPU device: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

The bad thing is that the Intel® HD Graphics 630 does not give any error for images between >2 and <4 GB. Its MaxMemoryAllocationSizeInBytes is 4GB. My monitors are connected to the NVidia card and my OS is Win10.

I can probably split the tomography data into smaller chunks and process it. No problem. So for my work it’s OK :+1:, but I have to keep in mind that I get corrupted results for images >2GB. The advantage of the Intel® HD Graphics 630 is that I can have up to 6 images of 2GB each in RAM and calculate with them. The GeForce is faster, but can only hold one image with 1.9GB at a time …

Basically I test your cool stuff and try to implement it in my usual workflows. It’s fun :heart_eyes: and sometimes too fast to go to the kitchen and grab a new coffee :rofl: .

I’ll check if I can find an HD 630 with a similar issue. Furthermore, that’s a good hint. I could actually build in a warning in case the user pushes/creates images larger than allowed. Good point. I’ll do that.

I know what you mean :rofl: In fact GPU-acceleration doesn’t enable us to get results faster. It just puts more work on the human’s shoulders because we can’t keep up with the computer anymore :slight_smile:

Thanks for the feedback. Very constructive. A pleasure working with you! :star_struck:

Cheers,
Robert

Just a follow up. The upcoming CLIJ2 BETA warns when you try to open images with size exceeding limitations. This should help understanding the following crash:

Thanks again for the suggestion!

Cheers,
Robert