OpenCL/GPU-based image processing in ImageJ macro

fiji
imagej
macro
gpu

#1

Dear friends of GPU-based image processing,
dear early adopters,

I recently put some effort into making GPU-based image processing run in ImageJ macro. It has the potential to save us massive amounts of processing time. GPU programming in macro looks like this:
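(A sketch of such a macro; the exact, runnable example is in the blog post mentioned below, and the command names here may differ slightly from it.)

    // load an example image and remember its title
    run("T1 Head (2.4M, 16-bits)");
    input = getTitle();

    // create an empty output image of the same size and type
    getDimensions(width, height, channels, slices, frames);
    newImage("blurred", "16-bit black", width, height, slices);
    blurred = "blurred";

    // initialise the GPU via the macro extensions
    run("CLIJ Macro Extensions", "cl_device=");

    // push input and output images to GPU memory
    Ext.CLIJ_push(input);
    Ext.CLIJ_push(blurred);

    // apply a 3D mean filter (3/3/3) on the GPU
    Ext.CLIJ_mean3d(input, blurred, 3, 3, 3);

    // pull the result back into an ImageJ window
    Ext.CLIJ_pull(blurred);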

On the one hand, it is still experimental and some way from release. On the other hand, I’m happy to announce that it works on recent Intel, AMD and NVidia GPUs. I put everything on an ImageJ update site and wrote a blog post about how it can be used. The experimentalists among you may want to give it a try. Feedback is highly appreciated.

Cheers,
Robert


#2

Hello Robert -

Nice! This looks very promising.

I took a look at your blog post and wanted to ask about one oddity
I noticed.

Where you give timing results, both in your first (cold) run and
in your second (warmed-up) run you show pushing two images
taking less time than pulling one image. I know these sorts of
timings can have some slop, so maybe this is just noise, but
would one expect this kind of timing asymmetry between writing
to and reading from the gpu?

Thanks, mm


#3

Hey @mountain_man,

this is a very interesting question! I just dug into the code and found that the forth and back conversions are both simple while() loops in Java. Even so, they are different.

So I will dig a bit deeper and see if one can squeeze out another 100 milliseconds. But really, thanks a lot for pointing this out. Feedback like this is very much appreciated!


#4

Hello Robert -

Thanks for the links to the relevant code.

I won’t pretend to understand your code (nor to be any good at
tuning java), but I do see one noteworthy difference between
your “forth” and “back” code.

When you are retrieving from the gpu (the “back” code), you
have your if-else tests on lInputType inside of the main
while (lCursor.hasNext()) loop.

This is the kind of thing that one might imagine the java compiler being
able to optimize away, but if it doesn’t, it could help explain the extra
cost of the “back” code. At the cost of a little code verbosity, you could
try having three individual while (lCursor.hasNext()) loops, each
inside of the appropriate if-else clause. (The value of lInputType is
a loop invariant, so this shouldn’t break anything.)

(Again, the compiler might be doing this for you already.)

(As an aside, you might consider declaring lInputType as final,
although it is unlikely that this would be telling the compiler anything
it hadn’t figured out already, so doing so would likely just be stylistic.)

Thanks, mm


#5

Hey @haesleinhuepf,

Great stuff!
Tested it on an Intel® Core™ i7-6700 CPU @ 3.4 GHz with an NVidia GeForce GTX 1060 (3 GB), OpenCL 2.0.4, Win 10, 64-bit.

I used your example macro and exchanged the filter for Ext.CLIJ_median3d(input, blurred, 5, 5, 2);

Time needed for a single filter event:
GPU: ~4s
CPU: ~8s

Bigger kernel sizes were not supported.
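
The numbers below come from a timing loop along these lines (my sketch of the example macro; getTime() returns milliseconds, so the “took” values are in msec):

    // repeat the filter under test ten times and print each duration
    for (i = 1; i <= 10; i++) {
        start = getTime();
        Ext.CLIJ_median3d(input, blurred, 5, 5, 2);  // the filter under test
        print("GPU filter no " + i + " took " + (getTime() - start));
    }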

Testing your 3D mean filter example (kernel = 3/3/3) gave the following results:
10x normal 3D mean filter from ImageJ
CPU mean filter no 1 took 1867
CPU mean filter no 2 took 1796
CPU mean filter no 3 took 1781
CPU mean filter no 4 took 1812
CPU mean filter no 5 took 1756
CPU mean filter no 6 took 1781
CPU mean filter no 7 took 1797
CPU mean filter no 8 took 1765
CPU mean filter no 9 took 1765
CPU mean filter no 10 took 1921

Running 10x the 3D mean filter via GPU
Pushing two images to the GPU took 109 msec
GPU mean filter no 1 took 68
GPU mean filter no 2 took 53
GPU mean filter no 3 took 56
GPU mean filter no 4 took 52
GPU mean filter no 5 took 54
GPU mean filter no 6 took 54
GPU mean filter no 7 took 53
GPU mean filter no 8 took 52
GPU mean filter no 9 took 53
GPU mean filter no 10 took 53
Pulling one image from the GPU took 210 msec

The 3D median (kernel = 3/3/3) on the CPU (the log label below still says “mean filter”; only the filter call was exchanged):
CPU mean filter no 1 took 3285
CPU mean filter no 2 took 3281
CPU mean filter no 3 took 3265
CPU mean filter no 4 took 3199
CPU mean filter no 5 took 3161
CPU mean filter no 6 took 3218
CPU mean filter no 7 took 3354
CPU mean filter no 8 took 3296
CPU mean filter no 9 took 3109
CPU mean filter no 10 took 3155

The 3D median (kernel = 3/3/3) on the GPU:
Pushing two images to the GPU took 63 msec
GPU mean filter no 1 took 711
GPU mean filter no 2 took 662
GPU mean filter no 3 took 645
GPU mean filter no 4 took 647
GPU mean filter no 5 took 649
GPU mean filter no 6 took 647
GPU mean filter no 7 took 645
GPU mean filter no 8 took 643
GPU mean filter no 9 took 650
GPU mean filter no 10 took 649
Pulling one image from the GPU took 125 msec

Sometimes the filters crashed, with various different error messages. When I have more time, I will test this again and report back.


#6

Hey @mountain_man,

Indeed, I guess the compiler takes care of that. I just tried it, as you can see, and did not observe a significant speedup, at least with some quick benchmarking. Nevertheless, thanks for the suggestion! I highly appreciate feedback of this kind!

Cheers,
Robert


#7

Hey @biovoxxel,

for a single filter, a speedup of 50% already sounds great! However, you can increase the benefit of using OpenCL by running whole pipelines on the GPU. The reason is that copying the data to and from the GPU takes time. It really pays off if you run a pipeline of 4-5 operations in a row, for example as in backgroundSubtraction.ijm.
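
Schematically, such a pipeline looks like this (a sketch only; apart from push and pull, the command names here are illustrative placeholders, the real ones are in the linked macro):

    // pay the transfer cost once: push the input image to the GPU
    Ext.CLIJ_push(input);

    // chain several operations entirely in GPU memory; the intermediate
    // images ("background", "subtracted") never leave the GPU
    Ext.CLIJ_mean3d(input, "background", 10, 10, 10);
    Ext.CLIJ_subtract3d(input, "background", "subtracted");
    Ext.CLIJ_threshold3d("subtracted", "mask", 128);

    // pay the transfer cost once more: pull only the final result
    Ext.CLIJ_pull("mask");

This way, the transfer cost is paid once per pipeline instead of once per filter.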

Thanks again for taking the time to test the library! If you wouldn’t mind sending me the “various error messages”, I can take a look at them.

Cheers,
Robert


#8

Hello Robert -

Yes, it’s certainly believable that the compiler makes this optimization.

I was poking around on the internet. (Yes, I know, I know …) Anyway,
I didn’t see anything definitive, but there does seem to be some chatter
that reading from the gpu can be (but it depends …) slower than writing
to it. See, for example Data Transfer Matters for GPU Computing,
especially Fig. 1. There’s also this stackoverflow discussion,
CUDA host to device transfer faster than device to host transfer.

I don’t know to what extent it would be worth trying to track
this down further, but it might be informative to time just the calls to

    pClearCLImage.copyTo(buffer, true);
    buffer.writeTo(contOut, true);

and

    lClearClImage.readFrom(inputArray, true);

One could also dig down into the ClearCL code, and time the
analogous calls into OpenCL and see if the asymmetry persists
there.

Thanks, mm


#9

Cool stuff, thank you! I started testing it with my images on a fresh download of Fiji, but soon ran into problems. The blur3D often fails on my 512x512x512 image cube. It seemed random, so I started testing with the “T1 Head” example converted to 32-bit. It runs fine for a while, but starts to fail at the 13th run on the GeForce card:

java.lang.RuntimeException: clearcl.exceptions.OpenCLException: OpenCL error: -36 -> CL_INVALID_COMMAND_QUEUE
java.lang.RuntimeException: clearcl.exceptions.OpenCLException: OpenCL error: -5 -> CL_OUT_OF_RESOURCES
clearcl.exceptions.OpenCLException: OpenCL error: -4 -> CL_MEM_OBJECT_ALLOCATION_FAILURE

If I exit and restart Fiji, it works again, so I suspect a memory-clearing bug; see also the screenshot from the Task Manager.

My machine (Win 10, 64-bit, 32 GB RAM) has three OpenCL devices:

  • GeForce GTX 1060 3GB (dedicated GPU memory 3 GB, 1.5 s per loop): failed at the 13th loop
  • Intel® Core™ i7-7700 CPU @ 3.60GHz (GPU memory 15.9 GB, 7.5 s per loop): failed at the 44th loop
  • Intel® HD Graphics 630 (GPU memory 15.9 GB, 4.4 s per loop): failed at the 44th loop

My use case will be X-ray tomography data (typically 512³ to 2048³ voxels).


#10

Hi @mountain_man,

I’m on it. Follow the commits in this branch.

A speedup of about 50% in the transfer from CPU to GPU (and back) appears to be possible. I’ll keep you posted as soon as there is an updated version on the update site. Likely soon!


#11

Hey @Jurgen_Gluch,

thanks for the feedback! I just fixed a memory leak an hour ago! I will check for more like this and let you know as soon as there is an updated version available for testing.

Again, feedback like this is very much appreciated! Thanks a lot!

Cheers,
Robert


#12

@haesleinhuepf, Kudos (or should I say CUDAs?) to you for putting effort into this! Greatly appreciated and needed. I often moan about the fact that so many fundamental functions in ImageJ aren’t even multithreaded (e.g. try running a histogram on a 25 GB stack and watch the CPU usage)… but this is a whole new level! Will give it a go ASAP…


#13

Thanks for the Kudos! (We also run on non-NVidia hardware; that’s why “CUDAs” would be inappropriate :wink:)

Wait a few days… the next pre-alpha release (in about 3-4 days) will be worth waiting for :sunglasses:


#14

Looking forward to it. In the meantime, I’ll be searching for a GPU card with ~50 GB of RAM…
:wink:


#15

Hey @Gaby, @Jurgen_Gluch, @mountain_man and @biovoxxel,

thanks again for testing! I just put a new version on the update site. It contains:

  • transfer from/to the GPU should be about 20% faster now
  • if you use filters where input and output images have the same size and type, the output image no longer needs to be pushed to the GPU before processing; it is created there automatically. Just pull the output image afterwards (see the sketch after this list). View the changes to an example macro here.
  • under-the-hood bugfixes, such as a memory leak and type-conversion issues (thresholding an 8-bit image led to weird results, for example)
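
Regarding the second point, the difference looks roughly like this (a sketch with the same tentative command names as earlier in the thread):

    // before: the output image had to be created in ImageJ and pushed, too
    // Ext.CLIJ_push(input);
    // Ext.CLIJ_push(blurred);

    // now: push only the input; the output image "blurred" is created
    // in GPU memory automatically
    Ext.CLIJ_push(input);
    Ext.CLIJ_mean3d(input, "blurred", 3, 3, 3);

    // pull the result once the processing is done
    Ext.CLIJ_pull("blurred");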

I just tested it on an Intel UHD 620, an NVidia TITAN Xp and an AMD Ryzen with Vega 3 graphics, and it works. In case there is new trouble with other GPUs, please let me know. Again, your testing efforts are highly appreciated, as this project lives on them!

Unless there are critical issues which prevent you from playing with the tools, I would step back for a moment and wait for

  • more feedback, e.g. on applicability, API convenience and GPU compatibility, and
  • a wish list of functionality you would like to run on the GPU which is not there yet. I cannot make any promises, but I’d like to know what you think would be good to have.

Furthermore, the push/pull code looks a bit ugly under the hood and I need to refactor it before extending it further. Expect an alpha release around Christmas.

Thanks again everyone for testing! I really appreciate your efforts!

Cheers,
Robert