How to write a fast GEMM with OpenCL

Is anyone here familiar with OpenCL? How can I write a fast GEMM (matrix multiplication)? I found that NumPy is very fast, but I have not found an OpenCL implementation that is fast enough.

Hi @yxdragon,

yes, some of us are working on libraries using opencl. :wink:

In the Python universe there is pyopencl, which lets you do fast matrix multiplications, for example using the reikna library. You may also want to take a look at gputools by @mweigert.

In the ImageJ universe, there is #clij, and the upcoming clij2 has a method for matrix multiplication :wink:

For learning OpenCL in the context of gemm, I recommend this resource:


@haesleinhuepf Thanks for the answer.

In fact, I have read a lot of documentation, including


and many other tutorials (and I think I am a fast learner). I have even tested them on different devices and operating systems (Intel GPU, NVIDIA; Linux, Windows, Mac), but I never got good performance.

I followed the tutorial step by step. The naive version, gemm1, is 10 times slower than NumPy. What is more, gemm2 is 2 times slower than the naive gemm1, and gemm3 is 30 times slower than gemm2 :joy:
(I use pyopencl, and I trust that the results in cnugteren’s tutorial are real, so I do not know whether the problem is pyopencl or whether OpenCL is simply not a good fit for modern GPUs.) Now I am very confused. Can you give me some demo code and do a comparison with NumPy?

(I am writing a CNN framework based only on NumPy; it is already faster than torch on CPU, and it is easy to install.) To get better performance, and to also support non-NVIDIA GPUs, I want to reimplement the layers in OpenCL. But I have already spent several days on the dot product. Could you help me and give it a try?


Wow, this looks awesome!

I thought you have that already? I can take a look at your code if you point me at it :wink:

Just my two cents, you may know already because you read a lot:

  • When working with GPUs, you should try to minimize data transfer between RAM and GPU-RAM. If a program on the CPU takes one second and on the GPU is a thousand times faster, you won’t measure that if the data transfer takes two seconds :wink: In clij, the commands for that transfer are called push and pull. And you should minimize their usage.
  • If the amount of memory you are processing is very small, it fits in the cache of the CPU. In this case the CPU will outperform the GPU, because cache can be very fast.
  • When you do timing measurements, do them several times. The first execution on a GPU (in Java on the CPU as well) is slower because of the just-in-time compilation of code.
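The last hint can be sketched as a small timing helper. This is only a minimal illustration using the Python standard library: the `benchmark` function, its `repeats` and `warmup` parameters, and the `workload` stand-in are hypothetical names, not part of clij or pyopencl.

```python
import statistics
import time


def benchmark(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() in milliseconds.

    The first `warmup` calls are executed but not timed, so one-off costs
    such as just-in-time kernel compilation do not distort the result.
    """
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings_ms)


# Hypothetical workload standing in for a matrix multiplication.
def workload():
    return sum(i * i for i in range(10_000))


median_ms = benchmark(workload)
print(f"median time: {median_ms:.3f} ms")
```

When timing GPU code this way, `fn` should probably also wait for the result (for example by pulling it back), because GPU calls are often asynchronous and may return before the computation has finished.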

I put some more hints in the FAQ of #clij:

Does this help?


dot_step.txt (4.9 KB)
Here is my code. I followed the tutorial, but the results get slower and slower.

I have also tested many other implementations; none of them is better than the naive one.

Just a quick question about your code: What’s the difference between “opencl 1 time”, “opencl 2 time” and “opencl 3 time”? It looks like you are comparing different operations on GPU and CPU with each other…

One more question. I’m trying to run your code and get an error message:

pyopencl._cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE

Build on <pyopencl.Device 'GeForce RTX 2060 SUPER' on 'NVIDIA CUDA' at 0x1fe8c367780>:

<kernel>:12:27: error: variable length arrays are not supported in OpenCL
    __local float Asub[TS][TS];

What hardware are you running on? This could be a hint towards lack of performance…

I followed this tutorial, which teaches you to write a naive version and then upgrade it step by step to make it faster.
“1 time”, “2 time” and “3 time” are the naive gemm1 and the upgraded gemm2 and gemm3 (there are 8 upgrade steps, but I did not go on because the results kept getting slower…)


Since it comes from C code, I translated the #define into an int. I think you can replace TS with 16.
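The build error above most likely comes from exactly that translation: OpenCL C needs a compile-time constant as the array size, and an `int` variable does not qualify. One way to keep the tutorial's `#define` is to prepend it to the kernel source from Python before building. The sketch below only does the string handling; `KERNEL_BODY`, the `tile_copy` kernel and `with_defines` are hypothetical illustrations, not code from dot_step.txt or the tutorial.

```python
# Hypothetical kernel fragment that uses TS as a local-memory tile size.
KERNEL_BODY = """
__kernel void tile_copy(__global const float* A, __global float* B) {
    // Local-memory tile; TS must be a compile-time constant here.
    __local float Asub[TS][TS];
    const int row = get_local_id(0);
    const int col = get_local_id(1);
    Asub[col][row] = A[col * TS + row];
    barrier(CLK_LOCAL_MEM_FENCE);
    B[col * TS + row] = Asub[col][row];
}
"""


def with_defines(source, **constants):
    """Prepend one '#define NAME value' line per keyword argument."""
    header = "".join(
        f"#define {name} {value}\n" for name, value in constants.items()
    )
    return header + source


kernel_source = with_defines(KERNEL_BODY, TS=16)
print(kernel_source.splitlines()[0])  # prints "#define TS 16"
```

The resulting string can then be handed to pyopencl as usual. Passing `-D TS=16` as a build option should have the same effect.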

I just wrote a little example using clesperanto which is built on top of gputools and pyopencl:

It measures the time for the data transfer:

# push data to GPU
start = time()
gpu_a = push(a_np)
gpu_b = push(b_np)

# allocate memory for result on GPU
gpu_c = create((1024, 1024))
print('push+alloc time', (time() - start) * 1000)

Then, it executes the operation twice on GPU:

# multiply matrix on GPU
start = time()
multiplyMatrix(gpu_a, gpu_b, gpu_c)
print('cle1 time', (time() - start) * 1000)

start = time()
multiplyMatrix(gpu_a, gpu_b, gpu_c)
print('cle2 time', (time() - start) * 1000)

and twice on the CPU:

# multiply matrix on CPU
start = time()
np.dot(a_np, b_np)
print('np1 time', (time() - start) * 1000)

start = time()
np.dot(a_np, b_np)
print('np2 time', (time() - start) * 1000)

The output (times in milliseconds) is:

push+alloc time 12.990236282348633
cle1 time 31.94737434387207
cle2 time 4.979372024536133
np1 time 16.96324348449707
np2 time 14.928102493286133

As you can see, the GPU/OpenCL implementation becomes faster in the second run. Furthermore, pushing the arrays to the GPU already takes about as much time as the operation on the CPU. I conclude that you can only exploit this efficiently if you reduce push/pull commands and implement the whole program in OpenCL, not just a matrix multiplication. Note that my matrix multiplication in OpenCL is a naive implementation without any fancy optimization. Optimizing the OpenCL kernel code cannot improve push or allocation time :wink:
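To make that trade-off concrete, here is a rough back-of-the-envelope calculation with the timings printed above (rounded to two decimals). It assumes that the data is pushed once and then reused for every multiplication, and it ignores the pull time, which was not measured. The function names are made up for this sketch.

```python
# Timings in milliseconds, rounded from the benchmark output above.
PUSH_ALLOC_MS = 12.99   # one-time push + allocation
GPU_FIRST_MS = 31.95    # first GPU run (includes kernel compilation)
GPU_STEADY_MS = 4.98    # subsequent GPU runs
CPU_MS = 14.93          # one NumPy multiplication (second run)


def gpu_total_ms(n):
    """Total GPU time for n multiplications on data pushed once."""
    return PUSH_ALLOC_MS + GPU_FIRST_MS + (n - 1) * GPU_STEADY_MS


def cpu_total_ms(n):
    """Total CPU time for n multiplications."""
    return n * CPU_MS


def break_even(max_n=1000):
    """Smallest n for which the GPU is faster overall, or None."""
    for n in range(1, max_n + 1):
        if gpu_total_ms(n) < cpu_total_ms(n):
            return n
    return None


print(break_even())  # prints 5
```

Under these assumptions the GPU only pays off from the fifth multiplication on the same data, which is why keeping the whole pipeline on the GPU matters more than tuning a single kernel.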

Very exciting! In fact, I just want a fast GEMM (I have already implemented the other CNN layers). I will give it a try. Thanks very much; I may have some more questions later.


Have you tried CLBlast?

When running a benchmark similar to the one by @haesleinhuepf with a 4096^2 matrix and pyclblast.gemm, I get the following:

numpy       253.47 ms
cle1        197.35 ms
cle2        174.06 ms
pyclblast    17.96 ms




Thanks @haesleinhuepf, @mweigert. I found that the problem is my GPU. It seems that OpenCL only performs well on NVIDIA and AMD GPUs; on integrated (non-discrete) GPUs it performs poorly.

Now I am targeting ClPy and CuPy instead. I upgraded my CNN framework into Planer, which is written against a pure NumPy-like interface and can switch its backend to CuPy or ClPy. Now it is fast enough: it performs on par with torch on GPU on Windows (but torch is faster on Linux).
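For readers wondering what such a backend switch can look like, here is a minimal sketch. It is not Planer's actual mechanism; `load_backend` and `dense` are hypothetical names. It assumes only that the backends are importable as `cupy`, `clpy` and `numpy` and that they expose the same NumPy-style functions.

```python
import importlib


def load_backend(candidates=("cupy", "clpy", "numpy")):
    """Return the first module from `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # backend not installed, try the next one
    raise ImportError(f"none of the backends {candidates!r} could be imported")


def dense(xp, x, weights, bias):
    """Fully connected layer written only against the NumPy-style API of xp."""
    return xp.dot(x, weights) + bias
```

Layer code then receives the array module as `xp` (for example `xp = load_backend()`) and never imports NumPy, CuPy or ClPy directly, so the same `dense` function runs on whichever backend was found.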