Is anyone familiar with OpenCL? How can I write a fast GEMM (matrix multiply)? I found that np.dot is very fast, and I have not found an OpenCL implementation that is fast enough.
Yes, some of us are working on libraries using OpenCL.
For learning OpenCL in the context of gemm, I recommend this resource:
@haesleinhuepf Thanks for the answer.
In fact I have read a lot of documentation, including
and many other tutorials (and I think I have a very good learning ability). I have even tested them on different devices and operating systems (Intel GPU, NVIDIA; on Linux, Windows, Mac), but I did not get good performance.
I followed https://cnugteren.github.io/tutorial/pages/page2.html step by step. The naive version, gemm1, is 10 times slower than np.dot. What's more, gemm2 is 2 times slower than the naive gemm1, and gemm3 is 30 times slower than gemm2.
(I use pyopencl, and I believe cnugteren's tutorial is genuine, so I don't know whether this is a problem with pyopencl, or whether OpenCL is not a good fit for modern GPUs?) Now I am very confused. Could you give me some demo code and a comparison with np.dot?
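For reference, the memory-tiling idea behind gemm2 in that tutorial can be sketched in pure NumPy. This is only an illustration of the blocking scheme (with a hypothetical tile size `TS`), not a statement about GPU performance:

```python
import numpy as np

def tiled_matmul(A, B, TS=16):
    """Block (tile) the matrices the way gemm2 tiles work-groups:
    each (i, j) tile of C accumulates products of TS-wide tiles of A and B.
    On a GPU, each pair of tiles would be staged in __local memory so the
    work-group reuses it instead of re-reading global memory."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, TS):
        for j in range(0, n, TS):
            for k in range(0, n, TS):
                # one tile-by-tile partial product, accumulated into C's tile
                C[i:i+TS, j:j+TS] += A[i:i+TS, k:k+TS] @ B[k:k+TS, j:j+TS]
    return C

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```

The point of the tiling is data reuse: each element of A and B is loaded from global memory once per tile pass instead of once per output element.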
(I am writing a CNN framework, https://github.com/Image-Py/numpy-cnn, based only on numpy; it is now faster than torch-cpu and easy to install.) To get better performance, and to also support non-NVIDIA GPUs, I want to reimplement the layers in OpenCL, but I have already spent several days on the dot/GEMM alone. Could you help me give it a try?
Wow, this looks awesome!
I thought you had that already? I can take a look at your code if you point me to it.
Just my two cents; you may know this already because you have read a lot:
- When working with GPUs, you should try to minimize data transfer between RAM and GPU RAM. If a program takes one second on the CPU and runs a thousand times faster on the GPU, you won't measure that speedup if the data transfer takes two seconds. In clij, the commands for that transfer are called push and pull, and you should minimize their usage.
- If the amount of memory you are processing is very small, it fits in the cache of the CPU. In this case the CPU will outperform the GPU, because cache can be very fast.
- When you do timing measurements, repeat them several times. The first execution on a GPU (and in Java on the CPU as well) is slower because of the just-in-time compilation of the code.
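That last point can be folded into a small timing helper. This is a generic sketch, nothing clij-specific, and `bench` is a made-up name:

```python
import time
import numpy as np

def bench(fn, *args, repeats=3):
    """Call fn several times and return the per-call time in ms.
    The first entry typically includes one-off costs (JIT compilation,
    kernel build, cache warm-up), so judge performance by the later runs."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - t0) * 1000)
    return times

a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
print(bench(np.dot, a, b))  # first run is usually the slowest
```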
I put some more hints in the FAQ of #clij:
Does this help?
dot_step.txt (4.9 KB)
Here I followed https://cnugteren.github.io/tutorial/pages/page2.html, but the results got slower and slower.
And I have tested many other implementations; none is better than the naive one.
Just a quick question about your code: what's the difference between "opencl 1 time", "opencl 2 time" and "opencl 3 time"? It looks like you are comparing different operations on GPU and CPU with each other…
One more question. I’m trying to run your code and get an error message:
```
pyopencl._cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE
Build on <pyopencl.Device 'GeForce RTX 2060 SUPER' on 'NVIDIA CUDA' at 0x1fe8c367780>:
<kernel>:12:27: error: variable length arrays are not supported in OpenCL
__local float Asub[TS][TS];
```
What hardware are you running on? This could be a hint towards lack of performance…
I followed this tutorial; it teaches you to write a naive version and then upgrade it step by step to make it faster.
"1 time", "2 time", "3 time" are the naive gemm1 and the upgraded gemm2 and gemm3. (There are 8 upgrade steps in total, but I did not go further, because the results kept getting slower…)
Because it comes from C code, I translated the #define into an int; you can just replace TS with 16.
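Since OpenCL kernels are compiled at runtime, you don't actually have to edit the source by hand: you can prepend the `#define` to the kernel source string, or (if I recall the pyopencl API correctly) pass it as a compiler option via `cl.Program(ctx, src).build(options=["-DTS=16"])`. Either way, TS becomes a compile-time constant, which also avoids the variable-length-array error above. A minimal sketch of the string approach (the kernel body is abbreviated):

```python
TS = 16  # tile size; must be a compile-time constant inside the kernel

kernel_src = """
__kernel void gemm2(const int N,
                    __global const float* A,
                    __global const float* B,
                    __global float* C) {
    __local float Asub[TS][TS];   // fixed-size now that TS is #defined
    __local float Bsub[TS][TS];
    // ... tiled multiplication body ...
}
"""

# Prepend the define so TS is a constant when OpenCL compiles the kernel.
src = f"#define TS {TS}\n" + kernel_src
# Equivalent build-option route (assumption, check the pyopencl docs):
#   program = cl.Program(ctx, kernel_src).build(options=[f"-DTS={TS}"])
print(src.splitlines()[0])  # -> "#define TS 16"
```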
I just wrote a little example using clesperanto, which is built on top of gputools and pyopencl:
It measures the time for the data transfer:
```python
# push data to GPU
start = time()
gpu_a = push(a_np)
gpu_b = push(b_np)

# allocate memory for result on GPU
gpu_c = create((1024, 1024))
print('push+alloc time', (time() - start) * 1000)
```
Then, it executes the operation twice on GPU:
```python
# multiply matrix on GPU
start = time()
multiplyMatrix(gpu_a, gpu_b, gpu_c)
print('cle1 time', (time() - start) * 1000)

start = time()
multiplyMatrix(gpu_a, gpu_b, gpu_c)
print('cle2 time', (time() - start) * 1000)
```
and twice on the CPU:
```python
# multiply matrix on CPU
start = time()
np.dot(a_np, b_np)
print('np1 time', (time() - start) * 1000)

start = time()
np.dot(a_np, b_np)
print('np2 time', (time() - start) * 1000)
```
The output is:
```
push+alloc time 12.990236282348633
cle1 time 31.94737434387207
cle2 time 4.979372024536133
np1 time 16.96324348449707
np2 time 14.928102493286133
```
As you can see, the GPU/OpenCL implementation becomes faster in the second run. Furthermore, pushing the arrays to the GPU already takes as much time as the whole operation on the CPU. I conclude that you can only exploit the GPU efficiently if you reduce push/pull commands and implement the whole program in OpenCL, not just a single matrix multiplication. Note that my matrix multiplication in OpenCL is a naive implementation with no fancy optimization; optimizing the OpenCL kernel code cannot improve push or allocation time.
Very exciting! In fact I just want a fast GEMM (I have implemented the other CNN layers). I will give it a try, thanks very much. I may have some other questions later.
Have you tried
When running a similar benchmark as @haesleinhuepf with a 4096² matrix and pyclblast.gemm, I get the following:
```
numpy      253.47 ms
cle1       197.35 ms
cle2       174.06 ms
pyclblast   17.96 ms
```
And now I have turned to ClPy and CuPy. I upgraded my CNN framework to Planer, implemented with a pure NumPy-like interface but able to switch the backend to cupy or clpy. Now it is fast enough, with performance equal to torch-gpu on Windows (though torch is faster on Linux).