Clesperanto 'getMemory'

Hi @haesleinhuepf

I was just playing with clesperanto. In fact, my ultimate goal is to replicate the workflow you recently posted here.

However, I wanted to first implement something similar without lazy loading, just to get an idea of how everything worked and how much memory each part was taking.

I couldn’t find the clesperanto equivalent of reportMemory. Any advice on how to get a memory report with clesperanto?

Thanks

Hey Brian @bnorthan ,

Good question! This method doesn’t exist (yet) because we would need to maintain an internal list of CL-mem objects, and such a list would break pyopencl's garbage collection mechanism. I was happy to see that pyopencl is able to clean up memory itself, and I wanted to see (mid-term) how well it does that before I break that feature :wink:
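
Just to illustrate the idea, here is a minimal sketch (hypothetical, not clesperanto API) of how buffers could be tracked for such a report without holding strong references, so that pyopencl's garbage collection keeps working:

import weakref

import pyopencl as cl

_live = weakref.WeakSet()

class TrackedBuffer:
    # thin wrapper so the tracker never keeps a cl_mem object alive itself
    def __init__(self, context, nbytes):
        self.buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE, size=nbytes)
        self.nbytes = nbytes
        _live.add(self)

def report_memory():
    # sum the sizes of all wrappers that are still referenced somewhere
    total = sum(b.nbytes for b in _live)
    print(f"{len(_live)} buffers alive, {total / 1024 ** 2:.1f} MiB")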

The workflow you’re re-implementing is also documented a bit better here. Furthermore, this exercise (and its solution) will be of interest to you, I’m sure :wink:

Let me know if you have any questions!

Cheers,
Robert

Hi @haesleinhuepf

So I tried to run the code below, which re-uses your background_subtraction and spot_detection in napari with a 0.25 GB volume on a 6 GB card. It failed on the spot_detection step with a memory allocation error. Since napari is using GPU memory and the algorithms are allocating temporary memory, it may make sense that it failed; still, I was curious whether there is a way to see the underlying memory use. I also tried using CUDA to get memory use, but that failed with a driver problem. I may explore in Java for a while, where it is easier to get a memory report. I saw there is also a lazy-loading example using CLIJ and imglib2, so I am going to play with that one. The end goal is to do spot counting and other operations on big data.

import pyclesperanto_prototype as cle

gpu_im = cle.push_zyx(im)    # copy the numpy volume to GPU memory
gpu_bs = cle.create(gpu_im)  # allocate an output buffer of the same shape
background_subtraction(gpu_im, gpu_bs)
spot_detection(gpu_bs, gpu_im, 6000)
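
For scale, a quick back-of-the-envelope estimate (just a sketch; it assumes clesperanto stores buffers as float32 and only counts the two named buffers, not any temporaries inside the filters):

import numpy as np

nbytes = np.prod(im.shape) * 4  # float32 voxels
print(f"gpu_im + gpu_bs hold ~{2 * nbytes / 1024 ** 3:.2f} GiB together")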

Hey Brian @bnorthan ,

these aspects are also very interesting for me, as I’m now running the first real use-case projects with pyclesperanto and plan to deliver scripts to collaborators. Thus, I’m happy to hear about any new insights you find!

I think so too; VisPy appears to render images using OpenGL. Thus, I’d actually like to learn how to pipe OpenCL-mem objects into napari, also in order to spare some copy operations…
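
In case it helps, here is a rough sketch of the direction, using pyopencl's helpers for the cl_khr_gl_sharing extension; it only runs while an OpenGL context (e.g. the one VisPy created) is current:

import pyopencl as cl
from pyopencl.tools import get_gl_sharing_context_properties

# create an OpenCL context that can share buffers with the current GL context
platform = cl.get_platforms()[0]
ctx = cl.Context(
    properties=[(cl.context_properties.PLATFORM, platform)]
    + get_gl_sharing_context_properties()
)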

Just for debugging the out-of-memory issue you mentioned, would you mind executing these lines and posting their output?

import pyclesperanto_prototype as cle

# print out CL Info to see available devices
print(cle.cl_info())

# select a good default device
cle.select_device("RTX")

# print out chosen device
print(cle.get_device())

print(cle.cl_info())
NVIDIA CUDA
EXTENSIONS:cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics
HOST_TIMER_RESOLUTION:None
NAME:NVIDIA CUDA
PROFILE:FULL_PROFILE
VENDOR:NVIDIA Corporation
VERSION:OpenCL 1.2 CUDA 10.2.185


    GeForce RTX 2060
       ADDRESS_BITS:64
       ATTRIBUTE_ASYNC_ENGINE_COUNT_NV:3
       AVAILABLE:1
       BUILT_IN_KERNELS:
       COMPILER_AVAILABLE:1
       COMPUTE_CAPABILITY_MAJOR_NV:7
       COMPUTE_CAPABILITY_MINOR_NV:5
       DOUBLE_FP_CONFIG:63
       DRIVER_VERSION:440.100
       ENDIAN_LITTLE:1
       ERROR_CORRECTION_SUPPORT:0
       EXECUTION_CAPABILITIES:1
       EXTENSIONS:cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics
       EXT_MEM_PADDING_IN_BYTES_QCOM:None
       GLOBAL_MEM_CACHELINE_SIZE:128
       GLOBAL_MEM_CACHE_SIZE:983040
       GLOBAL_MEM_CACHE_TYPE:2
       GLOBAL_MEM_SIZE:6222839808
       GLOBAL_VARIABLE_PREFERRED_TOTAL_SIZE:None
       GPU_OVERLAP_NV:1
       HALF_FP_CONFIG:None
       HOST_UNIFIED_MEMORY:0
       IL_VERSION:None
       IMAGE2D_MAX_HEIGHT:32768
       IMAGE2D_MAX_WIDTH:32768
       IMAGE3D_MAX_DEPTH:16384
       IMAGE3D_MAX_HEIGHT:16384
       IMAGE3D_MAX_WIDTH:16384
       IMAGE_MAX_ARRAY_SIZE:2048
       IMAGE_MAX_BUFFER_SIZE:268435456
       IMAGE_SUPPORT:1
       INTEGRATED_MEMORY_NV:0
       KERNEL_EXEC_TIMEOUT_NV:1
       LINKER_AVAILABLE:1
       LOCAL_MEM_SIZE:49152
       LOCAL_MEM_TYPE:1
       MAX_CLOCK_FREQUENCY:1200
       MAX_COMPUTE_UNITS:30
       MAX_CONSTANT_ARGS:9
       MAX_CONSTANT_BUFFER_SIZE:65536
       MAX_GLOBAL_VARIABLE_SIZE:None
       MAX_MEM_ALLOC_SIZE:1555709952
       MAX_NUM_SUB_GROUPS:None
       MAX_ON_DEVICE_EVENTS:2048
       MAX_ON_DEVICE_QUEUES:4
       MAX_PARAMETER_SIZE:4352
       MAX_PIPE_ARGS:None
       MAX_READ_IMAGE_ARGS:256
       MAX_READ_WRITE_IMAGE_ARGS:None
       MAX_SAMPLERS:32
       MAX_WORK_GROUP_SIZE:1024
       MAX_WORK_ITEM_DIMENSIONS:3
       MAX_WORK_ITEM_SIZES:[1024, 1024, 64]
       MAX_WRITE_IMAGE_ARGS:32
       MEM_BASE_ADDR_ALIGN:4096
       MIN_DATA_TYPE_ALIGN_SIZE:128
       NAME:GeForce RTX 2060
       NATIVE_VECTOR_WIDTH_CHAR:1
       NATIVE_VECTOR_WIDTH_DOUBLE:1
       NATIVE_VECTOR_WIDTH_FLOAT:1
       NATIVE_VECTOR_WIDTH_HALF:0
       NATIVE_VECTOR_WIDTH_INT:1
       NATIVE_VECTOR_WIDTH_LONG:1
       NATIVE_VECTOR_WIDTH_SHORT:1
       OPENCL_C_VERSION:OpenCL C 1.2 
       PAGE_SIZE_QCOM:None
       PARTITION_AFFINITY_DOMAIN:[0]
       PARTITION_MAX_SUB_DEVICES:1
       PARTITION_PROPERTIES:[0]
       PARTITION_TYPE:[0]
       PCI_BUS_ID_NV:1
       PCI_SLOT_ID_NV:0
       PIPE_MAX_ACTIVE_RESERVATIONS:None
       PIPE_MAX_PACKET_SIZE:None
       PLATFORM:<pyopencl.Platform 'NVIDIA CUDA' at 0x5632aea2ecb0>
       PREFERRED_GLOBAL_ATOMIC_ALIGNMENT:None
       PREFERRED_INTEROP_USER_SYNC:0
       PREFERRED_LOCAL_ATOMIC_ALIGNMENT:None
       PREFERRED_PLATFORM_ATOMIC_ALIGNMENT:None
       PREFERRED_VECTOR_WIDTH_CHAR:1
       PREFERRED_VECTOR_WIDTH_DOUBLE:1
       PREFERRED_VECTOR_WIDTH_FLOAT:1
       PREFERRED_VECTOR_WIDTH_HALF:0
       PREFERRED_VECTOR_WIDTH_INT:1
       PREFERRED_VECTOR_WIDTH_LONG:1
       PREFERRED_VECTOR_WIDTH_SHORT:1
       PRINTF_BUFFER_SIZE:None
       PROFILE:FULL_PROFILE
       PROFILING_TIMER_OFFSET_AMD:None
       PROFILING_TIMER_RESOLUTION:1000
       QUEUE_ON_DEVICE_MAX_SIZE:262144
       QUEUE_ON_DEVICE_PREFERRED_SIZE:262144
       QUEUE_ON_DEVICE_PROPERTIES:3
       QUEUE_ON_HOST_PROPERTIES:3
       QUEUE_PROPERTIES:3
       REFERENCE_COUNT:1
       REGISTERS_PER_BLOCK_NV:65536
       SINGLE_FP_CONFIG:191
       SPIR_VERSIONS:None
       SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS:None
       SVM_CAPABILITIES:1
       TYPE:4
       VENDOR:NVIDIA Corporation
       VENDOR_ID:4318
       VERSION:OpenCL 1.2 CUDA
       WARP_SIZE_NV:32

Current device: GeForce RTX 2060

cle.select_device("RTX")
Out[3]: <GeForce RTX 2060 on Platform: NVIDIA CUDA (1 refs)>

print(cle.get_device())
<GeForce RTX 2060 on Platform: NVIDIA CUDA (1 refs)>

So… on Ubuntu, the utility ‘nvidia-smi’ prints out GPU memory usage for each process and is useful for troubleshooting. I guess I didn’t know about this because a lot of my GPU programming has been done in CUDA, which has a way to get GPU memory use programmatically.
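
For reference, the same numbers can also be grabbed from a script by shelling out to nvidia-smi (a sketch; it assumes nvidia-smi is on the PATH):

import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"]
)
print(out.decode())  # e.g. "2048 MiB, 6144 MiB"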

It turns out napari was using quite a bit of GPU memory. Each time you display an image, it obviously needs to use GPU memory. However, if I delete the image with the garbage-can icon, it doesn’t look like the memory gets freed.

I also noticed that cle.difference_of_gaussians seemed to allocate some temporary memory. Is that correct?

Anyway, it doesn’t appear that anything is terribly wrong; you just have to be careful, as you can quickly use up GPU memory during a napari session where you are viewing images and processing them on the GPU.

Brian

Thanks for your insights @bnorthan!

Correct!

Can you see with your debug tools if this temporary memory gets freed properly?

I will try it out anyway. Good hint, thanks!

It looks to me like the memory is not deleted right after the call to difference_of_gaussians, but memory does not accumulate with repeated calls either. I think you mentioned pyopencl has a garbage collector, so maybe it doesn’t collect until the next memory allocation or something?
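
Here is roughly the kind of repeated-call test I mean, with gaussian_blur standing in for the filter (the shape and sigmas are arbitrary); watching nvidia-smi, the usage stays flat across the loop iterations:

import gc

import numpy as np
import pyclesperanto_prototype as cle

im = cle.push_zyx(np.random.rand(64, 256, 256))
out = cle.create(im)
for _ in range(10):
    cle.gaussian_blur(im, out, 2, 2, 2)

del im, out
gc.collect()  # the buffers disappear once Python drops the last references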
