Issues running DL on GRID V100 GPU

Hi,
I’m trying to set up a server for deep-learning image analysis at my institute. I want it to run all the popular DL tools (CARE, U-Nets, etc.). Our computing department has set me up with a VM with access to a GRID V100 GPU. For the GPU to work in the VM, I have to use the same driver version as the host server (442.06 in this case). I’ve had no luck getting CUDA working. I suspect that TensorFlow 1.x, which most image analysis software uses, does not support the CUDA version of my driver. I’ve tried many different CUDA toolkit/cuDNN versions. Does anyone have any ideas? More troubleshooting information below…

When I run…

from tensorflow.python.client import device_lib

device_lib.list_local_devices()

I get

2020-09-29 14:50:46.880025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll

device_lib.list_local_devices()

2020-09-29 14:50:55.491271: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

2020-09-29 14:50:55.496631: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll

2020-09-29 14:50:55.536935: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

2020-09-29 14:50:55.541056: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: Y95-JHAL-W-V

2020-09-29 14:50:55.541661: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: Y95-JHAL-W-V

Many thanks for your help!
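Since the failing call in the log is cuInit, which belongs to the driver API rather than to the CUDA toolkit or cuDNN, it may be worth checking whether the driver itself reports a GPU inside the VM before changing TensorFlow versions. A minimal sketch, assuming nvidia-smi is installed with the driver and is on PATH (on some Windows installs it is not, in which case the function returns None); the function name query_driver is only illustrative:

```python
import shutil
import subprocess


def query_driver():
    """Ask nvidia-smi which GPUs the driver sees.

    Returns None if nvidia-smi is not on PATH, an empty list if the
    command fails or reports nothing, otherwise one "name, driver_version"
    string per GPU.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (subprocess.TimeoutExpired, OSError):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(query_driver())
```

If this comes back empty or fails, the problem most likely sits below TensorFlow (driver, vGPU licensing, or VM configuration), and swapping toolkit versions is unlikely to help.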

I don’t know anything at all about the special situation of using a GPU in a VM.

Since you’re using conda, you can try creating a new environment and installing a known-working combination of TensorFlow, CUDA, and cuDNN in it. For example, try this:

conda env create -f https://raw.githubusercontent.com/CSBDeep/CSBDeep/master/extras/environment-gpu-py3.7-tf2.3.yml

environment-gpu-py3.7-tf2.3.yml will install these packages:

name: csbdeep
channels:
  - defaults
  - nvidia
dependencies:
  - python=3.7
  - cudatoolkit=10.1.*
  - cudnn=7.6.*
  - jupyter
  - pip
  - pip:
    - tensorflow==2.3.*
    - csbdeep
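After activating the new environment, a quick way to see whether TensorFlow finds the GPU is to list the physical devices. This is a small sketch for TensorFlow 2.x (tf.config.list_physical_devices is available from 2.1); the helper name visible_gpus is just for illustration, and it returns None when TensorFlow cannot be imported at all:

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see.

    Returns None if TensorFlow is not importable in this environment,
    and an empty list if it imports but finds no GPU.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    print(visible_gpus())
```

An empty list means TensorFlow loaded but found no usable GPU, which matches the CUDA_ERROR_NO_DEVICE situation described in this thread.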

Hi @uschmidt83,
Thanks so much for getting back to me! That was very useful. I’ve lost track of how many different versions of cuDNN and the CUDA toolkit I’ve downloaded from the NVIDIA website to try and get this working.

I’ve installed the conda environment from your link, but I’m still getting

2020-12-07 13:17:04.033367: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-12-07 13:17:04.080879: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Any other ideas? Is it possible that it’s not compatible with the CUDA version of the driver (10.2)?

Thanks,
John
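On the driver question: NVIDIA drivers are backward compatible with older toolkits, so a driver that reports CUDA 10.2 should still be able to run a 10.1 toolkit. What matters is that the driver is at least the minimum version the toolkit needs. A rough sketch of that check follows; the minimum versions in the table are taken from memory of NVIDIA's CUDA Toolkit release notes and should be treated as assumptions to verify, and the names MIN_DRIVER, parse_version and driver_supports are made up for this example:

```python
# Minimum driver version per CUDA toolkit release and platform.
# ASSUMPTION: values recalled from NVIDIA's CUDA Toolkit release notes;
# check the current table before relying on them.
MIN_DRIVER = {
    "10.0": {"linux": (410, 48), "windows": (411, 31)},
    "10.1": {"linux": (418, 39), "windows": (418, 96)},
    "10.2": {"linux": (440, 33), "windows": (441, 22)},
}


def parse_version(text):
    """Turn a driver version string such as '442.06' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver, toolkit, platform):
    """True if the driver meets the minimum version for the given toolkit."""
    return parse_version(driver) >= MIN_DRIVER[toolkit][platform]


if __name__ == "__main__":
    # Driver 442.06 with the cudatoolkit=10.1 environment on Windows.
    print(driver_supports("442.06", "10.1", "windows"))  # True
```

Under these assumed minimums, driver 442.06 is new enough for CUDA 10.1, so a version mismatch alone would not explain the CUDA_ERROR_NO_DEVICE error.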

Hi @uschmidt83,
Solved! I tried your environment on Linux instead of Windows on the same VM hardware and it worked with no issues. I’m putting it down to an issue with the GRID GPU and Windows.

Thanks again for your help. That environment was very useful.

John
