GPU acceleration works only on first run (StarDist, Jupyter notebook)

Good morning,

I'm stuck with my StarDist installation.

Installation
In Anaconda 4.9.2, I created a stardist environment with:
python=3.8
tensorflow-gpu=2.3.0
CUDA Toolkit 10.1 (no pyopencl/gputools installation). Paths are set in the environment variables.

Problem
The GPU is used on the first run of the notebook (the stardist2D and stardist3D notebooks from GitHub - stardist/stardist: StarDist - Object Detection with Star-convex Shapes). On subsequent runs the GPU is not used, or I get the following error after restarting the kernel:
“InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version”

Rebooting doesn't help. Re-installing the CUDA Toolkit sorted the problem, but that's not a viable solution. Obviously I don't know what I'm doing here. Anyone have a suggestion?

Many thanks

Bertrand

It is not quite clear what you do between runs.
Do you restart the notebook kernel?

If not, you may still have all GPU memory allocated, and by executing the cells again you will try to allocate it a second time.

A quick way to check is to run nvidia-smi on the command line to check how much GPU memory is already allocated and how much is available.

It is also not quite clear which notebook this is happening in, since you only linked to the repository.

Maybe you are trying to run the prediction notebook while the training notebook is still active and holding on to your GPU resources?
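
For reference, a minimal sketch of checking the allocated GPU memory from inside a notebook cell rather than a separate terminal; it simply wraps the nvidia-smi call and assumes the tool is on the PATH:

# Sketch: query allocated vs. total GPU memory without leaving the notebook.
# Assumes nvidia-smi (installed with the NVIDIA driver) is on the PATH.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one CSV row per GPU, e.g. "23000 MiB, 24576 MiB"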

Hi Volker

Thank you for the suggestion. I ended up doing a clean re-installation of Anaconda, as the issues were clearly related to my Anaconda setup. It fixed the problem. I will follow your advice about monitoring GPU activity.

Thank you

Bertrand

Spoke too fast. I got an error again after a GPU sync error while training the 3D model. Since then I have had no luck getting the GPU working again via the notebook. I'm posting the logs and errors below in case someone has encountered a similar issue.

Here are the logs

** Anaconda prompt // no errors visible **
(base) C:\Users\vernayb\Documents\notebooks>jupyter notebook
[I 10:14:58.935 NotebookApp] JupyterLab extension loaded from C:\ProgramData\Anaconda3\lib\site-packages\jupyterlab
[I 10:14:58.935 NotebookApp] JupyterLab application directory is C:\ProgramData\Anaconda3\share\jupyter\lab
[I 10:14:58.935 NotebookApp] Serving notebooks from local directory: C:\Users\vernayb\Documents\notebooks
[I 10:14:58.935 NotebookApp] Jupyter Notebook 6.1.4 is running at:
[I 10:14:58.935 NotebookApp] http://localhost:8888/?token=6487ae58b45c534532a40331785c8776e3c409af264f8edf
[I 10:14:58.935 NotebookApp] or http://127.0.0.1:8888/?token=6487ae58b45c534532a40331785c8776e3c409af264f8edf
[I 10:14:58.935 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 10:14:58.983 NotebookApp]

To access the notebook, open this file in a browser:
    file:///C:/Users/vernayb/AppData/Roaming/jupyter/runtime/nbserver-8904-open.html
Or copy and paste one of these URLs:
    http://localhost:8888/?token=6487ae58b45c534532a40331785c8776e3c409af264f8edf
 or http://127.0.0.1:8888/?token=6487ae58b45c534532a40331785c8776e3c409af264f8edf

[I 10:15:11.513 NotebookApp] Kernel started: 5ce00024-3883-461d-a2e8-be779143773a, name: stardist
2021-03-29 10:15:40.423053: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2021-03-29 10:16:40.500329: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2021-03-29 10:16:40.754438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:2d:00.0 name: Quadro P6000 computeCapability: 6.1
coreClock: 1.645GHz coreCount: 30 deviceMemorySize: 24.00GiB deviceMemoryBandwidth: 403.49GiB/s
2021-03-29 10:16:40.758987: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2021-03-29 10:16:40.788625: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2021-03-29 10:16:40.817048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll
2021-03-29 10:16:40.827086: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll
2021-03-29 10:16:40.859050: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll
2021-03-29 10:16:40.876279: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll
2021-03-29 10:16:40.934856: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2021-03-29 10:16:40.939522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-03-29 10:16:40.942428: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-29 10:16:41.556520: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x289619e4fe0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-29 10:16:41.560114: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-29 10:16:41.576916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:2d:00.0 name: Quadro P6000 computeCapability: 6.1
coreClock: 1.645GHz coreCount: 30 deviceMemorySize: 24.00GiB deviceMemoryBandwidth: 403.49GiB/s
2021-03-29 10:16:41.582771: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2021-03-29 10:16:41.585831: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2021-03-29 10:16:41.588874: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll
2021-03-29 10:16:41.591892: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll
2021-03-29 10:16:41.594917: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll
2021-03-29 10:16:41.598997: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll
2021-03-29 10:16:41.602082: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2021-03-29 10:16:41.606680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
[I 10:17:11.427 NotebookApp] Saving file at /stardist2D_2_training.ipynb

** Jupyter notebook **

When running:
model = StarDist2D(conf, name='stardist', basedir='models')

I got the following error:

InternalError Traceback (most recent call last)
in
----> 1 model = StarDist2D(conf, name=‘stardist’, basedir=‘models’)

~.conda\envs\stardist\lib\site-packages\stardist\models\model2d.py in init(self, config, name, basedir)
253 def init(self, config=Config2D(), name=None, basedir=’.’):
254 “”“See class docstring.”""
→ 255 super().init(config, name=name, basedir=basedir)
256
257

~.conda\envs\stardist\lib\site-packages\stardist\models\base.py in init(self, config, name, basedir)
157
158 def init(self, config, name=None, basedir=’.’):
→ 159 super().init(config=config, name=name, basedir=basedir)
160 threshs = dict(prob=None, nms=None)
161 if basedir is not None:

~.conda\envs\stardist\lib\site-packages\csbdeep\models\base_model.py in init(self, config, name, basedir)
109 self._update_and_check_config()
110 self._model_prepared = False
→ 111 self.keras_model = self._build()
112 if config is None:
113 self._find_and_load_weights()

~.conda\envs\stardist\lib\site-packages\stardist\models\model2d.py in _build(self)
269 pooled *= pool
270 for _ in range(self.config.unet_n_conv_per_depth):
→ 271 pooled_img = Conv2D(self.config.unet_n_filter_base, self.config.unet_kernel_size,
272 padding=‘same’, activation=self.config.unet_activation)(pooled_img)
273 pooled_img = MaxPooling2D(pool)(pooled_img)

~.conda\envs\stardist\lib\site-packages\tensorflow\python\keras\engine\base_layer.py in call(self, *args, **kwargs)
923 # >> model = tf.keras.Model(inputs, outputs)
924 if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
→ 925 return self._functional_construction_call(inputs, args, kwargs,
926 input_list)
927

~.conda\envs\stardist\lib\site-packages\tensorflow\python\keras\engine\base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
1096 # Build layer if applicable (if the build method has been
1097 # overridden).
→ 1098 self._maybe_build(inputs)
1099 cast_inputs = self._maybe_cast_inputs(inputs, input_list)
1100

~.conda\envs\stardist\lib\site-packages\tensorflow\python\keras\engine\base_layer.py in _maybe_build(self, inputs)
2641 # operations.
2642 with tf_utils.maybe_init_scope(self):
→ 2643 self.build(input_shapes) # pylint:disable=not-callable
2644 # We must set also ensure that the layer is marked as built, and the build
2645 # shape is stored since user defined build functions may not be calling

~.conda\envs\stardist\lib\site-packages\tensorflow\python\keras\layers\convolutional.py in build(self, input_shape)
195 self.filters)
196
→ 197 self.kernel = self.add_weight(
198 name=‘kernel’,
199 shape=kernel_shape,

~.conda\envs\stardist\lib\site-packages\tensorflow\python\keras\engine\base_layer.py in add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint, partitioner, use_resource, synchronization, aggregation, **kwargs)
595 caching_device = None
596
→ 597 variable = self._add_variable_with_custom_getter(
598 name=name,
599 shape=shape,

~.conda\envs\stardist\lib\site-packages\tensorflow\python\training\tracking\base.py in _add_variable_with_custom_getter(self, name, shape, dtype, initializer, getter, overwrite, **kwargs_for_getter)
743 initializer = checkpoint_initializer
744 shape = None
→ 745 new_variable = getter(
746 name=name,
747 shape=shape,

~.conda\envs\stardist\lib\site-packages\tensorflow\python\keras\engine\base_layer_utils.py in make_variable(name, shape, dtype, initializer, trainable, caching_device, validate_shape, constraint, use_resource, collections, synchronization, aggregation, partitioner)
131 # can remove the V1.
132 variable_shape = tensor_shape.TensorShape(shape)
→ 133 return tf_variables.VariableV1(
134 initial_value=init_val,
135 name=name,

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\variables.py in call(cls, *args, **kwargs)
258 def call(cls, *args, **kwargs):
259 if cls is VariableV1:
→ 260 return cls._variable_v1_call(*args, **kwargs)
261 elif cls is Variable:
262 return cls._variable_v2_call(*args, **kwargs)

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\variables.py in _variable_v1_call(cls, initial_value, trainable, collections, validate_shape, caching_device, name, variable_def, dtype, expected_shape, import_scope, constraint, use_resource, synchronization, aggregation, shape)
204 if aggregation is None:
205 aggregation = VariableAggregation.NONE
→ 206 return previous_getter(
207 initial_value=initial_value,
208 trainable=trainable,

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\variables.py in (**kwargs)
197 shape=None):
198 “”“Call on Variable class. Useful to force the signature.”""
→ 199 previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
200 for _, getter in ops.get_default_graph()._variable_creator_stack: # pylint: disable=protected-access
201 previous_getter = _make_getter(getter, previous_getter)

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\variable_scope.py in default_variable_creator(next_creator, **kwargs)
2581 if use_resource:
2582 distribute_strategy = kwargs.get(“distribute_strategy”, None)
→ 2583 return resource_variable_ops.ResourceVariable(
2584 initial_value=initial_value,
2585 trainable=trainable,

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\variables.py in call(cls, *args, **kwargs)
262 return cls._variable_v2_call(*args, **kwargs)
263 else:
→ 264 return super(VariableMetaclass, cls).call(*args, **kwargs)
265
266

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py in init(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
1505 self._init_from_proto(variable_def, import_scope=import_scope)
1506 else:
→ 1507 self._init_from_args(
1508 initial_value=initial_value,
1509 trainable=trainable,

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape)
1649 with ops.name_scope(“Initializer”), device_context_manager(None):
1650 initial_value = ops.convert_to_tensor(
→ 1651 initial_value() if init_from_fn else initial_value,
1652 name=“initial_value”, dtype=dtype)
1653 if shape is not None:

~.conda\envs\stardist\lib\site-packages\tensorflow\python\keras\initializers\initializers_v2.py in call(self, shape, dtype)
395 (via tf.keras.backend.set_floatx(float_dtype))
396 “”"
→ 397 return super(VarianceScaling, self).call(shape, dtype=_get_dtype(dtype))
398
399

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\init_ops_v2.py in call(self, shape, dtype)
559 else:
560 limit = math.sqrt(3.0 * scale)
→ 561 return self._random_generator.random_uniform(shape, -limit, limit, dtype)
562
563 def get_config(self):

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\init_ops_v2.py in random_uniform(self, shape, minval, maxval, dtype)
1041 else:
1042 op = random_ops.random_uniform
→ 1043 return op(
1044 shape=shape, minval=minval, maxval=maxval, dtype=dtype, seed=self.seed)
1045

~.conda\envs\stardist\lib\site-packages\tensorflow\python\util\dispatch.py in wrapper(*args, **kwargs)
199 “”“Call target, and fall back on dispatchers if there is a TypeError.”""
200 try:
→ 201 return target(*args, **kwargs)
202 except (TypeError, ValueError):
203 # Note: convert_to_eager_tensor currently raises a ValueError, not a

~.conda\envs\stardist\lib\site-packages\tensorflow\python\ops\random_ops.py in random_uniform(shape, minval, maxval, dtype, seed, name)
286 maxval = 1
287 with ops.name_scope(name, “random_uniform”, [shape, minval, maxval]) as name:
→ 288 shape = tensor_util.shape_tensor(shape)
289 # In case of [0,1) floating results, minval and maxval is unused. We do an
290 # is comparison here since this is cheaper than isinstance or eq.

~.conda\envs\stardist\lib\site-packages\tensorflow\python\framework\tensor_util.py in shape_tensor(shape)
1027 # not convertible to Tensors because of mixed content.
1028 shape = tuple(map(tensor_shape.dimension_value, shape))
→ 1029 return ops.convert_to_tensor(shape, dtype=dtype, name=“shape”)
1030
1031

~.conda\envs\stardist\lib\site-packages\tensorflow\python\framework\ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
1497
1498 if ret is None:
→ 1499 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1500
1501 if ret is NotImplemented:

~.conda\envs\stardist\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
336 as_ref=False):
337 _ = as_ref
→ 338 return constant(v, dtype=dtype, name=name)
339
340

~.conda\envs\stardist\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name)
261 ValueError: if called on a symbolic tensor.
262 “”"
→ 263 return _constant_impl(value, dtype, shape, name, verify_shape=False,
264 allow_broadcast=True)
265

~.conda\envs\stardist\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
273 with trace.Trace(“tf.constant”):
274 return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
→ 275 return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
276
277 g = ops.get_default_graph()

~.conda\envs\stardist\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
298 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
299 “”“Implementation of eager constant.”""
→ 300 t = convert_to_eager_tensor(value, ctx, dtype)
301 if shape is None:
302 return t

~.conda\envs\stardist\lib\site-packages\tensorflow\python\framework\constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
95 except AttributeError:
96 dtype = dtypes.as_dtype(dtype).as_datatype_enum
—> 97 ctx.ensure_initialized()
98 return ops.EagerTensor(value, ctx.device_name, dtype)
99

~.conda\envs\stardist\lib\site-packages\tensorflow\python\eager\context.py in ensure_initialized(self)
537 if self._use_tfrt is not None:
538 pywrap_tfe.TFE_ContextOptionsSetTfrt(opts, self._use_tfrt)
→ 539 context_handle = pywrap_tfe.TFE_NewContext(opts)
540 finally:
541 pywrap_tfe.TFE_DeleteContextOptions(opts)

InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

** Test notebook (TensorFlow and GPU) **
import tensorflow as tf
print("TensorFlow version: " + tf.__version__)
OUTPUT = TensorFlow version: 2.3.0

tf.config.list_physical_devices()
OUTPUT = [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]
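
A slightly stronger check would be to push a small computation onto the GPU, since listing the physical devices does not by itself initialize the CUDA context; a minimal sketch (not part of the original test notebook) that surfaces the cudaGetDevice error earlier:

# Sketch: force a small computation onto the GPU so the CUDA context is
# actually created; listing devices alone does not do this.
import tensorflow as tf

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices('GPU'))

with tf.device('/GPU:0'):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)  # this line triggers CUDA initialization
print("Computed on:", c.device)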

Hello Bertrand!

Did you solve this problem?
I think I experienced a related problem yesterday, though not with Jupyter but with plain Python (in PyCharm). So I got different error messages, but they were also memory-related. I gathered it was a problem with memory growth. I was going nuts for hours, until early in the morning, trying to solve it, because it had worked previously. I thought I had accidentally broken my conda environment because I was experimenting with another CNN in other environments. I even went so far as to completely erase Anaconda. But nothing worked.
But I read somewhere that newer Nvidia drivers can cause problems with tensorflow-gpu. So I reinstalled the driver just a few minutes ago, and now it works again. :)

TLDR: Install the newest Nvidia driver for your GPU (or maybe the Studio edition of the driver).
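
Since memory growth came up, here is a minimal sketch of what I mean, using the plain TensorFlow 2.x API (this is generic, not StarDist-specific, and has to run before the first GPU operation, e.g. at the top of the notebook):

# Sketch: let TensorFlow allocate GPU memory on demand (memory growth)
# instead of reserving the whole card up front. Must run before any GPU op.
import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)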

Best wishes,
Christian

Hi Christian,

I haven't experienced the problem again, since I didn't run into the GPU sync error a second time. The driver was the most recent at the time; I'll check whether there is a newer version and update.

Thanks for the tip.

Bertrand