Error During Training

Hi all,
I am getting the following error during training on a PC with an NVIDIA GeForce RTX 2080 GPU. I am using resnet_101 with the human retrain model, and the error appears after 30000 iterations.
Any ideas, please?
Many thanks,
Dimitrios

019-11-17 11:41:22.431074: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f7e84e0ee00 = {1, 0} LossTensor is inf or nan
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
	 [[{{node train_op/CheckNumerics}}]]
	 [[{{node train_op/control_dependency}}]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
~/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/gui/train_network.py in train_network(self, event)
    244                                  displayiters=displayiters,
    245                                  saveiters=saveiters,
--> 246                                  maxiters=maxiters)
    247 
    248     def cancel_train_network(self,event):

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/training.py in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights)
    132           train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
    133       except BaseException as e:
--> 134           raise e
    135       finally:
    136           os.chdir(str(start_path))

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/training.py in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights)
    130 
    131       try:
--> 132           train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
    133       except BaseException as e:
    134           raise e

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/train.py in train(config_yaml, displayiters, saveiters, maxiters, max_to_keep, keepdeconvweights, allow_growth)
    188         current_lr = lr_gen.get_lr(it)
    189         [_, loss_val, summary] = sess.run([train_op, total_loss, merged_summaries],
--> 190                                           feed_dict={learning_rate: current_lr})
    191         cum_loss += loss_val
    192         train_writer.add_summary(summary, it)

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
	 [[node train_op/CheckNumerics (defined at /home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/train.py:102) ]]
	 [[node train_op/control_dependency (defined at /home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/train.py:102) ]]

Caused by op 'train_op/CheckNumerics', defined at:
  File "/home/dimitris/anaconda3/envs/dlc2/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/__init__.py", line 125, in start_ipython
    return launch_new_instance(argv=argv, **kwargs)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/traitlets/config/application.py", line 664, in launch_instance
    app.start()
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/terminal/ipapp.py", line 356, in start
    self.shell.mainloop()
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/terminal/interactiveshell.py", line 498, in mainloop
    self.interact()
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/terminal/interactiveshell.py", line 489, in interact
    self.run_cell(code, store_history=True)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2855, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in _run_cell
    return runner(coro)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/core/async_helpers.py", line 68, in _pseudo_sync_runner
    coro.send(None)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3058, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3249, in run_ast_nodes
    if (await self.run_code(code, result,  async_=asy)):
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-1ced4355ffc5>", line 1, in <module>
    deeplabcut.launch_dlc()
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/gui/launch_script.py", line 45, in launch_dlc
    app.MainLoop()
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/wx/core.py", line 2134, in MainLoop
    rv = wx.PyApp.MainLoop(self)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/gui/train_network.py", line 246, in train_network
    maxiters=maxiters)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/training.py", line 132, in train_network
    train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/train.py", line 151, in train
    learning_rate, train_op = get_optimizer(total_loss, cfg)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/train.py", line 102, in get_optimizer
    train_op = slim.learning.create_train_op(loss_op, optimizer)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 439, in create_train_op
    check_numerics=check_numerics)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/training.py", line 464, in create_train_op
    'LossTensor is inf or nan')
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values
	 [[node train_op/CheckNumerics (defined at /home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/train.py:102) ]]
	 [[node train_op/control_dependency (defined at /home/dimitris/anaconda3/envs/dlc2/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/train.py:102) ]]

Hi,

That means the loss became inf or NaN. Usually this is due to a corrupt image, bad labels, or something similar.

Alexander
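
One way to act on the suggestion above is to scan the labeled data for coordinates that are not finite or that fall outside their image before retraining. The sketch below is only an illustration: the function name `find_suspect_labels` and the dictionary layout are hypothetical and are not DeepLabCut's on-disk format, so the labels would first have to be loaded into this shape.

```python
import math


def find_suspect_labels(labels, image_sizes):
    """Return (image, bodypart, reason) tuples for labels worth inspecting.

    labels:      {image_name: {bodypart: (x, y)}}, with None for an unlabeled
                 bodypart (hypothetical layout, chosen for this sketch)
    image_sizes: {image_name: (width, height)} in pixels; an image that could
                 not be opened is simply absent from this mapping
    """
    problems = []
    for image, parts in labels.items():
        size = image_sizes.get(image)
        if size is None:
            problems.append((image, None, "image missing or unreadable"))
            continue
        width, height = size
        for bodypart, xy in parts.items():
            if xy is None:
                continue  # unlabeled bodypart, nothing to check
            x, y = xy
            if not (math.isfinite(x) and math.isfinite(y)):
                problems.append((image, bodypart, "non-finite coordinate"))
            elif not (0 <= x < width and 0 <= y < height):
                problems.append((image, bodypart, "outside image bounds"))
    return problems
```

An empty result does not prove the data is clean (a truncated image file can still have a valid size), but any entry it does report is a good first place to look.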

It seems that something is going wrong with cuDNN.
Now I am getting a "GPU sync failed" error or CUDA_ERROR_ILLEGAL_ADDRESS.

You should make sure your GPU process is stopped first. Check nvidia-smi and make sure no python3 session is running; if one is, kill it and start a new ipython session.
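
The check described above can also be scripted. The sketch below lists the compute processes that `nvidia-smi` reports and picks out the ones that look like Python sessions. The helper names are hypothetical; the query uses the `--query-compute-apps` and `--format=csv,noheader` options of `nvidia-smi`, and it assumes Python 3.7 or newer for `capture_output`. It deliberately does not kill anything.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV row per process currently using the GPU for compute.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into (pid, name, memory) tuples."""
    apps = []
    for line in csv_text.splitlines():
        head = line.split(",", 1)
        if len(head) != 2 or not head[0].strip().isdigit():
            continue  # skip blank or unexpected lines
        tail = head[1].rsplit(",", 1)
        if len(tail) != 2:
            continue
        apps.append((int(head[0].strip()), tail[0].strip(), tail[1].strip()))
    return apps


def list_gpu_processes():
    """Return the parsed process list, or [] if nvidia-smi is not on the PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


def python_gpu_processes(apps):
    """Keep only the processes whose name mentions python."""
    return [app for app in apps if "python" in app[1].lower()]
```

Calling `python_gpu_processes(list_gpu_processes())` gives the PIDs to review. Stopping them is left as a manual step (for example `kill <pid>` in a terminal), so that a session still doing useful work is not terminated by accident.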

We didn’t manage to sort out the problem, so we are in the process of reinstalling Linux.
I am thinking of retrying the DLC install via Docker. On my first try, I failed to install it properly.

If you are doing a fresh install, I highly recommend following this fully: https://github.com/MMathisLab/Docker4DeepLabCut2.0/wiki/Installation-of-NVIDIA-driver-and-CUDA-10 Then follow the instructions for Docker on the same page.

Is there a specific reason that you install CUDA before the drivers?
On a fresh install, when we installed the drivers first and then CUDA, we didn’t manage to get the GPU to work. I am now wondering if there is a hardware problem.
D

It is just that this is the most reliable build I can install, which is why I use that order of operations.