Train_network issue

Hi all, I’m attempting to use DLC to track the movement of a basketball approaching a hoop. I have two synced cameras set up, one below and one to the side of the hoop. I manually select and label the frames (e.g. frontHoop, backHoop, ball) and create a training dataset, but when I go to train the network nothing happens. Specifically, I get the message “Starting with standard pose-dataset loader.” (and then some TensorFlow deprecation warnings), then the fan on my PC fires up and a GPU process appears, but training doesn’t start. I leave it for 30+ minutes and nothing happens.

I’ve used this exact same processing pipeline to track adult human body position and the flight of a ball, and train_network has always worked just fine, so I don’t think it’s a coding error. I’m running Ubuntu 18.04 with an RTX 2070 GPU and 64 GB RAM, so plenty of processing power. Does anyone have any suggestions? The videos are fairly low resolution compared to videos I’ve used successfully in the past, and of course there are no humans present in the videos, so I’m technically not tracking any pose information. Perhaps this has something to do with it.

Here’s an example video:

And here’s the output when I run train_network:

Starting with standard pose-dataset loader.
WARNING:tensorflow:From /home/koala/anaconda3/envs/dlc-ubuntu-GPU/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/koala/anaconda3/envs/dlc-ubuntu-GPU/lib/python3.6/site-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.

Update: I tried converting the video from mp4 format to avi format, and I ran through the whole processing pipeline again and now train_network is working. My guess is that there’s something weird about the format of the original video file that was causing problems.
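The post does not say how the conversion was done. One possible way, sketched here under the assumption that ffmpeg is installed and on PATH, is to re-encode the file from Python. The helper names, the Motion JPEG codec, and the quality setting are illustrative choices, not something taken from the thread.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst):
    """Return the ffmpeg argument list that re-encodes src into dst."""
    return [
        "ffmpeg",
        "-y",             # overwrite dst if it already exists
        "-i", str(src),   # input video
        "-c:v", "mjpeg",  # re-encode video as Motion JPEG (assumed choice)
        "-q:v", "3",      # quality scale, lower means better quality
        "-an",            # drop any audio stream
        str(dst),
    ]


def convert_to_avi(src):
    """Re-encode src next to itself with an .avi extension; return the new path."""
    src = Path(src)
    dst = src.with_suffix(".avi")
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return dst
```

Calling convert_to_avi("hoop_side.mp4") (a hypothetical file name) would write hoop_side.avi alongside the original. The re-encoded video still has to be added to the project and labelled again, as described in the update above.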

Can you set displayiters=1 when you run deeplabcut.train_ne to see if it starts training? It might be that it’s somehow just using the CPU, in which case hitting 1000 iterations would take more than an hour…

deeplabcut.train_network(config, shuffle=1, trainingsetindex=0, gputouse=None, max_snapshots_to_keep=5, autotune=False, displayiters=None, saveiters=None, maxiters=None)
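As a rough sketch of the suggestion above: the snippet below first asks TensorFlow which devices it can see, then starts a short run that reports the loss at every iteration. The keyword names come from the signature quoted above; the helper names, the config path, and the saveiters/maxiters values are placeholders. The device query itself initialises the GPU, so it is probably safest to run it in a separate Python process from the one that trains.

```python
def gpu_device_names(devices):
    """Return the names of all devices whose type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def visible_gpus():
    """List the GPUs TensorFlow can see (an empty list means CPU only)."""
    # Imported lazily so the helpers above work without TensorFlow installed.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())


def short_training_kwargs():
    """Keyword arguments for a brief diagnostic run (values are illustrative)."""
    return dict(shuffle=1, displayiters=1, saveiters=100, maxiters=200)


def run_short_training(config_path):
    """Start a short training run that prints progress at every iteration."""
    import deeplabcut
    deeplabcut.train_network(config_path, **short_training_kwargs())
```

If visible_gpus() returns an empty list, training would fall back to the CPU, which would match the slow-start theory. If a GPU is listed and run_short_training("/path/to/config.yaml") still prints nothing, the stall is happening before the first iteration.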

Hi, thank you for getting back to me. I tried setting displayiters=1, but when I encounter this issue, train_network doesn’t get to the stage where it starts producing output. When I run this in a Jupyter notebook, the only output I get is what I posted above. I just encountered the same thing with a video from another project (this video is in .avi format), so it seems like the problem isn’t just with the previous video.

When this happens, I notice a GPU process opens, but the GPU memory usage is far lower (95 MB) than it typically is when running train_network.
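To put numbers on that observation, GPU memory can be polled while train_network is running. This is a minimal sketch that assumes the nvidia-smi command-line tool is available; the function names are made up for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' pairs in MiB, one GPU per line, into tuples of ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory on every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return parse_gpu_memory(result.stdout)
```

Running query_gpu_memory() once for the project that trains and once for the project that stalls gives a direct comparison of how much memory each run actually claims.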

Train_Network_Output.txt (3.0 KB)

Oh, don’t import tensorflow and start a session separately- that will allocate your gpu elsewhere!

Ah right, yes it doesn’t make sense to start a separate tf session. However, I still find that I need to import tensorflow and set allow_growth=True, otherwise any DLC processes that require the GPU simply won’t run on my machine.

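import tensorflow as tf
tf.__version__  # check which TensorFlow version is installed
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand rather than all at once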

If I don’t include these lines, I get a bunch of cuDNN errors (e.g. see below). But if I do include them, then GPU-intensive DLC processes work just fine… with the exception of certain video files, like the example above. It’s very strange…

Duration of video [s]: 204.07 , recorded with 30.0 fps!
Overall # of frames: 6122 found with (before cropping) frame dimensions: 1920 1080
Starting to extract posture
2019-09-24 21:08:39.938055: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-24 21:08:39.939601: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node resnet_v1_50/conv1/Conv2D (defined at /home/koala/anaconda3/envs/dlc-ubuntu-GPU/lib/python3.6/site-packages/deeplabcut/pose_estimation_tensorflow/nnet/pose_net.py:67) ]]
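On the allow_growth workaround discussed above: newer TensorFlow releases document an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, that requests the same on-demand allocation without building a ConfigProto by hand. Whether the TensorFlow build in this particular environment honours it is an assumption that would need checking. The sketch below sets the variable before TensorFlow is imported; the train helper and its config path argument are placeholders.

```python
import os

# Set this before TensorFlow is imported (directly or through deeplabcut),
# so it is in place when the GPU is first initialised.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def train(config_path):
    """Run training with on-demand GPU memory allocation requested via the environment."""
    import deeplabcut  # imported only after the environment variable is set
    deeplabcut.train_network(config_path, shuffle=1, displayiters=1)
```

If the variable is honoured, no separate TensorFlow import or session is needed in the notebook, which avoids the risk of allocating the GPU elsewhere.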

Sounds like the ResNet weights are not found?

Perhaps. Is there a way that I can verify this? I always run “create_training_dataset” (and receive confirmation that the training dataset was successfully created) before attempting to train the network. I’m at a bit of a loss - today I ran two videos recorded on the same model of camera, same resolution etc. through my pipeline, and train_network will happily run for one project but not the other.
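One way to check, sketched under the assumption that the project has the usual train/pose_cfg.yaml under its dlc-models folder: that file contains an init_weights entry pointing at the checkpoint used to initialise training. The helpers below read that entry with plain string handling and test whether a matching file exists. The function names are illustrative.

```python
import glob
from pathlib import Path


def read_init_weights(pose_cfg_path):
    """Return the init_weights value from a pose_cfg.yaml, or None if absent."""
    for line in Path(pose_cfg_path).read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "init_weights":
            return value.strip().strip("'\"")
    return None


def weights_exist(init_weights):
    """True if a file with this exact path, or with this path as a prefix, exists."""
    if not init_weights:
        return False
    # Checkpoints may be a single .ckpt file or a prefix shared by several
    # files (.index, .meta, .data-*), so match on the prefix.
    return bool(glob.glob(glob.escape(init_weights) + "*"))
```

Running weights_exist(read_init_weights(path)) on the pose_cfg.yaml of both the working and the failing project shows whether the two point at the same, existing weights file. Comparing the two files line by line may also reveal other differences between the projects.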