Training stops with no error on Windows 10

A few weeks ago I reinstalled DLC using the .yaml file on GitHub because I had issues updating to the new version. I ran some test videos through the whole process and it worked fine. This week I tried training on a coworker’s labelled videos, and then again on my own videos, because the training kept stopping. There are no errors; the GPU just drops from 5-6% usage to 0% and training stops completely. The most successful run reached 112,000 iterations before freezing, and it did save snapshots at 50,000 and 100,000. When I restarted from the last snapshot, it froze around 25,000. Some training runs don’t even make it to 10,000.

I run the Anaconda prompt as an administrator and use the GUI. Following a GitHub comment on a similar issue, I also unchecked the Anaconda prompt’s “Quick Edit Mode”. On several of the training runs I have kept an eye on CPU and memory usage; they stay around 35% and 50% respectively the whole time.
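For anyone else chasing this, a small polling script can timestamp exactly when utilization drops to zero, which helps correlate the stall with driver or system events. This is just a sketch (the function names are my own, and it assumes `nvidia-smi` is on the PATH):

```python
import subprocess
import time
from datetime import datetime

def parse_utilization(output: str) -> int:
    """Parse nvidia-smi's csv,noheader utilization output, e.g. '5 %' -> 5."""
    return int(output.strip().rstrip("%").strip())

def watch_gpu(poll_seconds: int = 30) -> None:
    """Log a timestamp whenever GPU utilization hits 0, so the stall time is recorded."""
    while True:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader"]
        ).decode()
        if parse_utilization(out) == 0:
            print(f"{datetime.now().isoformat()}  GPU idle - training may have stalled")
        time.sleep(poll_seconds)
```

Leave it running in a second prompt while training; cross-referencing its log with the Windows Event Viewer can show whether the driver reset at the same moment.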

Possibly unrelated: my coworker has a very similar setup and runs DLC from Jupyter. Her training reached around 13,000 iterations before the kernel stopped, and when she tried to run it again she got an error saying cuDNN was not properly initialized. I’m not sure these are connected, but these are the issues we have both run into.

System specs: Windows 10, NVIDIA GeForce GTX 1060 6GB

nvidia-smi output: [screenshot]

DLC specs from “conda list” in the environment:

python 3.6.10
cudatoolkit 10.0.130
tensorflow-gpu 1.13.1
cudnn 7.6.5

Any help is appreciated, thanks!

Here is the GitHub issue that suggested unchecking Quick Edit Mode. I don’t think it is an issue with the images, because I tried it on both new data and old data we had trained on previously. I also tried pressing a few keys when the GPU stopped to see if that would restart the training.


Hey @E-Edwards - this is still an enigma to us; it would actually be really helpful for someone to document this extremely well, i.e. a video recording of how the code was run, and then open an issue on the TensorFlow repo. If you are up for this, we would really appreciate it. We don’t use Windows to train models, so this is hard for us to test rigorously as well.

Hello, I should have time to do this, maybe not this afternoon but sometime next week for sure. My plan is to take videos we have used before and create a new project with them. I’ll label a handful of frames, then start training and keep recording until the GPU stops.
What would be most helpful to you?


Hello again, I made the recording of my project creation and training. I cannot load a video into this forum though. What would be the best way to send it to you?

great! thanks!


Okay… I’ve asked around, and I’ve now heard from an extremely reputable source that this is a relatively well-known Windows GPU driver issue (another reason to use Ubuntu ;).

I don’t know which resource has the “right” solution, but if it’s a driver version issue I would roll back to something known to work. We use the 384 driver series and have never had this issue; that means using CUDA 9 and TensorFlow < 1.12. No DLC performance issues with this setup.

I’m sorry not to be more help, but if you want to investigate further you can search for GPU stalling/driver issues. The “good” thing is that it isn’t TensorFlow or DeepLabCut.

This solution worked. My steps were:

1. Uninstall CUDA 10.2.
2. Install driver 385.54.
3. Install CUDA 9.0.176.
4. Delete the DLC environment.
5. Re-create the DLC environment, changing tensorflow-gpu in the “.yaml” file to 1.11.0.
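In case it helps anyone else, the environment-file edit was just changing the TensorFlow pin. This fragment is illustrative only; the surrounding entries in your copy of the DLC-GPU .yaml file will differ:

```yaml
# Illustrative fragment only - your DLC-GPU .yaml has more entries.
dependencies:
  - python=3.6
  - cudnn
  - pip:
    - tensorflow-gpu==1.11.0   # changed from the repo's default pin
```

After editing, I removed the old environment (`conda env remove -n <env-name>`) and re-created it with `conda env create -f <file>.yaml`.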

This might be unrelated, but before creating the environment I also had to copy “libssl-1_1-x64.dll” from the Anaconda/DLLs folder into the Anaconda/Library/bin folder to get past an installation error saying “the procedure entry point openssl_sk_new_reserve could not be located”.
I found this solution in a Stack Overflow post here

After these changes I let DLC train overnight and it had reached 550,000 iterations by this morning. Seems good to go. Now I just need to check whether other programs we run are affected by the different CUDA version.


I just tried this solution, but hit a snag installing the 385.54 driver: “The graphics driver could not find compatible graphics hardware”. I assume this is because I have a relatively new Titan RTX in this Windows 10 machine. I had it working before with CUDA 10.1 and driver 426.00, so after noticing my driver version was no longer 426 (why?!), I tried that combination again and… it works again! Not sure this tidbit will be helpful, but while I was having this stall issue the GPU only ever made it to about 40% utilization, whereas now (and before the issue arose) it sits at about 70% unless I overclock the GPU.


I think I had the same issue (training just stopped progressing after some iterations) on Windows 10 with CUDA 10. Downgrading to CUDA 9 and installing an older driver seems to have worked for me (thanks @E-Edwards!).

I installed DLC-GPU after changing tensorflow-gpu==1.11 and python=3.6 in the yaml file. I have a Quadro P5000 and the closest driver I could find was the 392 series, but it seems to have worked.
