Train_network stops without error on Windows 10 machine w/GPU

Setup: Windows 10, GeForce GTX 1080 Ti, CUDA 10.0; all DLC code runs in the provided DLC environment.

I’m experiencing an issue where train_network runs fine (and saves checkpoints) for anywhere between 1,000 and 50,000 iterations and then fails: Python freezes and GPU usage drops to zero without any warning or error being thrown. I recognize that this is likely a hardware- or machine-specific issue, but I’m curious whether anyone has suggestions for next steps. Is there information in the generated log files (“events.out.tfevents…”) that might be useful to me?

Things I’ve tried so far:
- Full restart
- Training on my own data (640x480 px at 30 Hz) and on the example data (Mackenzie reaching data); both led to the same result (I was worried about corrupt frames)
- Changing the global_scale variable in the pose_cfg file from the default 0.8 to 0.6; the observed behavior did not change
- Verifying that the computer was not going into power-save or sleep mode
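
Since the freeze produces no warnings, one further thing that could be tried is asking Python itself to dump a stack trace periodically, so that the last dumps show where the process is stuck. The following is only a minimal sketch using the standard-library faulthandler module; the helper names, the log file name, and the 10-minute interval are my own assumptions, not anything provided by DLC.

```python
import faulthandler

def arm_hang_watchdog(timeout_s=600, log_path="hang_traceback.log"):
    """Dump the stack of all threads to log_path every timeout_s seconds.

    Hypothetical helper: if training freezes, consecutive dumps in the
    log should keep showing the same stuck frame.
    """
    log_file = open(log_path, "w")
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file

def disarm_hang_watchdog(log_file):
    """Stop the periodic dumps and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

The idea would be to call arm_hang_watchdog() just before train_network() and then inspect the log after the next freeze. Whether the trace points at Python code or at a native call inside the GPU stack would already narrow things down.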

Here’s my nvidia-smi output while running train_network():

| NVIDIA-SMI 441.08       Driver Version: 441.08       CUDA Version: 10.2     |
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1080    WDDM | 00000000:03:00.0  On |                  N/A |
| 52%   77C    P2   110W / 180W |   7076MiB /  8192MiB |     44%      Default |

I am still worried about this (as discussed on GitHub): nvidia-smi reports CUDA 10.2, which is NOT compatible at all. TF 1.13.1 (the latest version we support at this time) only runs on CUDA 10.0.

NVIDIA-SMI 441.08 Driver Version: 441.08 CUDA Version: 10.2

But maybe others have thoughts.
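
To make the version concern above concrete, here is a minimal sketch that pulls the driver and CUDA versions out of an nvidia-smi header line and compares the CUDA value against 10.0. The function name and the EXPECTED_CUDA constant are made up for illustration, and the regular expressions assume the header format shown in this thread.

```python
import re

# Assumption taken from the discussion above: TF 1.13.1 expects CUDA 10.0.
EXPECTED_CUDA = "10.0"

def parse_smi_header(line):
    """Extract the driver and CUDA version strings from an nvidia-smi header line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }

# Header line as posted in this thread.
header = "NVIDIA-SMI 441.08 Driver Version: 441.08 CUDA Version: 10.2 |"
versions = parse_smi_header(header)
matches = versions["cuda"] == EXPECTED_CUDA
print(versions, "matches expected:", matches)
```

This only checks the version string that nvidia-smi prints. The installed toolkit version can be checked separately with `nvcc --version`.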