DeepLabCut in Windows - training won't restart from latest snapshot - only possible from some snapshots

My setup: Windows 10 64-bit, Conda 4.5.12, Python 3.6.8, DeepLabCut 2.0.5.1, without a GPU

I have a problem restarting training from the last snapshot. It just hangs on ‘Starting training…’ (I’ve left it for 12 hrs and nothing happens).

I’ve just had this problem resuming at snapshot-34000. I’ve tried 32000 and 30000 and so far it won’t continue training.

I’ve managed to resume training on two previous attempts by going back through the snapshots, until I find one that it works with. When resume does work, training starts fairly immediately (I’ve set display_iters: 50, so I get to see something happening at Command Prompt).

Any thoughts, anyone?

Thanks

I think that I’ve now worked this one out for myself… (!)

It seems to be a problem with the IPython, Python or Conda environment after training has been stopped with [Ctrl] + [C]. Is it somehow not letting go of resources, or leaking memory?

After [Ctrl] + [C] has been used to stop training, other DeepLabCut commands still work as expected (analyze_videos(), create_labeled_video() etc.), but restarting training hangs as I described.

If I shut down iPython, Conda and Command Prompt and then re-open everything (activate [conda environment], ipython, import deeplabcut), continuing from my most recent snapshot IS working.

So, this is solved for me, but I’m still wondering if anyone knows why this is happening?

Thanks

Yes, this can happen when you stop training early (it partly depends on when you hit Ctrl+C). Either way, you will need to free up the memory by stopping the process or closing the TF session. The training code does close the session at the end of training, but if you stop early it's possible that line is never reached: https://github.com/AlexEMG/DeepLabCut/blob/efa95129061b1ba1535f7361fe76e9267568a156/deeplabcut/pose_estimation_tensorflow/train.py#L159

See more here: https://www.tensorflow.org/api_docs/python/tf/Session

A session may own resources, such as tf.Variable, tf.QueueBase, and tf.ReaderBase. It is important to release these resources when they are no longer required. To do this, either invoke the tf.Session.close method on the session, or use the session as a context manager. The following two examples are equivalent:

# Using the `close()` method.
sess = tf.Session()
sess.run(...)
sess.close()

# Using the context manager.
with tf.Session() as sess:
  sess.run(...)

So, in short, to be safe run sess.close() :slight_smile:
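
As a rough illustration of why an early Ctrl+C can skip the cleanup, and one way to guard against it, here is a minimal sketch. It is not DeepLabCut's actual training loop: `FakeSession` and `train` are hypothetical stand-ins so the example runs without TensorFlow, and the interrupt is simulated at a fixed step. The point is only that putting `close()` in a `finally` block makes it run whether training finishes or is interrupted.

```python
class FakeSession:
    """Hypothetical stand-in for tf.Session so the sketch runs without TensorFlow."""

    def __init__(self):
        self.closed = False

    def run(self, step):
        # Simulate the user pressing Ctrl+C partway through training.
        if step == 3:
            raise KeyboardInterrupt
        return step

    def close(self):
        # A real session would release its resources here.
        self.closed = True


def train(sess, max_iters):
    """Run up to max_iters steps and always close the session, even on Ctrl+C."""
    completed = 0
    try:
        for step in range(max_iters):
            sess.run(step)
            completed += 1
    except KeyboardInterrupt:
        # Ctrl+C lands here instead of leaving the session open.
        print("Training interrupted after", completed, "steps")
    finally:
        # Runs on normal completion and on interruption alike.
        sess.close()
    return completed


sess = FakeSession()
completed = train(sess, max_iters=10)
print(completed, sess.closed)
```

If the session object is not reachable from your IPython prompt after an interrupt, restarting the Python process (as described above) has the same effect, since the resources are released when the process exits.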

That explains it. Many thanks :slight_smile:
