Restarting training from a particular snapshot in Colab

I am using Google Colab to train DLC. For training, we are using 6 videos (about 600-1000 frames each) of guppies that were caught in different locations and have different prey types. The 6 videos were chosen based on whether the video has a black background/white fish, white background/black fish, or grey background/grey-ish fish, so that DLC can train on different-looking videos. After training, we would like to analyze a few hundred videos we have of the guppies to study feeding kinematics. When using Colab, we time out of the GPU around 100,000 iterations, which is causing issues in the training and results in high body part detection error. In the paper it says to pick up where training left off follow

However, when we change the path in the pose_config file, Colab recognizes the new path but the new training is seemingly never saved. If we train the first time to 100K iterations and then pick up training for another 89K iterations before the timeout, all previous iterations and checkpoints are overwritten and the only saved iterations are those from the 89K.
Is this possibly due to an error in the change of code we are using?
We tried using init_weights: /usr/local/lib/python3.6/dist-packages/deeplabcut/pose_estimation_tensorflow/models/pretrained/resnet_v1_101.ckpt-snapshot-109800 but DLC did not recognize this. We then went to the checkpoint file and used the path of the last saved checkpoint: init_weights: /content/drive/My Drive/DLC/Examples/SixGuppyTrainingVideos-victoria-2019-07-22/dlc-models/iteration-0/SixGuppyTrainingVideosJul22-trainset60shuffle1/train/snapshot-109800
We were able to start training with the path above, but with no increase in the quality of training and no way to tell whether the new iterations are actually being added to the previous training iterations.
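For anyone else looking for the right value: the init_weights path that worked above is the snapshot's path prefix with no file extension. Below is a minimal sketch for finding the highest-numbered snapshot in a train folder. The helper name `latest_snapshot_prefix` is hypothetical (not part of DLC), and it assumes each saved snapshot has a `snapshot-<N>.index` file next to its `.meta` file.

```python
import re
from pathlib import Path


def latest_snapshot_prefix(train_dir):
    """Return the path prefix (no extension) of the highest-numbered snapshot.

    Hypothetical helper, not part of DLC. Assumes every snapshot has a
    'snapshot-<N>.index' file. Returns None if no snapshot is found.
    """
    best_iter, best_prefix = -1, None
    for index_file in Path(train_dir).glob("snapshot-*.index"):
        match = re.fullmatch(r"snapshot-(\d+)\.index", index_file.name)
        if match and int(match.group(1)) > best_iter:
            best_iter = int(match.group(1))
            best_prefix = index_file.with_suffix("")  # drop ".index"
    return None if best_prefix is None else str(best_prefix)
```

The returned string is what would go after init_weights: in the training config; if it is None, there is nothing to resume from.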
Thanks in advance!

Guppy for good luck :slight_smile:

Thanks for the guppy :slight_smile:

So, the short answer is right now the weights overwrite the older ones, so indeed if it stopped at 89K, the sum would be 189K (assuming you loaded at 100K). This is probably something we should change (@AlexanderMathis); a way to get around this is to create a new merge + new training set, which creates an iteration-1, and saves the weights in a new folder. Or, of course, you can keep a copy of the original weights elsewhere…
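The "keep a copy of the original weights elsewhere" option could look roughly like the sketch below, run before restarting training. The function name `backup_snapshots` and both folder arguments are placeholders, and it assumes the snapshot files all start with `snapshot-` and sit next to a `checkpoint` file in the train folder.

```python
import shutil
from pathlib import Path


def backup_snapshots(train_dir, backup_dir):
    """Copy all snapshot-* files plus the 'checkpoint' file to backup_dir.

    Hypothetical helper, not part of DLC. Returns the copied file names.
    """
    src, dst = Path(train_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(src.iterdir()):
        is_snapshot = f.name.startswith("snapshot-") or f.name == "checkpoint"
        if f.is_file() and is_snapshot:
            shutil.copy2(f, dst / f.name)  # copy2 also keeps timestamps
            copied.append(f.name)
    return copied
```

Note that the copy takes up as much Drive space again as the originals.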

I’m going to piggyback off of this thread since it’s a related topic. I was using Colab to train the network and reached 143k iterations before the GPU timed out. When I go to evaluate the network, I’m getting an error saying that the iteration file does not exist. I’ve run this before with a much smaller iteration count just to make sure the program was working properly and it was fine, so I think something went wrong when the GPU timed out. That being said, the network appears to have only saved 77k iterations, but it made it to 143k in training. Not sure how to fix this, so any advice would be great. Thanks!

How often are you saving the snapshots? You have to set this when you train, with save_iters.

My save_iters was set to 500.

Strange, is 77000, 76500, etc. saved?

I assume so, how would I check that? I just opened my Google Drive to look for it, and noticed that I’m basically maxed out on storage. Could it be that during training, I reached the storage capacity of my drive and that prevented saving past 77k iterations?

[you can check by looking at which snapshots are in the /train folder]

I see, so that is probably what happened. Additional snapshots could not be stored…

Okay, in the /train folder I have snapshot-77000.meta, snapshot-76500.meta, snapshot-76000.meta, etc. I think you’re right to say that it’s a storage issue. Glad to know there’s an easy fix for that.

Now I’m just wondering why I couldn’t evaluate the network with the current saved iterations. If the snapshots exist, shouldn’t the evaluation still run?

A problem I have faced is that the last snapshot is incomplete; there should be 3 files per snapshot but you only have two. Check and you can just delete the incomplete snapshot.
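A quick way to spot an incomplete snapshot is to group the files in the train folder by snapshot name and flag any group with fewer than three parts. This is only a sketch: `incomplete_snapshots` is a made-up helper, and it assumes the three files per snapshot are a `.meta`, an `.index` and one `.data-...` file.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: a complete snapshot has a .meta, an .index and one .data-* file.
EXPECTED_PARTS = 3


def incomplete_snapshots(train_dir):
    """Map each incomplete snapshot name to the file extensions that do exist."""
    parts = defaultdict(set)
    for f in Path(train_dir).glob("snapshot-*"):
        prefix, _, ext = f.name.partition(".")  # "snapshot-77000", "meta"
        parts[prefix].add(ext)
    return {p: sorted(e) for p, e in parts.items() if len(e) < EXPECTED_PARTS}
```

Anything it reports is a candidate for deletion as described above; it only lists files and does not delete anything itself.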

Also, I have had the storage problem with Colab too. The problem is that the files go to your Google Drive trash but don’t get deleted, so you accumulate lots of them, which adds up pretty quickly. I wonder if there is a way to fix this (like having them deleted directly instead of going to the trash).


Thanks for the tip, that fixed the issue and the evaluation is running.

Is there a way to find the weights? (I’m not sure which folder you are referring to that stores the weights). I’m on my 4th “iteration” and I’m not sure how many weights there are total.

In the protocol paper (and on GitHub), the project folder structure is shown. There you can see that “dlc-models” is where the weights are. And from GitHub:

dlc-models: This directory contains the subdirectories test and train , each of which holds the meta information with regard to the parameters of the feature detectors in configuration files. The configuration files are YAML files, a common human-readable data serialization language. These files can be opened and edited with standard text editors. The subdirectory train will store checkpoints (called snapshots in TensorFlow) during training of the model. These snapshots allow the user to reload the trained model without re-training it, or to pick-up training from a particular saved checkpoint, in case the training was interrupted.
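To see how many snapshots exist across all iterations, something like the sketch below could be run on the project folder. It is not a DLC function; `snapshots_by_train_folder` is a hypothetical helper that assumes the dlc-models/iteration-N/<shuffle name>/train layout seen in the path earlier in this thread, with one `snapshot-<N>.index` file per saved snapshot.

```python
import re
from pathlib import Path


def snapshots_by_train_folder(project_dir):
    """Map each train folder (relative to the project) to its snapshot numbers.

    Hypothetical helper. Assumes the layout
    dlc-models/iteration-*/<shuffle name>/train/snapshot-<N>.index
    """
    project = Path(project_dir)
    result = {}
    for train_dir in sorted((project / "dlc-models").glob("iteration-*/*/train")):
        iterations = []
        for f in train_dir.glob("snapshot-*.index"):
            match = re.fullmatch(r"snapshot-(\d+)\.index", f.name)
            if match:
                iterations.append(int(match.group(1)))
        result[train_dir.relative_to(project).as_posix()] = sorted(iterations)
    return result
```

Printing the result gives one entry per iteration/shuffle with the saved iteration numbers, so the total number of stored weights is the sum of the list lengths.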