Large train and test errors when evaluating network


I’ve been getting really large train and test errors using the newest version of DeepLabCut, compared with an older version (2.1.4).

For the same set of labeled videos, training with the older (2.1.4) version yielded train and test errors of 1.47 pixels and 2.36 pixels, respectively. In contrast, the newest version gives train and test errors of 118.87 pixels and 117.04 pixels, respectively.

I first noticed the large error after refining labels with extracted outlier frames, so as a control I trained with a set of labeled videos that I know should have given a low error. It seems like something has changed, but I don’t know what.

I just ran this test script with the latest DLC and I get the same performance as previously (incl. 2.1.4), i.e.:

Done and results stored for snapshot: snapshot-15001
Results for 15001 training iterations: 95 12 train error: 2.9 pixels. Test error: 2.91 pixels.
With pcutoff of 0.4 train error: 2.9 pixels. Test error: 2.91 pixels
Here, the errors are given by the average distances between the labels placed by DLC and those of the human scorer.

How do the labeled test images look (evaluate_network with plotting=True)? Are the predictions good or wildly off as the numbers may suggest?

The labeled test (and training) images are really off. For example, if four features are labeled, the train and test images both have labels approximately where the four features are, but the labels are swapped. E.g., if one feature corresponds to “nose” and is supposed to be labeled with a blue dot, DLC labels that feature with a pink dot, which corresponds to a different feature. The mislabeling seems consistent (a feature that should be blue is always mislabeled pink).
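For what it’s worth, a consistent name swap like this is enough on its own to produce large mean errors: evaluation compares same-named points, so a prediction that sits perfectly on the *wrong* feature is scored as being off by the full feature-to-feature distance. A toy sketch (made-up coordinates, not DLC code):

```python
import math

# Toy ground-truth coordinates (x, y) for four bodyparts in one image.
ground_truth = {
    "nose": (100.0, 100.0),
    "earL": (220.0, 110.0),
    "earR": (230.0, 200.0),
    "tail": (400.0, 380.0),
}

# Predictions that sit exactly on real features, but with two names swapped
# (nose <-> earL), mimicking the consistent mislabeling described above.
predicted = {
    "nose": ground_truth["earL"],
    "earL": ground_truth["nose"],
    "earR": ground_truth["earR"],
    "tail": ground_truth["tail"],
}

def mean_pixel_error(pred, gt):
    """Average distance between same-named points, as in evaluation."""
    dists = [math.dist(pred[bp], gt[bp]) for bp in gt]
    return sum(dists) / len(dists)
```

Here just one swapped pair drags the mean error up to ~60 pixels even though every dot lies on a genuine feature; with most bodyparts permuted, errors around 100 pixels seem plausible.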

Not sure if this makes sense? Thanks for your help!

Given all the testing, this doesn’t seem to be a version issue; but I would like to help you solve this.

If you don’t want to share publicly, could you email me an example image you trained with, one test output image and one train output image from evaluation, and your pose_cfg.yaml file from that training?


I sent out an email with example test and train images, as well as the pose_cfg.yaml file (on Monday night, Pacific time). Thanks for your help!

Thanks, got it! Will get to it this weekend; but given this was a merged dataset, my sense is there is something off with the merge. If you run check_labels, be sure the labels all look correct before training!

Great; thanks! I thought it could’ve been merge related too, so as a control I retrained DLC on a project that was already completed and that previously yielded low test and train errors (this is the same project as the one with the refined labels, just before refinement). I got very large test and train errors this time around too, so it may not be related to the merge.

Something that may or may not be relevant: after running the create_training_dataset function, a list gets printed, which contains what looks like a tuple of a float (training set fraction?), an int, and two arrays. The total length of the two arrays is equal to the number of images that were labeled. I hadn’t seen this before, and I didn’t see it with the demo data that did work (20200408_Colab_DEMO_mouse_openfield).
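For context, that printout sounds like the train/test split bookkeeping: the training-set fraction, a shuffle index, and the arrays of train and test image indices, which together account for every labeled image. A rough sketch of that idea (variable names and logic are my guess, not DLC’s actual internals):

```python
import random

def make_split(num_images, train_fraction=0.95, shuffle=1, seed=0):
    """Sketch of a train/test split tuple like the one create_training_dataset
    appears to print: (fraction, shuffle, train_indices, test_indices)."""
    rng = random.Random(seed)
    indices = list(range(num_images))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * num_images))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # The two arrays partition the labeled images, so their lengths sum to
    # the total number of labeled frames -- matching the observation above.
    return (train_fraction, shuffle, train_idx, test_idx)
```

That the two arrays cover all labeled images is expected, then, and on its own shouldn’t indicate a problem.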

Forgot to add - I checked all the labels on both projects (with and without refined labels). The features all seem to be labeled correctly.



I have no solution, but it may be good to report that I also had this issue with one of my projects in this version (so it’s not an isolated thing). Note that all my previous projects on this version and earlier versions worked fine.

The network was trained on a fairly solid dataset of over 500 labeled frames. I checked them all and they were correct, but the evaluation, and the analysis of one of the videos, yielded results like those described above: the right places but the wrong labels, e.g. a nose label instead of a paw label. (I’m not sure whether they were consistently wrong, i.e. that the paw was always a nose, but I don’t think so.) I tried creating a new dataset with different ResNets, tried removing some folders with labeled frames and creating smaller datasets, and tried creating a new project reusing the labeled data from the broken one, all with similar results and train and test errors around 100 pixels.
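One quick way to check whether the mislabeling is a consistent permutation (rather than random confusion) is to match each predicted point to the nearest ground-truth bodypart in every frame and count which name it lands on. A rough sketch with made-up data structures (not a DLC API):

```python
import math
from collections import Counter

def nearest_gt_bodypart(pred_xy, gt_points):
    """Name of the ground-truth bodypart closest to a predicted point."""
    return min(gt_points, key=lambda bp: math.dist(pred_xy, gt_points[bp]))

def swap_map(frames):
    """For frames = [(predicted, ground_truth), ...] (dicts name -> (x, y)),
    count which ground-truth bodypart each predicted name lands on.
    A consistent permutation shows up as one dominant target per name."""
    counts = {}
    for pred, gt in frames:
        for name, xy in pred.items():
            counts.setdefault(name, Counter())[nearest_gt_bodypart(xy, gt)] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}
```

If the resulting map is the identity, the labels aren’t swapped; if it is a stable permutation (nose always lands on paw, etc.), that points at the label/bodypart ordering rather than the network itself.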

Since then I have created a new project, labeled new frames, etc. I have also worked on a few other projects with different datasets and everything works fine; I never had the problem again.


Thank you all for raising this issue – hosuk88 gave us a very well documented bug report that we could reproduce!

We have a bugfix and will release it in a bit.

It would be great if you could test whether this solves your problem too.

We published the bugfix; please update with:

pip install deeplabcut==


Thanks for the bug fix! Unfortunately, I am still getting large test/train errors with the updated version. Same behavior as before.

But is this on an old project? The project itself would be corrupted, so the code won’t fix it… you’d have to start with iteration 0…

Sorry, my first message may have been unclear. When I saw the large test and train errors after refining labels, I thought it might have had to do with refinement, so I retrained with iteration-0 as a control and still got large test and train errors. This is still the case with the updated version.

But we tested the code, and there is no difference in this part, so there is no way it’s a version-specific thing. Please (1) run check_labels and validate that your labels fully make sense, (2) delete the folders dlc-models and evaluation-results (of course feel free to back up the project first), and (3) create a brand new training set.
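For step (2), a small housekeeping sketch that backs the two folders up before deleting them (folder names taken from the post above; the function name is mine). Afterwards you would run check_labels and create_training_dataset on the cleaned project as described:

```python
import shutil
from pathlib import Path

def reset_training_artifacts(project_dir, backup_dir):
    """Back up and remove the dlc-models and evaluation-results folders
    so a brand new training set starts from a clean slate."""
    project, backup = Path(project_dir), Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    for name in ("dlc-models", "evaluation-results"):
        src = project / name
        if src.exists():
            shutil.copytree(src, backup / name)  # back up first
            shutil.rmtree(src)                   # then delete from the project
```

This only touches the model and evaluation artifacts, so the labeled data and config stay in place for the fresh training set.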

I ran DLC on a completely different project that was made with 2.1.4, and this time I didn’t get large train and test errors (i.e. the errors are as expected). This outcome is perhaps similar to what the other user who commented on this post described.

I’ll try to figure out when the large test and train errors show up for the other project.


Update: The other project, after refining labels (iteration-1), yielded low test and train errors when training with DLC 2.1.4, in contrast to the newest DLC version. This is the same result I got with iteration-0. The only difference between these runs is the DLC version used for training.

Given that this doesn’t seem to be a bug that always comes up, I don’t know how many people’s projects could be affected. If I pinpoint the bug (or at least where it could be coming from), I’ll update again.

Hi @Sarah1. Here is what I would suggest.

Create a model comparison in 2.1.4 so you have identical train/test splits; otherwise it’s not a fair comparison. Then copy the project (even the same name is fine, just put the copies in two different folders first) and train one copy in 2.1.4 and the other in 2.1.7, for the same number of iterations. Then do the inverse: create the training set in 2.1.7 and train in both versions. That way we’ve set up an experiment to test the effect of create_training_set vs. train_network.

Can you also confirm the evaluation images always look wrong?

I tried and found no difference, so I’d like to see if yours does, so we can figure out what is going on. See the wiki under “How to pick a neural network”.