Tracking of animals with flexible limbs


I am using DeepLabCut to track the whole body of a monkey, including the limb joints and the shape of the tail. I recorded a total of 9 videos with three cameras at different angles (90 degrees between each camera), and extracted 40 frames from each video, for a total of 360 frames. After training for 400k iterations, I refined the labels and used the refined training set to train again from scratch for another 400k iterations.


  • The resulting network tracks the back, head, and similar body parts accurately, but tracking of the limb joints and the tail remains very poor. The limbs are often lost, and the left and right feet are frequently confused.

  • The tail may simply be too variable and flexible; its tracking is poor.

  • Since my cameras are set up on one side of the monkey (due to space constraints), the far side of the monkey is constantly occluded, which prevents me from accurately tracking the limbs on that side.

  • The videos have motion blur.


I hope to track the monkey's whole body more accurately and to address the occlusion and motion-blur problems.

Many thanks to the forum for its help, and for any suggestions!

Continuing our conversation from GitHub.

The plots you have shown there and the evaluation results indicate that the model performs well and that the only remaining problem is occlusion (you can confirm this by extracting outlier frames and checking how many frames actually have mislabeled bodyparts versus how many are simply not visible; probably almost all are the latter). There is not much you can do about that with further retraining, so you should now focus on filtering and interpolating the data. If you have no experience doing this in Python, I'd advise you to use all the options provided by the anipose package (I think I saw your issue there regarding calibration; yesterday I posted a solution to the NoneType error occurring during calibration in one of the issues on the Anipose GitHub).
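If you do end up doing the filtering in Python yourself, a minimal sketch could look like the following. It assumes DeepLabCut-style output with `x`, `y`, and `likelihood` columns per bodypart; the function name, threshold, and window size are illustrative choices, not anipose's API:

```python
import numpy as np
import pandas as pd

def filter_and_interpolate(df, likelihood_threshold=0.6, window=5):
    """Mask low-confidence points, interpolate the gaps, then median-smooth.

    df is assumed to have MultiIndex columns (bodypart, coord) with
    coord in {"x", "y", "likelihood"}, like DeepLabCut's output tables.
    """
    out = df.copy()
    for part in {c[0] for c in df.columns}:
        # points the network was unsure about (likely occluded)
        bad = df[(part, "likelihood")] < likelihood_threshold
        for coord in ("x", "y"):
            s = out[(part, coord)].mask(bad)           # drop unreliable points
            s = s.interpolate(limit_direction="both")  # fill occlusion gaps
            # light median smoothing to suppress jitter and blur-induced jumps
            out[(part, coord)] = s.rolling(window, center=True,
                                           min_periods=1).median()
    return out
```

This is deliberately simple; anipose's own filters (Viterbi, autoencoder, etc.) are more sophisticated, but a likelihood mask plus interpolation already removes most occlusion artifacts.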

Feel free to ask more questions, but in my opinion, first try to filter and interpolate the data and see what you can achieve that way.


Yes, the issue you saw on the Anipose GitHub was probably mine. I am also learning 3D calibration and tracking. I will try your solution later, thank you very much!

In fact, the hands and feet are not occluded that often, yet their tracking is still very poor. After creating labeled videos, I can see that the tracking of the hands and feet disappears in many frames.

I am currently trying to extract outlier frames and then retrain. Just to confirm: after extracting and refining the outliers, should I train iteratively on top of the already-trained network?

Secondly, if my experiment used two cameras facing each other (that is, 180 degrees apart, one viewing the monkey's front and the other its back), would this solve the occlusion problem of a single camera, or would it make it impossible to get a well-performing network?

Yes, you can use the already good model as init_weights. Consider training for fewer than 100k iterations (depending on batch size), because if you want to keep expanding on top of this model, you don't want to overfit it anytime soon.
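For reference, resuming from an existing snapshot is done by pointing `init_weights` in the training `pose_cfg.yaml` at that snapshot. A sketch of the relevant fields (the path placeholders and learning-rate schedule below are illustrative; adapt them to your project and check your DeepLabCut version's defaults):

```yaml
# <project>/dlc-models/iteration-1/<shuffle-dir>/train/pose_cfg.yaml
# (path is a placeholder for your actual project layout)
init_weights: <project>/dlc-models/iteration-0/<shuffle-dir>/train/snapshot-400000
multi_step:          # shorter schedule for fine-tuning, well under 100k iterations
- [0.001, 10000]
- [0.0005, 80000]
```

The key point is only that `init_weights` names the previous snapshot instead of the ImageNet-pretrained backbone, so training continues from your good model rather than from scratch.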

Since triangulation is based on pairs of cameras (the triangle being formed by two cameras and the point of interest), you should get reasonably good results if, for instance, the cameras looking at the back and at the side see the same points. I think it will be most informative once you perform the triangulation and create a labeled 3D video. There is also a whole process to the calibration itself: how dependent it is on resolution versus the quality/size of the squares or bits on the calibration board, and the proper presentation of the board when you record the calibration videos. It usually takes some time before it works in a way you can accept.


However, how do I calibrate two cameras set up on exactly opposite sides? From what I have learned so far, calibration requires a pair of cameras to film the calibration board simultaneously, but that seems difficult for cameras facing each other.

Also, if I train a network with videos captured by a pair of cameras on opposite sides, what impact will that have on 2D tracking?

Thank you for your patience!

3D reconstruction becomes a complex process when your setup constrains the number or the spacing of the cameras.

  1. Before fine-tuning the setup (how the cameras are placed), focus on getting the best tracking possible for each camera view, which you probably have now.

  2. Calibrate the cameras for the space you recorded in (the cameras shouldn't be moved between calibration and recording of the animal). Calibration works out how the space is perceived from the point of view of each camera in relation to every other camera. I'm not sure what your calibration video looks like; a screenshot from each angle might help with providing advice on recording calibration.

  3. Triangulation will use the information about your calibrated space to place your 2D tracked points into it. Since you want to use anipose, consider setting animal_calibration=True; it might help (it might also make things worse, so you should test it).

Until you know how 3D reconstruction actually works out for your data, consider 2D tracking to be independent from it and just get the best pose estimation possible.
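If you go the anipose route, these settings live in the project's `config.toml`. A hedged example follows; the board dimensions and marker sizes are placeholders for your own calibration board, and the exact option names should be double-checked against the anipose docs for your version:

```toml
# config.toml (values are placeholders; match your actual ChArUco board)
[calibration]
board_type = "charuco"          # or "checkerboard" / "aruco"
board_size = [7, 10]            # squares along each dimension
board_marker_bits = 4
board_marker_dict_number = 50
board_marker_length = 18.75     # marker side length, mm
board_square_side_length = 25   # square side length, mm
animal_calibration = true       # the option discussed above; test whether it helps

[triangulation]
triangulate = true
ransac = false
```

Bigger, higher-contrast boards that fill a good portion of each camera's view generally calibrate more reliably, which matters for the resolution-vs-board-size trade-off mentioned above.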

Since I still haven't obtained the desired result, I must bother you again.
The network trained with six 30-second videos performed better than the network trained with twelve 4-minute videos. With the six 30-second videos, a total of 240 frames were extracted (refine_labels was not even performed for this network); with the twelve 4-minute videos, 240 frames were also extracted, and after refine_labels a total of 480 frames were obtained. The training error of both networks when training stopped was 0.022.
After comparing the configurations of the two networks, I want to know whether video length is related to network performance. After all, a monkey's actions over 4 minutes must be more complex than over half a minute. So, for longer videos with more complex behaviors, what do I need to do to get the same excellent performance as the network trained on short videos? Extract more frames? (As far as I can tell, there is no room left for improvement in the number of training iterations or in refine_labels.)
Thanks for any reply.

The training error of both networks when training stopped was 0.022.

Do you mean the loss?

As for performance: you have to evaluate whether any improvement is realistically possible. If I remember correctly, there are a lot of occlusions in your videos, which would show poor results on the plots but, if everything was done correctly, still give a good labeled video.

It would be easier to help with some visuals of the labeled videos (for example, a short GIF comparing the same video analyzed by the two different models you made).
From your previous plots it looked like the model was performing fine and it was just a matter of occlusions, so I'm not sure what else can be done except modifying the setup.

Hi @ZiyiZhang0912! The best you can do is making sure the extracted frames are as diverse as possible, and augmenting your dataset post-training with refined frames. I’m not quite sure I follow your reasoning about video length vs complexity though: unless you find a way to quantify scene statistics/behavior complexity across videos, you won’t be certain you strictly evaluate the effect of video length.