Capture left and right foot separately

Hi all,

I’m currently trying to capture the position of my (human) subjects’ feet as accurately as possible. I’m not entirely satisfied with the results: I want it to always distinguish between the right and left foot. So far, I’ve trained my own resnet50 network, and I’ve done it in 2 different ways:

  1. train a network “from scratch” up to, say, 40k iterations; then add new videos (of different subjects) and retrain for 25k; then extract the outliers on one of the videos and retrain for 25k. At this point the results are quite good, but not “perfect”, i.e., when the feet cross (in the middle of the swing phase), my network sometimes (not always) mixes up the two feet.

  2. train a network “from scratch” on the whole set of labeled frames mentioned in 1) above, in one go. I ran it for 100k iterations (the loss plateaued after 60k, around 0.0008). The results are significantly worse than with strategy 1), i.e., left and right foot are sometimes mixed up for a whole gait cycle.
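For reference, strategy 1 maps onto the standard DeepLabCut refinement loop. The function names below are the real DLC API, but the config path, video path, and iteration counts are placeholders (the calls are commented out since they require a trained project):

```python
# Strategy 1 sketched as the standard DeepLabCut refinement loop.
# Placeholder project config and the three training rounds described above:
config_path = 'path/to/project/config.yaml'
iterations = [40_000, 25_000, 25_000]

# import deeplabcut
# deeplabcut.train_network(config_path, maxiters=iterations[0])
# ...label the newly added videos, then merge and retrain:
# deeplabcut.merge_datasets(config_path)
# deeplabcut.create_training_dataset(config_path)
# deeplabcut.train_network(config_path, maxiters=iterations[1])
# ...extract outliers from one video, correct them, retrain:
# deeplabcut.extract_outlier_frames(config_path, ['videos/walk1.avi'])
# deeplabcut.refine_labels(config_path)
# deeplabcut.merge_datasets(config_path)
# deeplabcut.create_training_dataset(config_path)
# deeplabcut.train_network(config_path, maxiters=iterations[2])

print(sum(iterations))  # total iterations across the three rounds
```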

This raises several questions:
- Are some of you surprised by these results? I would expect both results to be “roughly the same”, with perhaps 2) reaching the objective faster…
- Do you see any flaws in the 2 strategies? (e.g. in 1), I extracted outliers only on one of the videos, but perhaps this does not make sense, and “all” the errors must be corrected?)
- Do some of you have experience with this kind of challenge (i.e., capturing the feet)? Are there specific “recipes” (type of network, type of optimizer, workflow, etc.) that help produce better results?

I’ve also tried the “pre-trained” human network (model zoo). It works very well, but it is too detailed for me (I don’t need the upper limbs); can it be used with a different number of body parts?

(I don’t expect direct answers to all my questions; they are there just to clarify my thoughts. Any comments of any kind are welcome.)

Thanks a lot!


I can see retraining the network having a more positive effect than simply running more iterations.

In any case, your best bet would be to use the pre-trained human network and simply filter out the rest of the body. You can specifically pick and choose body parts.
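For the analysis output itself, the .h5 files DeepLabCut writes use a (scorer, bodyparts, coords) column MultiIndex, so you can drop the upper-body columns with a pandas slice. A minimal sketch with a synthetic stand-in table (the scorer and body-part names here are hypothetical; use the ones from your config.yaml):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for a DeepLabCut output table: columns form a MultiIndex
# of (scorer, bodyparts, coords), as in the real .h5 files.
scorer = 'DLC_resnet50_demo'  # hypothetical scorer name
bodyparts = ['leftfoot', 'rightfoot', 'lefthand', 'righthand']
coords = ['x', 'y', 'likelihood']
cols = pd.MultiIndex.from_product([[scorer], bodyparts, coords],
                                  names=['scorer', 'bodyparts', 'coords'])
df = pd.DataFrame(np.random.rand(5, len(cols)), columns=cols)

# Keep only the feet: slice on the 'bodyparts' level of the column index.
feet = df.loc[:, (slice(None), ['leftfoot', 'rightfoot'], slice(None))]
print(feet.columns.get_level_values('bodyparts').unique().tolist())
```

With real data you would load the table via `pd.read_hdf(...)` instead of building it by hand; the slicing is the same.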

If you want a labeled video without the upper body, pass the body parts you want to the displayedbodyparts parameter of the create_labeled_video() function, which defaults to ‘all’.
displayedbodyparts: list of strings, optional
    This selects the body parts that are plotted in the video. Either 'all', in which case all body parts
    from config.yaml are used, or a list of strings that is a subset of the full list.
    E.g. ['hand', 'Joystick'] for the demo Reaching-Mackenzie-2018-08-30/config.yaml to select only these two body parts.
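So in your case the call would look something like the following. The config path and label names are placeholders (adjust to your project), and the call itself is commented out since it needs an analyzed video:

```python
# Hypothetical project path and label names -- adjust to your config.yaml.
config_path = 'path/to/GaitStudy/config.yaml'
feet_only = ['leftfoot', 'rightfoot']

# import deeplabcut
# deeplabcut.create_labeled_video(config_path, ['videos/subject01.avi'],
#                                 displayedbodyparts=feet_only)

print(feet_only)
```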

You can read the docs at: