I’m currently trying to capture, as well as possible, the position of the feet of my (human) subjects. I’m not entirely satisfied with the results: I want the model to reliably distinguish the right foot from the left. So far, I’ve trained my own resnet50 network, in two different ways:
1) Train a network “from scratch” for, say, 40k iterations, then add new videos (of different subjects) and retrain for 25k, then extract the outlier frames from one of the videos, refine them, and retrain for another 25k. At this point, the results are quite good but not perfect: when the feet cross (at the middle of the swing phase), the network sometimes (not always) swaps the two feet.
2) Train a network “from scratch” on the whole set of labeled frames mentioned in 1) above, in one go. I ran it for 100k iterations (the loss plateaued after 60k, around 0.0008). The results are significantly worse than with strategy 1): the left and right foot are sometimes swapped for a whole gait cycle.
This raises several questions:
- Are any of you surprised by these results? I would expect both strategies to give roughly the same result, with perhaps 2) reaching the objective faster…
- Do you see any flaws in the two strategies? (e.g. in 1), I extracted outliers from only one of the videos; perhaps that doesn’t make sense, and “all” the errors should be corrected?)
- Does anyone have experience with this kind of challenge (capturing the feet)? Are there specific “recipes” (type of network, type of optimizer, workflow, etc.) that help get better results?
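One partial remedy for the swap problem, independent of how the network is trained, is to re-assign the two foot detections in post-processing based on trajectory continuity rather than trusting the predicted labels near crossings. Here is a minimal sketch in plain Python; the data layout (one `((lx, ly), (rx, ry))` tuple per frame) and the assumption that the first frame is labeled correctly are mine, not the toolbox’s output format:

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def relabel_feet(frames):
    """Re-assign (left, right) foot detections frame by frame.

    Each frame's pair of detections is matched to whichever previous
    foot position it is closest to overall, which undoes label swaps
    that happen when the feet cross.

    frames: list of ((lx, ly), (rx, ry)) tuples, one per frame.
    Returns a new list with a temporally consistent left/right assignment.
    """
    if not frames:
        return []
    out = [frames[0]]  # trust the first frame's labels
    for left, right in frames[1:]:
        prev_left, prev_right = out[-1]
        # Total displacement if we keep the predicted labels...
        keep = dist(left, prev_left) + dist(right, prev_right)
        # ...versus if we swap them.
        swap = dist(left, prev_right) + dist(right, prev_left)
        out.append((left, right) if keep <= swap else (right, left))
    return out
```

This assumes a reasonably high frame rate relative to gait speed; if per-keypoint confidence scores are available, you could additionally gate the re-assignment on likelihood.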
I’ve also tried the pre-trained human network (model zoo). It works very well, but it is more detailed than I need (I don’t need the upper limbs). Can it be used with a different number of body parts?
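On that last point, I don’t know whether the pre-trained human network can be retrained with fewer body parts, but a cheap workaround would be to run it as-is and simply discard the body parts I don’t need in post-processing. A sketch, assuming predictions come as a per-frame dict keyed by body-part name (the part names here are placeholders, not the model’s actual labels):

```python
# Hypothetical body-part names; the real model's labels may differ.
KEEP = {"left_ankle", "right_ankle", "left_foot", "right_foot"}

def keep_feet_only(prediction):
    """Drop every body part except the feet/ankles from one frame's output.

    prediction: dict mapping body-part name -> (x, y) coordinates.
    """
    return {part: xy for part, xy in prediction.items() if part in KEEP}
```

For example, `keep_feet_only({"left_ankle": (12.0, 80.0), "nose": (15.0, 10.0)})` keeps only the ankle entry.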
(I don’t expect direct answers to all of these questions; they are mainly there to clarify my thoughts. Comments of any kind are welcome.)
Thanks a lot!