Suggestions for understanding the principle of this method

This is a great tool for tracking animal motion. I have applied this method to the ear and nose motions of my bats, and the results are very good.

But I have a few questions about the underlying principles. I understand that we start from a ResNet pre-trained on ImageNet, so why do we still need to train the DNN again when we already have the weight matrices? For example, since I am working with bats, why can't we pick out the weights for the ‘bats’ class from the last layer of ResNet-50 and use them directly for the next step (the deconvolutional layers)?
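
To make the question concrete, here is a minimal NumPy sketch of what I mean by selecting the ‘bats’ weights. Everything in it is an assumption for illustration: the arrays are random stand-ins rather than real ResNet-50 weights, the shapes (1000 classes, 2048 pooled features, a 7 x 7 feature grid for a 224 x 224 input) are the standard ImageNet ResNet-50 ones, and `BAT_CLASS_INDEX` is a hypothetical placeholder, since I have not checked which class row, if any, corresponds to bats.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the final fully connected layer of an ImageNet ResNet-50:
# 1000 classes x 2048 pooled features. Random values, NOT real weights.
fc_weight = rng.standard_normal((1000, 2048))

# Hypothetical class index; which row (if any) matches bats is an assumption.
BAT_CLASS_INDEX = 0

# "Selecting the bats weights" from the last layer: one row of the matrix.
bat_row = fc_weight[BAT_CLASS_INDEX]  # shape (2048,)

# Stand-in for the backbone output for one image, before global pooling:
# a 7 x 7 grid of 2048-dimensional feature vectors.
feature_map = rng.standard_normal((7, 7, 2048))

# Global average pooling, then the dot product with the selected row.
pooled = feature_map.mean(axis=(0, 1))  # shape (2048,)
bat_score = float(pooled @ bat_row)     # a single class score for the image

print(bat_row.shape, pooled.shape, type(bat_score).__name__)
```

Selected this way, the row produces one class score per image rather than a per-pixel map, which is part of why I am unsure how it would connect to the deconvolutional layers.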