Understanding readout and location refinement better

First, thank you for developing this amazing software toolbox.
I appreciate the active development and community involvement in this project.

I have some questions about the part of the model architecture that produces the score maps and the location refinement maps.

As far as I understand, DeeperCut adapts the ResNet by removing the global average pooling and the FC layer + softmax.
The stride of the last conv bank is changed to 1 and dilated convolutions are used, to prevent further down-sampling of the feature maps.

Two (?) independent deconvolutional layers are appended, each up-sampling the feature maps by 2x, and each is fused with the output feature maps of conv3 to recover finer details. The output is then 8x coarser than the input resolution, meaning that one cannot simply overlay the output maps on the input image to see the activations.
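To make the resolution arithmetic concrete, here is a minimal sketch of how I arrive at the factor of 8. It assumes a standard ResNet with a total stride of 32; the function names are hypothetical and are not part of DLC or DeeperCut, and the exact output size will also depend on padding.

```python
def output_stride(backbone_stride=32, dropped_stride=2, deconv_upsampling=2):
    """Net down-sampling factor between the input image and the output maps.

    Assumption: the backbone normally down-samples by 32, setting the last
    conv bank's stride to 1 removes one factor of 2, and the deconv layer
    up-samples by another factor of 2.
    """
    return backbone_stride // (dropped_stride * deconv_upsampling)


def output_size(input_size, stride):
    """Approximate output map size along one axis (ceil division)."""
    return -(-input_size // stride)


def cell_center(index, stride):
    """Input-image coordinate at the center of an output-map cell."""
    return index * stride + stride / 2


stride = output_stride()
print(stride, output_size(400, stride), cell_center(3, stride))
```

So an output cell would have to be scaled by the stride (plus half a cell) before it can be drawn on the input image.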

The first deconv layer produces the score maps:
For each labeled body part, a distinct feature map is produced and a sigmoid activation is applied, so that each pixel expresses a probability between 0 and 1 (body part present in this area or not). This is then fed into a fully connected layer with softmax loss.

Now here is the part where I am very unsure:
The second deconv layer is supposed to refine the x and y coordinates of the score maps (because the output is only 1/8 of the input resolution). This is done by regressing onto the high-resolution x and y coordinates in the input image, with a robust L1 loss in a following FC layer.
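Here is a minimal NumPy sketch of how I imagine the two outputs being combined into a final coordinate. This is only my reading, not DLC's actual code: `decode_pose`, the array layouts, and the assumption that the offsets are already in input-pixel units are all hypothetical.

```python
import numpy as np


def decode_pose(scmap, locref, stride=8):
    """Combine score maps and offset maps into one (x, y, p) per body part.

    Assumed (hypothetical) layouts:
      scmap:  (H, W, J)    per-pixel probabilities for J body parts
      locref: (H, W, J, 2) predicted (dx, dy) offsets in input pixels
    """
    n_joints = scmap.shape[2]
    pose = np.zeros((n_joints, 3))
    for j in range(n_joints):
        # Coarse location: the cell with the highest score for this body part.
        row, col = np.unravel_index(np.argmax(scmap[:, :, j]), scmap.shape[:2])
        # Refinement: shift the cell center by the regressed offset.
        dx, dy = locref[row, col, j]
        x = col * stride + stride / 2 + dx
        y = row * stride + stride / 2 + dy
        pose[j] = (x, y, scmap[row, col, j])
    return pose


scmap = np.zeros((4, 5, 2))
scmap[1, 2, 0] = 0.9
scmap[3, 0, 1] = 0.7
locref = np.zeros((4, 5, 2, 2))
locref[1, 2, 0] = (1.5, -2.0)
print(decode_pose(scmap, locref))
```

In this picture each body part is decoded independently from its own score map, and the offset only moves the estimate within roughly one output cell.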

The labeled target pixel is expanded into a target region that includes all pixels within a 17-pixel threshold around the center.
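As a sketch of what I think this means for the training targets: every output cell whose center lies within the threshold of the label is marked positive, and those cells also get an offset target pointing at the label. I am assuming the 17 pixels are a radius measured in input-image coordinates; `make_targets` and its parameters are hypothetical names for illustration only.

```python
import numpy as np


def make_targets(label_xy, map_shape, stride=8, radius=17.0):
    """Build a binary score-map target and offset targets for one body part.

    Assumptions: label_xy is (x, y) in input pixels, map_shape is (H, W) of
    the output map, and radius is measured in input pixels.
    """
    rows, cols = np.mgrid[0:map_shape[0], 0:map_shape[1]]
    # Input-image coordinates of each output cell's center.
    cx = cols * stride + stride / 2
    cy = rows * stride + stride / 2
    dx = label_xy[0] - cx
    dy = label_xy[1] - cy
    # Positive cells: centers within `radius` of the labeled point.
    score_target = (dx**2 + dy**2 <= radius**2).astype(np.float32)
    # Offsets are only defined (non-zero) on positive cells.
    offset_target = np.stack([dx, dy], axis=-1) * score_target[..., None]
    return score_target, offset_target


score, offsets = make_targets((20.0, 20.0), (6, 6))
print(score)
```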

Now my questions are:

  1. How far off is my understanding of the readout layers above?
  2. How is inference done? Are the stacked score maps scanned pixel by pixel, with the (body part) score map that has the highest probability simply taken to create a marker?
  3. DeeperCut refers to the Fast R-CNN paper for location refinement. Does that mean that DLC also uses non-maximum suppression and 2D Gaussians on overlapping score maps to create the final discrete marker?
  4. Can the regression of the location maps be understood as simply a calibration?

I apologize if this is confusing. I think my confusion stems from DLC using only the first part of DeeperCut, and DeeperCut in turn referring to only a part of Fast R-CNN for the location refinement.

Thank you!