Normalizing the x,y coordinates for different image sizes

A two-part question:

The output from DLC gives you the x,y coordinates of each tracked joint in every image.

However, these coordinates are relative to the image size and to the distance of the animal from the camera. That’s not a problem if all your videos are the same size and the camera-to-animal distance is fixed, but it makes data analysis difficult when you have neither.

Is there a way of normalizing the x,y coordinates to deal with the differences in image size (some are 640x360, some 1280x720 and others 1920x1080)? I had initially thought:

xNorm = x/width of image
yNorm = y/height of image

But I wasn’t sure whether that properly preserves the aspect ratio of the image.

On the distance of the animal from the camera: once the image size is normalized and the x,y coordinates reflect a normalized position, is there a way of engineering each x,y position further to reflect the animal’s distance from the camera without knowing the focal length? Obviously I don’t want to create new x,y values that destroy the animal’s actual physical size (e.g. making it appear smaller or larger than it really is). I was thinking of some kind of ratio relative to the distance to (0,0), but my domain knowledge of image geometry has let me down.

That seems reasonable to me. Why do you think it should affect the aspect ratio? As long as all your videos share the same aspect ratio (your three resolutions are all 16:9), dividing x by the width and y by the height gives normalized positions that are directly comparable across sizes; see the sketch below for both variants.
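For what it’s worth, here is a minimal Python sketch of both variants: per-axis normalization (your formula) and uniform scaling by the width, which additionally keeps distances and angles undistorted if you ever compute geometry in the normalized space. The function name and the example coordinates are just illustrative:

```python
def normalize_coords(x, y, width, height, preserve_aspect=True):
    """Normalize a pixel coordinate for a given frame size.

    preserve_aspect=True divides both axes by the same factor
    (the width), so distances and angles computed in normalized
    space keep their pixel-space geometry. preserve_aspect=False
    is the per-axis version (x/width, y/height), which is fine
    for comparing positions across videos that all share one
    aspect ratio, but distorts shapes in normalized space.
    """
    if preserve_aspect:
        return x / width, y / width
    return x / width, y / height

# The same physical point seen in two 16:9 videos of different
# resolution normalizes to identical values:
print(normalize_coords(640, 360, width=1280, height=720))   # (0.5, 0.28125)
print(normalize_coords(960, 540, width=1920, height=1080))  # (0.5, 0.28125)
```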

While you might have some luck guessing the distance from the apparent size of the animal in the video, that can only be a very rough estimate.
A better approach would be to calibrate your field of view, e.g. by filming a ruler at different distances from the camera. If your animal moves on a two-dimensional plane (rather than moving significantly in three dimensions), you can then deduce the depth position from the x,y coordinates in the field of view.
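Here is a minimal sketch of that calibration idea, assuming the animal moves on a flat plane receding from the camera, so the local pixel-to-cm scale depends only on the image row. All the numbers and the helper name pixels_to_cm are made up for illustration:

```python
import numpy as np

# Hypothetical calibration: a 10 cm ruler filmed at several spots in
# the frame. For each placement, record the image row (y, in pixels)
# where the ruler sits and how many pixels its 10 cm spanned there.
ruler_y_px = np.array([600.0, 450.0, 300.0, 150.0])    # image rows
ruler_span_px = np.array([200.0, 150.0, 100.0, 50.0])  # pixels per 10 cm
px_per_cm = ruler_span_px / 10.0

# Fit a polynomial mapping image row -> local scale. A linear fit is
# often enough for a flat floor; check the residuals before trusting it.
scale = np.polynomial.Polynomial.fit(ruler_y_px, px_per_cm, deg=1)

def pixels_to_cm(displacement_px, y_px):
    """Convert a pixel displacement measured at image row y_px into
    centimetres, using the locally calibrated scale."""
    return displacement_px / scale(y_px)

# Example: a 120 px step measured near the bottom of the frame
print(pixels_to_cm(120.0, y_px=580.0))  # ~6.2 cm
```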