Firstly, DLC is amazing and I love it. For that reason, I am trying to improve our video-analysis pipeline, but I'm getting stuck at a point I didn't expect: getting a supercomputer to actually improve throughput/performance.
I have successfully gotten the DLC 2.1.7 Docker image running on the Pittsburgh Supercomputing Center XSEDE GPU-AI system by converting it to a Singularity image. I trained the model on my desktop for easier GUI usage and then copied the project to the cluster. P.S. If there is any interest in a protocol for that workflow, I'd be happy to add instructions somewhere. And if this workflow is contributing to my problems, I'm all ears.
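For reference, this is roughly the workflow I used to get the container onto the cluster (the Docker image reference and filenames below are placeholders, not the exact ones I used):

```bash
# Build a Singularity image from the DLC 2.1.7 Docker image
# (image reference is a placeholder -- substitute whichever DLC 2.1.7 docker build you use)
singularity build dlc-2.1.7.sif docker://<your-dlc-2.1.7-docker-image>

# Quick sanity check that the GPU is visible inside the container
singularity exec --nv dlc-2.1.7.sif nvidia-smi
```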
What has been frustrating is that running deeplabcut.analyze_videos() on an NVIDIA Tesla V100 (Volta, 32 GB) GPU on a node with 128 GB of RAM (all of which is correctly seen by the Singularity image) actually runs fewer iterations per second (17 iter/s) at every batch size I've tried (16, 32, 64, 128) than my desktop with a plain old RTX 2070 (~31 iter/s). Does anyone have any intuition for why that is, or what I might be doing wrong? I don't mind tinkering with the code. I feel like there should be some speed advantage here… but maybe I'm mistaken!
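For context, this is roughly how I'm calling it (paths are placeholders; as I understand it, the `batchsize` argument overrides the `batch_size` value in config.yaml, but please correct me if that's wrong):

```python
import deeplabcut

config_path = "/path/to/project/config.yaml"   # placeholder
videos = ["/path/to/videos/"]                  # placeholder: a folder or a list of video files

# Run inference on the first GPU with an explicit inference batch size.
# I've left TFGPUinference at its default, which I believe is the faster GPU-side inference in 2.1.x.
deeplabcut.analyze_videos(
    config_path,
    videos,
    videotype=".mp4",
    gputouse=0,
    batchsize=64,
    save_as_csv=True,
)
```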
I know I can run Singularity images in parallel with a SLURM job array (a sketch of what I have in mind is below), and I know I can downsample all of my videos. But each call is taking 2x as long, so I feel like that will always be a real limitation. Are there any other DLC options that improve performance on these fancy graphics cards? (batchsize doesn't seem to do much, or maybe I'm using it wrong.) The videos are currently 1280x512.
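This is the kind of job-array script I had in mind for running videos in parallel, one per GPU task (partition/GPU directives, image name, and paths are placeholders that would need adjusting to the GPU-AI nodes):

```bash
#!/bin/bash
#SBATCH --job-name=dlc-analyze
#SBATCH --gres=gpu:1            # placeholder; adjust to this cluster's GPU request syntax
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --array=0-9             # one array task per video (10 videos in this example)

# Pick the video for this array task (assumes a flat folder of .mp4 files)
VIDEOS=(/path/to/videos/*.mp4)
VIDEO=${VIDEOS[$SLURM_ARRAY_TASK_ID]}

# Run DLC inference inside the Singularity image on this task's video
singularity exec --nv dlc-2.1.7.sif python3 -c \
  "import deeplabcut; deeplabcut.analyze_videos('/path/to/project/config.yaml', ['$VIDEO'], gputouse=0, save_as_csv=True)"
```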
Thank you for any insights anyone may have!!!
Node info: