Improving cluster/supercomputer performance (Tesla V100 Volta 16/32GB GPU)

Firstly, DLC is amazing and I love it. Because of that, I am trying to improve our pipeline for analyzing videos, but I'm getting stuck at a point I didn't expect: getting a supercomputer to actually improve throughput/performance.

I have successfully gotten the DLC 2.1.7 docker running on the Pittsburgh Supercomputing Center (PSC) XSEDE GPU-AI system by creating a Singularity image. I trained the model on my desktop for easier GUI usage and then put the project on the cluster. P.S. If there is any interest in a protocol for that workflow, I would be happy to add instructions somewhere. If this workflow is contributing to my problems, I'm also all ears.
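For anyone trying that desktop-to-cluster move, one thing to double-check is that `project_path` in config.yaml points at the new location after the copy. A minimal sketch, assuming hypothetical cluster paths and DLC's `read_config`/`write_config` utilities:

```python
# Minimal sketch: point a project copied from the desktop at its new cluster path.
# Paths are hypothetical; read_config/write_config are DLC utility helpers.
from deeplabcut.utils import auxiliaryfunctions

config_path = "/ocean/projects/myproj/dlc-project/config.yaml"  # new location on the cluster

cfg = auxiliaryfunctions.read_config(config_path)
cfg["project_path"] = "/ocean/projects/myproj/dlc-project"      # must match the new folder
auxiliaryfunctions.write_config(config_path, cfg)
```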

What has been frustrating is that running deeplabcut.analyze_videos() on an NVIDIA Tesla V100 Volta 32GB GPU, on a node with 128GB of RAM (all of which is correctly seen by the Singularity image), actually yields fewer iterations per second (~17 it/s) at every batch size I've tried (16, 32, 64, 128) than running on my desktop with a plain old RTX 2070 (~31 it/s). Does anyone have any intuition for why that is the case or what I might be doing wrong? I don't mind tinkering with the code. I feel like there should be some speed advantage here… but maybe I'm mistaken!
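For context, the call is roughly of this form (a sketch with placeholder paths; the inference batch size is the `batch_size` field in config.yaml, and `gputouse` selects the device index shown by nvidia-smi):

```python
import deeplabcut

# Placeholder paths -- substitute the real project and video locations.
config_path = "/path/to/dlc-project/config.yaml"
videos = ["/path/to/videos"]        # a folder or an explicit list of video files

# batch_size for inference is set in config.yaml; larger values only help
# if the GPU itself is the bottleneck (not I/O, CPU preprocessing, or throttling).
deeplabcut.analyze_videos(
    config_path,
    videos,
    videotype=".mp4",
    gputouse=0,        # GPU index as reported by nvidia-smi
    save_as_csv=True,
)
```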

I know I can run Singularity images in parallel with a SLURM job array, and I know I can downsample all of my videos. But each call is taking 2x as long… so I feel like that will always be a real limitation. Are there any other DLC options that improve performance on these fancy graphics cards? (Batch size doesn't seem to do much, or maybe I'm using it wrong.) The videos are currently 1280x512.
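For reference, the per-video split I have in mind for the job array is roughly the following (a sketch with hypothetical paths; each array task analyzes one video, and a wrapper sbatch script runs it inside the Singularity image with `--nv`):

```python
# Sketch of a per-task driver for a SLURM job array (hypothetical paths).
# Launch with e.g. `sbatch --array=0-9 run_dlc.sh`, where run_dlc.sh execs
# this script inside the Singularity image with the --nv flag.
import os
from pathlib import Path

import deeplabcut

config_path = "/path/to/dlc-project/config.yaml"
video_dir = Path("/path/to/videos")

videos = sorted(str(p) for p in video_dir.glob("*.mp4"))
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # set by SLURM for each array task

deeplabcut.analyze_videos(
    config_path,
    [videos[task_id]],   # this task handles exactly one video
    videotype=".mp4",
    gputouse=0,          # each task sees its allocated GPU as device 0
    save_as_csv=True,
)
```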

Thank you for any insights anyone may have!!!

Node info: [screenshot of node specs]


Thanks @Brian_Isett for the nice feedback 🙂 It's nice to hear! So, I don't have experience with the Volta, but I'll dig around. Briefly, comparing it to the top card we use (https://technical.city/en/video/Tesla-V100-PCIe-vs-TITAN-RTX), there are some big differences. It could be related to Volta vs. Turing: https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/


Interesting; I would almost suspect that the GPU installation of TF is not correct. Are you sure the GPU is being used?
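For anyone who wants to check, a quick sanity test inside the container might look like this (DLC 2.1.x runs on TF 1.x, so the classic test utilities apply):

```python
# Quick check that the TensorFlow build inside the image can actually see the GPU.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print(tf.test.is_gpu_available())                          # True if a usable CUDA GPU is found
print([d.name for d in device_lib.list_local_devices()])   # should include '/device:GPU:0'
```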


Hi Brian,

We were having performance issues with our V100 card as well: roughly 3x longer run times using TensorFlow compared to AWS and an NVIDIA RTX 8000. We finally figured out the card wasn't being cooled properly. The temperature readout in nvidia-smi would reach 80C and the GPU would get throttled. Our IT department rigged up a water cooling system for it, and now it is working as expected.
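If it helps anyone, a small monitoring helper along these lines can catch that kind of throttling while a job runs (a sketch; it just polls standard nvidia-smi query fields):

```python
# Sketch: periodically log GPU temperature, utilization, and SM clock via nvidia-smi
# so thermal throttling (e.g. sustained ~80C with dropping clocks) shows up in the job log.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,utilization.gpu,clocks.sm",
    "--format=csv,noheader",
]

def log_gpu(interval_s=30, duration_s=600):
    """Print one line of GPU stats every interval_s seconds for duration_s seconds."""
    t_end = time.time() + duration_s
    while time.time() < t_end:
        out = subprocess.run(QUERY, capture_output=True, text=True)
        print(time.strftime("%H:%M:%S"), out.stdout.strip())
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu()
```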


@MWMathis Of course! You guys are doing amazing things! (I wrote up a detailed protocol for using the DLC docker as a Singularity image on a GPU cluster, if that's at all useful to anyone: https://github.com/KidElectric/dlc_protocol.)

I do wonder about the hardware differences, but to @cwood1967's point: I think the GPUs at PSC have been seeing heavy loads due to COVID-19 modeling (got an e-mail about this recently). I have tried at different times of day, and recently, late at night, I do get 30+ it/s. So I would chalk this up to possible temperature issues at the cluster (I might check this explicitly, though; thank you @cwood1967 for the suggestion). I will check in with the PSC admins about their cooling regime.

@AlexanderMathis yes, the GPU is engaged, but as mentioned above, I think it might be temperature related. Thank you all for the helpful feedback!


This is great and super useful, I will definitely link to the repo! Thank you!
