BatchSize parameter not improving performance

Bit of a strange one. I've noticed that changing the batch size during inference has zero effect on the time it takes to analyse a video.
Please see this comparison of a batch size of 4 vs a batch size of 64. I have tried multiple other batch sizes and resolutions too.

[Attached screenshots: 64_batch, 4_batch]

I wonder if this is due to a bottleneck elsewhere in the inference code?

This is on a multi-animal project, but I have observed the same phenomenon on single-animal projects.

I'll take a deeper look at the code and see if there's anything I can improve for now :slight_smile:

Forgot to mention: I have replicated this issue on both the PyPI package and the most recent git branch.
Keen to hear if anyone else is experiencing this, as if so we may be able to speed up performance drastically :slight_smile: (with a bit of luck!)

Oh, and I believe it's a bottleneck rather than a failure to correctly use the batch size parameter, since increasing the batch size indefinitely eventually leads to a GPU out-of-memory failure.

During inference it does seem that several of the CPU cores are maxed out whilst the GPU is not fully utilised. Do you guys think it would be possible to do the cost calculation steps in TensorFlow rather than NumPy?
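For what it's worth, one way to move CPU-bound array steps onto the GPU without a full TensorFlow rewrite is to write them against a swappable array module, so the same function runs under either NumPy or CuPy. A minimal sketch, where `pairwise_cost` is a made-up stand-in and not the actual DeepLabCut cost code:

```python
import numpy as np

def pairwise_cost(points, xp=np):
    """Toy stand-in for a cost-calculation step: squared pairwise
    distances between detections. `xp` can be numpy or cupy, since
    both expose the same API for these operations."""
    diff = points[:, None, :] - points[None, :, :]  # shape (N, N, 2)
    return xp.sum(diff ** 2, axis=-1)               # shape (N, N)

# CPU path:
pts = np.array([[0.0, 0.0], [3.0, 4.0]])
costs = pairwise_cost(pts, xp=np)
print(costs[0, 1])  # 25.0 (squared distance between the two points)

# Hypothetical GPU path, if CuPy is installed:
#   import cupy as cp
#   costs = pairwise_cost(cp.asarray(pts), xp=cp)
```

The appeal of this pattern is that the arrays never leave the GPU between steps, which is where the NumPy round-trips hurt.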

Using CuPy might be a good option.

cc @AlexanderMathis

I actually tried that yesterday, but there's no CuPy implementation of np.trapz just yet. I think if I could get it running in TensorFlow it'd be ideal. What are your thoughts?