DLC settings for maximum accuracy


As already discussed last December with @MWMathis and @AlexanderMathis, I am using DLC for human pose estimation (golf swing). The aim is to achieve maximum accuracy in detecting the joints of the person, so I am looking for the best settings in config.yaml and pose_cfg.yaml. Currently I use the following:


  • TrainingFraction: 0.8, as in the cheetah project
  • batch_size: 1


  • global_scale: 1
  • init_weights: mpii-single-resnet-101
  • intermediate_supervision: true
  • intermediate_supervision_layer: 12 (might a different layer number give more accurate results?)
  • mirror: true

Does cluster_color=False/True in deeplabcut.extract_frames() have an impact on accuracy? Maybe more information is captured when setting cluster_color=True?

Is there something else to consider when trying to achieve maximum accuracy?

config_MaxAcc.txt (1.6 KB)
pose_cfg_MaxAcc.txt (1.8 KB)

The network parameters that will most change performance are pos_dist_threshold (default is 17), global_scale (default is 0.8), the ResNet depth (i.e. 50 or 101), and crop = True + cropratio = 0.4. The latter is a brand-new augmentation step, which will be better documented when the paper comes out (in press now!), and it works very well (see panel b here). You could increase the cropratio if you want, and keep crop as True (the default). If your images are large, be sure to change the max input size to account for scaling: the scale jitter ranges from 0.5 to 1.25 by default, so if your images are 1000 pixels wide, your max input size needs to be at least 1250.
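To make the parameters above concrete, here is a sketch of how the corresponding entries could look in pose_cfg.yaml. This is illustrative only: the exact key names (in particular the scale-jitter and max-input-size keys) vary between DLC versions, so check the pose_cfg.yaml shipped with your install before copying values.

```yaml
# pose_cfg.yaml excerpt -- illustrative values, verify key names for your DLC version
pos_dist_threshold: 17   # radius (px) within which a detection counts as correct
global_scale: 0.8        # base rescaling applied to every training image
net_type: resnet_101     # network depth: resnet_50 or resnet_101
crop: true               # enable the cropping augmentation step
cropratio: 0.4           # fraction of training images that get randomly cropped
scale_jitter_lo: 0.5     # images are randomly rescaled between these two factors,
scale_jitter_up: 1.25    # so e.g. 1000 px wide frames need max input size >= 1250
max_input_size: 1500
```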

But, the biggest factor will be really good, error free, labels from a diversity of settings (like the cheetah example)! :slight_smile:

Re: your question -> For extracting frames, cluster_color=True also uses color information for the clustering, so if your frames have diverse colors, set this to True to cluster the frames on that feature before extraction.


Thanks so much for answering, @MWMathis, and looking forward to the details of the brand-new augmentation steps :slight_smile:

The info is currently in the code, just not in the readme docs: https://github.com/AlexEMG/DeepLabCut/blob/3b10ea5bbb4cdba6ee6c0cc2481f4058f71fe5da/deeplabcut/pose_cfg.yaml#L35 here you go! The quantification is above, and as you saw with the mice, it’s slightly better even on less challenging data than the cheetahs, but it really helps the challenging applications.


When using extract_frames() with the kmeans option, the console says “Extracting and downsampling…”. DLC only downsamples the images for the kmeans selection and keeps the maximum resolution of the extracted frames, right?

Correct, but of course you can look at the frame pixel size that is extracted vs your video ;). They are the same, unless you use cropping.


Hi @MWMathis & @AlexanderMathis,

I created a big dataset for my golf project with really good, error-free labels :blush: as you mentioned above. My brother helped annotate the dataset too, so I have two sets of annotations of the same images, as described in your Neuroscience paper. I computed the human variability (RMSE of my annotations vs. my brother’s) and it is around 4 pixels, since it is not easy to define where joints are located under loose clothes. I trained one model with my annotations only and another with my brother’s and my annotations combined. To compare, I trained multiple shuffles with different parameter settings, but unsurprisingly the test error of the model trained on my annotations only was always lower. It makes sense to have multiple annotators in order to reduce the effect of human variability, but in practice it decreases the performance of the model. As I want to work in a scientifically correct way, my question to you: would it be sufficient to train the model with my annotations only (as I am the “expert” in golf and in detecting joints under loose clothes ;)) and evaluate on a test set annotated by ~5 different people (to capture human variability)?
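The inter-annotator RMSE mentioned above can be computed directly from the two sets of keypoint coordinates. A minimal sketch (plain Python, stdlib only; the coordinate values are made up for illustration):

```python
import math

def rmse(coords_a, coords_b):
    """Root-mean-square error (in pixels) between two annotators'
    (x, y) labels for the same keypoints on the same frames."""
    assert len(coords_a) == len(coords_b), "annotation sets must align"
    squared = [
        (xa - xb) ** 2 + (ya - yb) ** 2
        for (xa, ya), (xb, yb) in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical example: two annotators labeling the same three joints
annotator_1 = [(100.0, 200.0), (150.0, 250.0), (300.0, 400.0)]
annotator_2 = [(103.0, 198.0), (148.0, 254.0), (297.0, 401.0)]
print(round(rmse(annotator_1, annotator_2), 2))  # prints 3.79
```

In practice you would load both annotators' CollectedData CSV/HDF files and align them by frame and bodypart before computing this.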

Additionally, I split the train and test sets by person, so that the same person is not included in both the train and the test set. Of course the test error would decrease if I included the same person in both. What is the scientifically correct way of splitting the dataset in your eyes?

By the way, the pretrained ResNet-152 performs much worse than the ResNet-101… I wonder why, but I will stick to 101 :slight_smile:

I am looking forward to showing you the work I have done with DLC.

Thank you so much!