Learning rate optimization and suggested lr steps for ADAM

I am using the ADAM optimizer and was wondering if there is a good initial set of multi_step values that goes up to 500k-1M iterations, like the ones the wonderful @AlexanderMathis suggested in this post: Recommended Settings for Tracking Fine Parts. I would like to go beyond 50k iterations but don't yet have the intuition for what learning rates to use as the iteration number increases.

thanks so much!

-mehmet

With a batch size of 8, you really don't want to go much beyond 50k-100k iterations (that's effectively 400k-800k already). If you want to do this, then just change the 50,000 to your desired number, but again, be aware of overfitting…
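(For concreteness, a minimal sketch of "change the 50,000", assuming the schedule's last multi_step entry ends at 50,000 iterations and that path_config_file, shuffle, cfg, and trainingsetindex are set up as in the snippet later in this thread:)

import deeplabcut

# locate and read the train pose_cfg.yaml for this shuffle
trainposeconfigfile, testposeconfigfile, snapshotfolder = deeplabcut.return_train_network_path(path_config_file, shuffle=shuffle, trainFraction=cfg["TrainingFraction"][trainingsetindex])
cfg_dlc = deeplabcut.auxiliaryfunctions.read_plainconfig(trainposeconfigfile)

# assuming the last entry is [learning_rate, 50000]: replace 50000 with the
# desired total iteration count (e.g. 100000) -- but beware of overfitting
cfg_dlc['multi_step'][-1][1] = 100000

deeplabcut.auxiliaryfunctions.write_plainconfig(trainposeconfigfile, cfg_dlc)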


Thank you! Unfortunately, my GPU can't handle a batch_size greater than 1, even though it's pretty decent (a 2080 Super). That's why I want to increase the number of iterations.


Ah, then if your images are very large, you might consider two things: (1) make global_scale smaller (this downsamples the images), and (2) pass allow_growth=True when you run training with deeplabcut.train_network. Hope that also helps! And if it's batch size 4, then just double the iteration number for each learning rate step in the schedule @AlexanderMathis suggests; that is a good place to start.
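(Again just a sketch, assuming allow_growth is the keyword accepted by deeplabcut.train_network and that trainposeconfigfile, path_config_file, and shuffle are defined as in the snippet below:)

# downsample the input images by shrinking global_scale in the train pose_cfg.yaml
cfg_dlc = deeplabcut.auxiliaryfunctions.read_plainconfig(trainposeconfigfile)
cfg_dlc['global_scale'] = 0.5  # example value; smaller = more downsampling, tune to your image size
deeplabcut.auxiliaryfunctions.write_plainconfig(trainposeconfigfile, cfg_dlc)

# let TensorFlow allocate GPU memory on demand rather than grabbing it all at once
deeplabcut.train_network(path_config_file, shuffle=shuffle, allow_growth=True)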

You might also decrease the learning rate slightly when decreasing the batch size. (Although, as the post below says, for Adam you can often just keep the rate.)

See this great post:

" Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance in the gradient expectation constant. See page 5 at A. Krizhevsky. One weird trick for parallelizing convolutional neural networks : https://arxiv.org/abs/1404.5997

However, recent experiments with large mini-batches suggest for a simpler linear scaling rule, i.e multiply your learning rate by k when using mini-batch size of kN. See P.Goyal et al.: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour https://arxiv.org/abs/1706.02677

I would say that with using Adam, Adagrad and other adaptive optimizers, learning rate may remain the same if batch size does not change substantially."
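(A quick back-of-the-envelope illustration of those two rules, with made-up numbers, for going from batch size 8 down to 4:)

import math

lr_batch8 = 1e-4          # hypothetical learning rate tuned for batch size 8
k = 4 / 8                 # batch-size multiplier when dropping to batch size 4

lr_sqrt = lr_batch8 * math.sqrt(k)   # sqrt(k) rule (Krizhevsky): ~7.1e-5
lr_linear = lr_batch8 * k            # linear rule (Goyal et al.): 5e-5
lr_adam = lr_batch8                  # Adam & co.: often fine to keep it unchanged

print(lr_sqrt, lr_linear, lr_adam)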

As for the number, you probably noticed that the schedule I shared will automatically stop at 50k.

(and for 1/2 the rate:)
just change the limit: cfg_dlc['multi_step'] = [[5e-5, 7500], [1e-5, 12000], [1e-6, 10000]]

i.e.

trainposeconfigfile, testposeconfigfile, snapshotfolder = deeplabcut.return_train_network_path(path_config_file, shuffle=shuffle, trainFraction=cfg["TrainingFraction"][trainingsetindex])
cfg_dlc=deeplabcut.auxiliaryfunctions.read_plainconfig(trainposeconfigfile)

cfg_dlc['scale_jitter_lo'] = 0.5   # random rescaling range for augmentation
cfg_dlc['scale_jitter_up'] = 1.5

cfg_dlc['augmentationprobability'] = .5
cfg_dlc['batch_size'] = 4          # pick this as large as your GPU can handle
cfg_dlc['elastic_transform'] = True
cfg_dlc['rotation'] = 180          # max rotation (degrees) for augmentation
cfg_dlc['covering'] = True         # randomly occludes image patches
cfg_dlc['motion_blur'] = True
cfg_dlc['optimizer'] = "adam"
cfg_dlc['dataset_type'] = 'imgaug'
cfg_dlc['multi_step'] = [[5e-5, 7500], [1e-5, 12000], [1e-6, 10000]]  # [learning rate, until iteration]; the last limit is what you change to train longer

deeplabcut.auxiliaryfunctions.write_plainconfig(trainposeconfigfile, cfg_dlc)

print("TRAIN NETWORK", shuffle)
deeplabcut.train_network(path_config_file, shuffle=shuffle, saveiters=5000, displayiters=500, max_snapshots_to_keep=11)

Thank you @AlexanderMathis @MWMathis for the suggestions. I am excited to test them.

Best,
Mehmet
