I have been using DeepLabCut (DLC) on my laptop and it works perfectly. However, training takes a very long time, so I started running it on a cluster. At first it seemed to train correctly, but I didn’t get the output files I expected. I then upgraded DLC and am currently using “DeepLabCut/22.214.171.124-Anaconda3-GPU”. With this new version, training does not work, and I don’t understand why.
With the old DLC version, I set it to run for 200,000 iterations, which took approximately 4 hours. It ran for 200,000 iterations because I called `deeplabcut.train_network(config_path, shuffle=1, trainingsetindex=0, gputouse=None, displayiters=1000, saveiters=20000, maxiters=200000)` (as written in the Nature Protocols paper). However, as I said above, I didn’t get the output files I expected.
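For reference, this is roughly how the call looks when wrapped in a script for the cluster. It is only a minimal sketch: the keyword arguments are the ones I typed above, while the names `TRAIN_KWARGS` and `run_training` and the idea of passing the config path on the command line are just illustrative assumptions, not something taken from the DLC documentation.

```python
import sys

# Training arguments exactly as typed in the post above.
TRAIN_KWARGS = dict(
    shuffle=1,
    trainingsetindex=0,
    gputouse=None,      # None: no GPU index is passed explicitly
    displayiters=1000,  # print the loss every 1,000 iterations
    saveiters=20000,    # write a snapshot every 20,000 iterations
    maxiters=200000,    # stop after 200,000 iterations
)


def run_training(config_path):
    """Call deeplabcut.train_network with the arguments above (hypothetical wrapper)."""
    # Imported here so this file can be loaded on a machine without DLC installed.
    import deeplabcut

    deeplabcut.train_network(config_path, **TRAIN_KWARGS)


# Assumption: the path to config.yaml is given as the first command-line argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    run_training(sys.argv[1])
```

With `saveiters=20000` and `maxiters=200000`, I would expect a snapshot to be written every 20,000 iterations, which is part of why the missing files confuse me.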
With the upgraded version, it runs no iterations and reports that training was stopped due to the time limit. The cluster I’m using has a time limit of 12 hours. But this error makes no sense to me, because 200,000 iterations usually take about 4 hours. The output is below:
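To put numbers on why the time limit surprises me, here is the arithmetic as a small sketch. It assumes the upgraded version would train at about the same speed as the old one, which I have not been able to confirm; the variable names are just for illustration.

```python
# Figures from the old DLC version on the cluster.
old_iterations = 200_000
old_hours = 4.0

# Time limit of the cluster queue.
limit_hours = 12.0

# Average training speed of the old run, in iterations per second.
rate = old_iterations / (old_hours * 3600)

# Iterations that should fit into the time limit at that speed (assumption: same speed).
iters_in_limit = old_iterations * limit_hours / old_hours

print(f"old speed: {rate:.1f} iterations/s")
print(f"iterations possible in {limit_hours:.0f} h: {iters_in_limit:.0f}")
```

At that speed the 12-hour limit should allow about three times the 200,000 iterations I asked for, yet the log shows none at all.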
**WARNING:tensorflow:From /opt/sci-soft/software/DeepLabCut/126.96.36.199-Anaconda3-GPU/lib/python3.6/site-packages/tensorflow/python/framework/op_def_l$**
**Instructions for updating:**
**Colocations handled automatically by placer.**
**WARNING:tensorflow:From /opt/sci-soft/software/DeepLabCut/188.8.131.52-Anaconda3-GPU/lib/python3.6/site-packages/tensorflow/python/ops/losses/losses_$**
**Instructions for updating:**
**Use tf.cast instead.**
**WARNING:tensorflow:From /opt/sci-soft/software/DeepLabCut/184.108.40.206-Anaconda3-GPU/lib/python3.6/site-packages/tensorflow/python/training/saver.py:$**
**Instructions for updating:**
**Use standard file APIs to check for files with this prefix.**
**DLC loaded in light mode; you cannot use any GUI (labeling, relabeling and standalone GUI)**
**Switching batchsize to 1, as default/tensorpack/deterministic loaders do not support batches >1. Use imgaug loader.**
**Starting with standard pose-dataset loader.**
**Initializing ResNet**
**Loading ImageNet-pretrained resnet_50**
**slurmstepd: error: *** JOB 1867327 ON gpuceib01 CANCELLED AT 2020-03-11T06:26:16 DUE TO TIME LIMIT *****
I don’t really understand what is happening, as I don’t have much of a computing background. Maybe it’s something really simple, but I can’t see where the mistake is. I would really appreciate your help.