DeepLabCut 2.1.6.2 training not working on a cluster

I have been using DeepLabCut (DLC) on my laptop and it works perfectly. However, training takes a long time, so I started running it on a cluster. At first it seemed to be training correctly, but I didn’t get the files I was supposed to. I then upgraded DLC and am currently using “DeepLabCut/2.1.6.2-Anaconda3-GPU”. With this new version, training is not working, and I don’t really understand why.

When I was using the old DLC version, I set it to run for 200 000 iterations and it took approximately 4 hours. It ran for 200 000 iterations because I was calling: `deeplabcut.train_network(config_path, shuffle=1, trainingsetindex=0, gputouse=None, displayiters=1000, saveiters=20000, maxiters=200000)` (as written in the Nature Protocols paper). However, as I said before, I didn’t get the files I was supposed to.

Now, with the upgraded version, it performs no iterations and reports that training was stopped due to the time limit. The cluster I’m using has a time limit of 12 hours. But this error makes no sense, because 200 000 iterations usually take about 4 hours. Here is the output:

**WARNING:tensorflow:From /opt/sci-soft/software/DeepLabCut/2.1.6.2-Anaconda3-GPU/lib/python3.6/site-packages/tensorflow/python/framework/op_def_l$**
**Instructions for updating:**
**Colocations handled automatically by placer.**
**WARNING:tensorflow:From /opt/sci-soft/software/DeepLabCut/2.1.6.2-Anaconda3-GPU/lib/python3.6/site-packages/tensorflow/python/ops/losses/losses_$**
**Instructions for updating:**
**Use tf.cast instead.**
**WARNING:tensorflow:From /opt/sci-soft/software/DeepLabCut/2.1.6.2-Anaconda3-GPU/lib/python3.6/site-packages/tensorflow/python/training/saver.py:$**
**Instructions for updating:**
**Use standard file APIs to check for files with this prefix.**
**DLC loaded in light mode; you cannot use any GUI (labeling, relabeling and standalone GUI)**
**Switching batchsize to 1, as default/tensorpack/deterministic loaders do not support batches >1. Use imgaug loader.**
**Starting with standard pose-dataset loader.**
**Initializing ResNet**
**Loading ImageNet-pretrained resnet_50**
**slurmstepd: error: *** JOB 1867327 ON gpuceib01 CANCELLED AT 2020-03-11T06:26:16 DUE TO TIME LIMIT *****

I don’t really understand what is happening, as I don’t have much computing knowledge. Maybe it’s really simple, but I can’t see where the mistake is. I would really appreciate your help.

Hi Anna,

Welcome to the forum. That all looks fine, i.e. the model loaded, etc. Do you know whether it was really 12 hours between:

**Loading ImageNet-pretrained resnet_50**
and
**slurmstepd: error: *** JOB 1867327 ON gpuceib01 CANCELLED AT 2020-03-11T06:26:16 DUE TO TIME LIMIT *****

I would ask the IT service that runs the cluster to make sure you have admin rights, and then check the elapsed time to be sure.
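One way to check the timing yourself is to have the job script print its own timestamps. Here is a minimal sketch, assuming you launch training from a Python script; `timed_stage` is a hypothetical helper name, not part of DeepLabCut:

```python
import time
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed_stage(name):
    """Print wall-clock timestamps when a stage starts and ends."""
    start = time.monotonic()
    # flush=True so the line reaches the job's output file immediately
    print(f"[{datetime.now().isoformat(timespec='seconds')}] START {name}", flush=True)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        print(
            f"[{datetime.now().isoformat(timespec='seconds')}] END {name} "
            f"after {elapsed:.1f} s",
            flush=True,
        )


# Usage on the cluster (assumes deeplabcut is importable and config_path is set):
# with timed_stage("train_network"):
#     deeplabcut.train_network(config_path)
```

If the scheduler kills the job, the END line may never be printed, but the START line is enough: compare it with the `CANCELLED AT` timestamp in the slurmstepd message to see how long the job really ran.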

Also, don’t pass `gputouse=None`, because then of course it does not use a GPU. Those are all optional arguments one can pass. The simplest call is just `deeplabcut.train_network(config_path)`
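If you want to keep the iteration settings from your original call, a rough sketch could look like this. The keyword values are copied from the call in the question with `gputouse` left out; `run_training` and `TRAIN_KWARGS` are hypothetical names used only for illustration:

```python
# Iteration settings copied from the call in the question; gputouse is omitted.
TRAIN_KWARGS = {
    "shuffle": 1,
    "trainingsetindex": 0,
    "displayiters": 1000,
    "saveiters": 20000,
    "maxiters": 200000,
}


def run_training(train_fn, config_path, **overrides):
    """Call train_fn with the settings above; overrides replace individual values."""
    kwargs = {**TRAIN_KWARGS, **overrides}
    return train_fn(config_path, **kwargs)


# On the cluster (assumes deeplabcut is importable; the path is a placeholder):
# import deeplabcut
# run_training(deeplabcut.train_network, "/path/to/config.yaml")
```

Passing the training function in as an argument just makes it possible to try the wrapper on a machine without DeepLabCut installed; calling `deeplabcut.train_network` directly with the same keywords does the same thing.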

For every function, you can see what is required and what is optional in the “docstring”

Our repo docs have these, and you can type `deeplabcut.train_network?` and it shows it in the terminal too.
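Note that the `?` syntax only works in IPython or Jupyter; in a plain Python session `help(deeplabcut.train_network)` prints the same docstring. If you want the required and optional arguments as a list, something like the sketch below should work. `split_params` is a hypothetical helper, and `example_fn` is only a stand-in whose parameters and defaults are placeholders, not DeepLabCut's real signature:

```python
import inspect


def split_params(func):
    """Return (required_names, optional_defaults) for a function's parameters."""
    required, optional = [], {}
    for name, param in inspect.signature(func).parameters.items():
        # Skip *args / **kwargs, which are neither required nor defaulted
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            optional[name] = param.default
    return required, optional


# Stand-in function for demonstration only (placeholder defaults).
def example_fn(config_path, shuffle=1, maxiters=None):
    """Stand-in docstring."""


required, optional = split_params(example_fn)
print("required:", required)
print("optional:", optional)

# With DeepLabCut installed, the same helper can be pointed at the real function:
# import deeplabcut
# print(split_params(deeplabcut.train_network))
```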
