Training on university cluster never starts after loading pose.cfg

I am trying to run a retraining on a university GPU cluster and the training keeps stalling after loading the pose.cfg file. Here is what the log looks like:

2019-12-02 14:54:14 Config:
{'all_joints': [[0],
'all_joints_names': ['snout',
'batch_size': 1,
'bottomheight': 400,
'crop': True,
'crop_pad': 0,
'cropratio': 0.4,
'dataset': 'training-datasets/iteration-21/UnaugmentedDataSet_lineartrack-bottomupdateNov11/lineartrack-bottomupdate_VC95shuffle1.mat',
'dataset_type': 'default',
'deterministic': False,
'display_iters': 1000,
'fg_fraction': 0.25,
'global_scale': 0.8,
'init_weights': '/tigresss/vcorbit/DLC/lineartrack-bottomupdate-VC-2019-11-11/dlc-models/iteration-20/lineartrack-bottomupdateNov11-trainset95shuffle1/train/snapshot-1030000',
'intermediate_supervision': False,
'intermediate_supervision_layer': 12,
'leftwidth': 400,
'location_refinement': True,
'locref_huber_loss': True,
'locref_loss_weight': 0.05,
'locref_stdev': 7.2801,
'log_dir': 'log',
'max_input_size': 1500,
'mean_pixel': [123.68, 116.779, 103.939],
'metadataset': 'training-datasets/iteration-21/UnaugmentedDataSet_lineartrack-bottomupdateNov11/Documentation_data-lineartrack-bottomupdate_95shuffle1.pickle',
'min_input_size': 64,
'minsize': 100,
'mirror': False,
'multi_step': [[0.005, 10000],
[0.02, 430000],
[0.002, 730000],
[0.001, 1030000]],
'net_type': 'resnet_50',
'num_joints': 13,
'optimizer': 'sgd',
'pos_dist_thresh': 17,
'project_path': '/tigress/vcorbit/DLC/lineartrack-bottomupdate-VC-2019-11-11',
'regularize': False,
'rightwidth': 400,
'save_iters': 5000,
'scale_jitter_lo': 0.5,
'scale_jitter_up': 1.25,
'scoremap_dir': 'test',
'shuffle': True,
'snapshot_prefix': '/tigress/vcorbit/DLC/lineartrack-bottomupdate-VC-2019-11-11/dlc-models/iteration-21/lineartrack-bottomupdateNov11-trainset95shuffle1/train/snapshot',
'stride': 8.0,
'topheight': 400,
'weigh_negatives': False,
'weigh_only_present_joints': False,
'weigh_part_predictions': False,
'weight_decay': 0.0001}

My cluster output log also says this:
DLC loaded in light mode; you cannot use the relabeling GUI!
DLC loaded in light mode; you cannot use the labeling GUI!
Switching batchsize to 1, as default/tensorpack/deterministic loaders do not support batches >1. Use imgaug loader.
Starting with standard pose-dataset loader.
Initializing ResNet
Loading already trained DLC with backbone: resnet_50

I had started training this exact network once before but forgot to change the init weights, so I stopped that run (which had started successfully), recreated the training dataset from scratch, and updated init_weights in the pose.cfg file. Now training does not start, so I’m wondering whether something is wrong with the way I’m entering the init weights. The previous retraining iteration of this network trained on the cluster just fine, though.
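A quick way to rule out a path problem is to check, from the cluster node itself, whether the init_weights prefix resolves to any files. The sketch below is only an illustration: it assumes the snapshot is a standard TensorFlow checkpoint, which is stored as several files sharing one prefix (for example `<prefix>.index`, `<prefix>.meta`, and `<prefix>.data-...`). The helper name `checkpoint_files` is hypothetical and not part of DLC.

```python
import sys
from pathlib import Path


def checkpoint_files(init_weights):
    """Return the sorted names of files on disk that share the snapshot prefix.

    `init_weights` is the value from pose.cfg, i.e. a prefix such as
    ".../train/snapshot-1030000", not a file name with an extension.
    An empty list means the directory is missing or holds no matching files.
    """
    prefix = Path(init_weights)
    if not prefix.parent.is_dir():
        return []
    # Match "<prefix>." so that snapshot-10 does not also match snapshot-100.
    wanted = prefix.name + "."
    return sorted(p.name for p in prefix.parent.iterdir() if p.name.startswith(wanted))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed invocation): python check_weights.py <init_weights value>
    found = checkpoint_files(sys.argv[1])
    print(found if found else "No checkpoint files found for this prefix")
```

Running this with the exact string from the `init_weights` entry, on the same machine that runs the training job, shows whether that prefix points at real files there. An empty result would suggest a path or mount problem rather than a problem with the network itself.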

However, another network that also never started on the university cluster is running perfectly fine on my local PC with a GPU.

So I’m not sure what’s going on and there aren’t any error messages being thrown… any advice would be appreciated!