CellProfiler batch jobs failing with LSF

Hi everyone,

I’m a Unix Systems Administrator at the Whitehead Institute, and I’m trying to get CellProfiler batch mode working with our LSF cluster environment. I’m running into problems when splitting a batch into individual jobs.

Here is my generic bash script that reads in Batch_data.h5, determines the number of images, and breaks it up into multiple jobs:

#! /bin/bash

FILENAME=$1

# Get the directory where <your_pipeline.h5> lives
DEST=$(dirname "$FILENAME")

# Get the number of images we're using from the .h5 file
TOTALIMG=$(h5dump "$FILENAME" | grep -c 'file:')

# Images per job (integer division). With fewer than 15 images this is 0,
# so fall back to one image per job; otherwise submit 16 jobs.
NEXT=$((TOTALIMG/15))
if [ "$NEXT" -eq 0 ]; then
        NEXT=1
        LOOPTIME=$TOTALIMG
else
        LOOPTIME=16
fi

# Loop through and submit our jobs
HOLD=0
for x in $(seq "$LOOPTIME"); do
        OLDHOLD=$(($HOLD+1))
        HOLD=$(($HOLD+$NEXT))

        echo "bsub -q idle -u dennisr CellProfiler.py -p $FILENAME -c -b -f $OLDHOLD -l $HOLD $DEST/DefaultOUT$x.mat"
        bsub -q idle -u dennisr "CellProfiler.py -p $FILENAME -c -b -f $OLDHOLD -l $HOLD $DEST/DefaultOUT$x.mat"
done

Some of the jobs finish successfully; others do not. Oddly, the error message differs between the jobs that fail.

Here’s some failed job output:

Exited with exit code 1.

The output (if any) follows:

/usr/local/lib/python2.6/dist-packages/pycrypto-2.6-py2.6-linux-x86_64.egg/Crypto/Util/number.py:57: PowmInsecureWarning: Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.
  _warn("Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.", PowmInsecureWarning)
Version: 2012-10-09T20:38:44 Unknown rev. / 20121009203844
Pipeline saved with CellProfiler version 20120927181948
Times reported are CPU times for each module, not wall-clock time
Uncaught exception in CellProfiler.py
Traceback (most recent call last):
  File "/usr/local/share/cellprofiler/CellProfiler.py", line 478, in <module>
    pipeline.save_measurements(args[0], measurements)
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 903, in save_measurements
    add_all_measurements(handles, measurements)
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 238, in add_all_measurements
    max_image_number = np.max(image_numbers)
  File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.2-py2.6-linux-x86_64.egg/numpy/core/fromnumeric.py", line 1833, in amax
    return amax(axis, out)
ValueError: zero-size array to maximum.reduce without identity
Traceback (most recent call last):
  File "/usr/local/share/cellprofiler/CellProfiler.py", line 478, in <module>
    pipeline.save_measurements(args[0], measurements)
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 903, in save_measurements
    add_all_measurements(handles, measurements)
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 238, in add_all_measurements
    max_image_number = np.max(image_numbers)
  File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.2-py2.6-linux-x86_64.egg/numpy/core/fromnumeric.py", line 1833, in amax
    return amax(axis, out)
ValueError: zero-size array to maximum.reduce without identity
Exited with exit code 1.

The output (if any) follows:

/usr/local/lib/python2.6/dist-packages/pycrypto-2.6-py2.6-linux-x86_64.egg/Crypto/Util/number.py:57: PowmInsecureWarning: Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.
  _warn("Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.", PowmInsecureWarning)
Version: 2012-10-09T20:48:12 Unknown rev. / 20121009204812
Pipeline saved with CellProfiler version 20120927181948
Times reported are CPU times for each module, not wall-clock time
Tue Oct  9 16:48:15 2012: Image # 1, module LoadImages # 1: 7.15 sec (bg)
Tue Oct  9 16:48:18 2012: Image # 1, module IdentifyPrimaryObjects # 2: 2.10 sec (bg)
Tue Oct  9 16:48:21 2012: Image # 1, module ImageMath # 3: 0.52 sec
Tue Oct  9 16:48:21 2012: Image # 1, module IdentifySecondaryObjects # 4: 0.77 sec
Tue Oct  9 16:48:22 2012: Image # 1, module IdentifyTertiaryObjects # 5: 0.12 sec
Tue Oct  9 16:48:22 2012: Image # 1, module MeasureObjectIntensity # 6: 0.63 sec (bg)
Tue Oct  9 16:48:23 2012: Image # 1, module MeasureObjectNeighbors # 7: 0.64 sec (bg)
Tue Oct  9 16:48:24 2012: Image # 1, module MeasureObjectSizeShape # 8: 6.14 sec (bg)
Tue Oct  9 16:48:30 2012: Image # 1, module MeasureTexture # 9: 10.66 sec (bg)
Tue Oct  9 16:48:41 2012: Image # 1, module SaveImages # 10: 0.02 sec
Tue Oct  9 16:48:41 2012: Image # 1, module SaveImages # 11: 0.02 sec
Tue Oct  9 16:48:41 2012: Image # 1, module SaveImages # 12: 0.02 sec
Tue Oct  9 16:48:41 2012: Image # 1, module SaveImages # 13: 0.02 sec
Tue Oct  9 16:48:42 2012: Image # 1, module SaveImages # 14: 0.01 sec
Tue Oct  9 16:48:42 2012: Image # 1, module CreateBatchFiles # 15: 0.00 sec (bg)
Tue Oct  9 16:48:42 2012: Image # 2, module LoadImages # 1: 0.77 sec (bg)
Tue Oct  9 16:48:43 2012: Image # 2, module IdentifyPrimaryObjects # 2: 1.87 sec (bg)
Tue Oct  9 16:48:45 2012: Image # 2, module ImageMath # 3: 0.52 sec
Tue Oct  9 16:48:45 2012: Image # 2, module IdentifySecondaryObjects # 4: 0.85 sec
Tue Oct  9 16:48:46 2012: Image # 2, module IdentifyTertiaryObjects # 5: 0.12 sec
Tue Oct  9 16:48:47 2012: Image # 2, module MeasureObjectIntensity # 6: 0.62 sec (bg)
Tue Oct  9 16:48:47 2012: Image # 2, module MeasureObjectNeighbors # 7: 0.65 sec (bg)
Error detected during run of module MeasureObjectSizeShape
Traceback (most recent call last):
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 1306, in run_with_yield
    module.run(workspace)
  File "/usr/local/share/cellprofiler/cellprofiler/modules/measureobjectsizeshape.py", line 290, in run
    self.run_on_objects(object_group.name.value, workspace)
  File "/usr/local/share/cellprofiler/cellprofiler/modules/measureobjectsizeshape.py", line 399, in run_on_objects
    self.record_measurement(workspace, object_name, f, m)
  File "/usr/local/share/cellprofiler/cellprofiler/modules/measureobjectsizeshape.py", line 455, in record_measurement
    data)
  File "/usr/local/share/cellprofiler/cellprofiler/workspace.py", line 143, in add_measurement
    self.measurements.add_measurement(object_name, feature_name, data)
  File "/usr/local/share/cellprofiler/cellprofiler/measurements.py", line 476, in add_measurement
    self.hdf5_dict[object_name, feature_name, image_set_number] = data
  File "/usr/local/share/cellprofiler/cellprofiler/utilities/hdf5_dict.py", line 238, in __setitem__
    dest = self.find_index_or_slice(idxs, val)
  File "/usr/local/share/cellprofiler/cellprofiler/utilities/hdf5_dict.py", line 328, in find_index_or_slice
    feature_group = self.top_group.require_group(object_name).require_group(feature_name)
  File "/usr/local/lib/python2.6/dist-packages/h5py-2.0.1-py2.6-linux-x86_64.egg/h5py/_hl/group.py", line 115, in require_group
    grp = self[name]
  File "/usr/local/lib/python2.6/dist-packages/h5py-2.0.1-py2.6-linux-x86_64.egg/h5py/_hl/group.py", line 127, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._shared.lapl)
  File "h5o.pyx", line 178, in h5py.h5o.open (h5py/h5o.c:2751)
KeyError: "unable to open object (Symbol table: Can't open object)"
Tue Oct  9 16:48:48 2012: Image # 2, module MeasureObjectSizeShape # 8: 2.26 sec (bg)
Uncaught exception in CellProfiler.py
Traceback (most recent call last):
  File "/usr/local/share/cellprofiler/CellProfiler.py", line 464, in <module>
    initial_measurements = initial_measurements)
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 1149, in run
    initial_measurements = measurements):
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 1395, in run_with_yield
    measurements.flush()
  File "/usr/local/share/cellprofiler/cellprofiler/measurements.py", line 190, in flush
    self.hdf5_dict.flush()
  File "/usr/local/share/cellprofiler/cellprofiler/utilities/hdf5_dict.py", line 183, in flush
    self.hdf5_file.flush()
  File "/usr/local/lib/python2.6/dist-packages/h5py-2.0.1-py2.6-linux-x86_64.egg/h5py/_hl/files.py", line 167, in flush
    h5f.flush(self.fid)
  File "h5f.pyx", line 105, in h5py.h5f.flush (h5py/h5f.c:1876)
RuntimeError: unable to flush file's cached information (File accessability: Unable to flush data from cache)
Traceback (most recent call last):
  File "/usr/local/share/cellprofiler/CellProfiler.py", line 464, in <module>
    initial_measurements = initial_measurements)
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 1149, in run
    initial_measurements = measurements):
  File "/usr/local/share/cellprofiler/cellprofiler/pipeline.py", line 1395, in run_with_yield
    measurements.flush()
  File "/usr/local/share/cellprofiler/cellprofiler/measurements.py", line 190, in flush
    self.hdf5_dict.flush()
  File "/usr/local/share/cellprofiler/cellprofiler/utilities/hdf5_dict.py", line 183, in flush
    self.hdf5_file.flush()
  File "/usr/local/lib/python2.6/dist-packages/h5py-2.0.1-py2.6-linux-x86_64.egg/h5py/_hl/files.py", line 167, in flush
    h5f.flush(self.fid)
  File "h5f.pyx", line 105, in h5py.h5f.flush (h5py/h5f.c:1876)
RuntimeError: unable to flush file's cached information (File accessability: Unable to flush data from cache)

All of our cluster nodes are exact clones of each other, and I’ve already ruled out any discrepancies between machines that might cause the error.

If I run CellProfiler on all of the images in a single job, it completes fine: CellProfiler.py -p Batch_data.h5 -c -b -f 1 -l 48 ./DefaultOUT.mat

Anyone have any ideas for me?

Is my math wrong in breaking up the batch jobs? Is there some sort of file locking that is preventing the other jobs from running?
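
On the math question: with 48 images the script computes NEXT=3 and submits 16 jobs of 3 images each, which covers images 1 through 48 exactly, so the arithmetic is probably not what breaks these particular runs. For counts that do not divide evenly, though, it can skip trailing images (50 images still yields 16 jobs ending at image 48) or ask for image numbers past the end (30 images yields a last job of 31 to 32). Below is a minimal sketch of a chunking helper that avoids both cases by using ceiling division and clamping the last range. The function name chunk_ranges and the output file naming in the commented usage are illustrative, not part of CellProfiler or LSF.

```shell
#!/bin/bash

# Print one "first last" pair per line, covering images 1..total in at most
# max_jobs contiguous chunks. (chunk_ranges is an illustrative name.)
chunk_ranges() {
    local total=$1 max_jobs=$2
    # Ceiling division so that no trailing images are left over
    local per_job=$(( (total + max_jobs - 1) / max_jobs ))
    local first=1 last
    while [ "$first" -le "$total" ]; do
        last=$(( first + per_job - 1 ))
        # Clamp the final chunk to the last image
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first $last"
        first=$(( last + 1 ))
    done
}

# Example: 50 images spread over at most 16 jobs
chunk_ranges 50 16

# Usage sketch (commented out so the snippet runs without LSF installed):
# chunk_ranges "$TOTALIMG" 16 | while read -r first last; do
#     bsub -q idle -u dennisr "CellProfiler.py -p $FILENAME -c -b -f $first -l $last $DEST/DefaultOUT_${first}_${last}.mat"
# done
```

Naming the output after the image range rather than the loop counter makes it easier to match a failed job to the images it was given.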

I see you’re using the trunk build. I tried a technique in h5py, the Python HDF5 library, that worked on the PC but failed on the Mac. Perhaps the same issue is occurring for you: I made the fix on 9/21, and it looks like you are using a version that’s tagged from the day before. As a first step, I’d recommend pulling from the master branch again and rerunning.
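
If the failures persist after updating, one way to probe the file-locking question is to give every job a private copy of the batch file, so that no two jobs open the same .h5 file. Whether this build of CellProfiler actually writes to the file passed with -p is an assumption here, not something confirmed in this thread; the HDF5 flush errors in the job output only hint at it. A rough sketch, with private_batch_copy as a hypothetical helper name:

```shell
#!/bin/bash

# Copy the batch file to a per-job path next to the original and print the
# new path. (private_batch_copy is a hypothetical helper, not a CellProfiler tool.)
private_batch_copy() {
    local src=$1 job_id=$2
    local dir base copy
    dir=$(dirname "$src")
    base=$(basename "$src" .h5)
    copy="$dir/${base}_job${job_id}.h5"
    # Only print the path if the copy succeeded
    cp "$src" "$copy" && echo "$copy"
}

# Usage sketch inside the submit loop (commented out so the snippet runs
# without LSF installed):
# JOBFILE=$(private_batch_copy "$FILENAME" "$x")
# bsub -q idle -u dennisr "CellProfiler.py -p $JOBFILE -c -b -f $OLDHOLD -l $HOLD $DEST/DefaultOUT$x.mat"
```

If the jobs all succeed with private copies but fail when sharing one file, that points toward concurrent access rather than the range arithmetic.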