Headless mode hangs without error


I’m trying to run a pipeline in CellProfiler headless, but it’s getting stuck before the LoadData module. I’ve run this build of CellProfiler headless without any issues in the past, and it still runs those same pipelines successfully, so I think something might be amiss with this particular pipeline.

It sits at INFO:PipelineStatistics:Pipeline saved with CellProfiler version 20150608183859 indefinitely. Running with the DEBUG log level gives:

DEBUG:cellprofiler.utilities.cpjvm:JVM will be started with AWT in headless mode
DEBUG:javabridge.jutil:Creating JVM object
DEBUG:javabridge.jutil:Signalling caller
DEBUG:cellprofiler.utilities.cpjvm:Enabled Bio-formats directory cacheing
INFO:PipelineStatistics:Pipeline saved with CellProfiler version 20150608183859
DEBUG:cellprofiler.measurement:Created temporary file /local/8706821.1.igmm_long/Cpmeasurements0X_c19.hdf5
DEBUG:cellprofiler.measurement:/exports/eddie3_homes_local/s1027820/virtualenv-1.10/myVE/bin/cellprofiler: (9 <module>): load_entry_point('CellProfiler==2.4.0rc1', 'console_scripts', 'cellprofiler')()
DEBUG:cellprofiler.measurement:/gpfs/igmmfs01/datastore/Drug-Discovery/scott/CellProfiler/cellprofiler/__main__.py: (146 main): run_pipeline_headless(options, args)
DEBUG:cellprofiler.measurement:/gpfs/igmmfs01/datastore/Drug-Discovery/scott/CellProfiler/cellprofiler/__main__.py: (683 run_pipeline_headless): initial_measurements=initial_measurements
DEBUG:cellprofiler.measurement:/gpfs/igmmfs01/datastore/Drug-Discovery/scott/CellProfiler/cellprofiler/pipeline.py: (1677 run): copy=initial_measurements)
DEBUG:cellprofiler.measurement:/gpfs/igmmfs01/datastore/Drug-Discovery/scott/CellProfiler/cellprofiler/measurement.py: (271 __init__): for frame in traceback.extract_stack():
DEBUG:cellprofiler.utilities.hdf5_dict:HDF5Dict.__init__(): /local/8706821.1.igmm_long/Cpmeasurements0X_c19.hdf5, temporary=True, copy=None, mode=w
DEBUG:cellprofiler.utilities.hdf5_dict:HDF5Dict.flush(): /local/8706821.1.igmm_long/Cpmeasurements0X_c19.hdf5, temporary=True

I’m running the command:

cellprofiler -r -c -p /exports/eddie/scratch/s1027820/JD_FAK/WT_test.cppipe --data-file=load_data_merged.csv -f 1 -l 20 -o /exports/eddie/scratch/s1027820/JD_FAK_1

Any help would be great, as I’m pretty stumped.

Solved my own problem. For the record:

It looks like LoadData doesn’t cope well with large numbers of images, or else it takes several hours for cellprofiler.utilities.hdf5_dict:HDF5Dict.flush() to do whatever it is it’s doing for 400,000 rows. Taking only the first 20 rows of the LoadData input runs fine. I’m guessing this is the reason CreateBatchFiles exists…
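In case it helps anyone else, here’s a sketch of the smoke test above: keep the CSV header plus the first 20 data rows and point --data-file at the truncated file. The file names here stand in for my own setup; adjust them for yours (the demo CSV contents are made up for illustration).

```shell
# Build a small stand-in for load_data_merged.csv (header + 100 data rows),
# purely so this example is self-contained.
printf 'FileName_DNA,PathName_DNA\n' > load_data_merged.csv
for i in $(seq 1 100); do
  printf 'img_%03d.tif,/data/plate1\n' "$i" >> load_data_merged.csv
done

# Keep the header plus the first 20 data rows as a quick smoke test.
head -n 21 load_data_merged.csv > load_data_first20.csv

wc -l load_data_first20.csv   # header + 20 rows
```

If the truncated CSV runs to completion and the full one hangs, the image-list size is the problem rather than the pipeline itself.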

Actually, I believe LoadData is the preferred way to set up huge lists of image files; we’ve had projects with millions of images. Hopefully someone else can say what the issue might be here, and whether it’s plausible that you simply need to wait longer for the flush to finish.

LoadData is indeed the preferred way to set up huge lists of files, but there is some under-the-hood importing happening that slows it down if batch files aren’t used.
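For reference, the batch-file workflow looks roughly like this (a sketch, not a verified recipe: it assumes the pipeline ends with a CreateBatchFiles module, and the paths are placeholders):

```shell
# One-time setup run: resolves the image list and writes Batch_data.h5
# into the output directory. This is where the slow import cost is paid.
cellprofiler -c -r -p /path/to/pipeline_with_CreateBatchFiles.cppipe \
    --data-file=load_data_merged.csv -o /path/to/output

# Subsequent runs take the batch file as the pipeline, so each
# image-range job skips the import and starts quickly.
cellprofiler -c -r -p /path/to/output/Batch_data.h5 -f 1 -l 20 \
    -o /path/to/output/run_001
```

The point is that the expensive CSV import happens once, in the setup run, instead of once per job.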

When I asked @shsingh about your issue here’s what he had to say:

yes, both --data-file and --file-list are slow and I use BatchFile to speed up.

In https://github.com/shntnu/bpcp/blob/master/run_analysis_pipeline.sh, I create a BatchFile because I need to run the same pipeline multiple times, once per group. There is a setup cost and subsequent runs are faster.

But I don’t know what happens if there are hundreds of thousands of files. Presumably CreateBatchFiles will help, because that’s what Mark and David used to use.

My suggestion would be to process everything one plate at a time. It’s much easier to keep track of things that way.
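A sketch of the one-plate-at-a-time idea: split the LoadData CSV into one CSV per plate, then submit each as its own headless job. The column layout here is an assumption (plate name in the second column, e.g. a Metadata_Plate column); adjust the awk field to match your actual CSV.

```shell
# Build a tiny stand-in for load_data_merged.csv so the example runs.
printf 'FileName_DNA,Metadata_Plate\n' > load_data_merged.csv
printf 'a.tif,P1\nb.tif,P2\nc.tif,P1\n' >> load_data_merged.csv

# Split by plate: write the header once per output file, then append
# each row to the file for its plate ($2 is assumed to be the plate).
awk -F, 'NR==1 { header=$0; next }
         { out="load_data_" $2 ".csv"
           if (!(out in seen)) { print header > out; seen[out]=1 }
           print >> out }' load_data_merged.csv

wc -l load_data_P1.csv load_data_P2.csv
```

Each per-plate CSV can then be passed to a separate cellprofiler invocation via --data-file.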

Thank you both; it sounds like splitting up the workflow and using CreateBatchFiles is the more sensible strategy for larger screens.