CellProfiler 4.1.3 Module: ExportToDatabase - Wall_time

Hi Everyone,

First of all, I should mention that I am a new user of CellProfiler, so my experience is limited. I am currently using CP 4.1.3. In my analysis, I am processing roughly 200,000 images with 4 images per image set, i.e. 50,000 image sets. Since CP 4.1.3 no longer uses the HDF5 file format, I am exporting to the more accessible SQLite database type using the ExportToDatabase module at the end of my pipeline. I am experiencing long wall_time values (image below). When I run only a selection of image sets, say 50, I do not see these long wall_time values. Does anyone have a tip or trick to reduce the wall_time in the ExportToDatabase module when running a large number of image sets? The long wall_time is really slowing down my analysis.

Many thanks in advance! :slight_smile:

Hugo van Kessel

Hi @kesselhwvan,

It’d be super helpful if you could upload a copy of the pipeline so that we can take a look at what’s going on and the settings you used. Is it possible that you’re capturing very large numbers of objects or measurements?

Hi @DStirling

The pipeline: 20211703_HVK001_CP413_pipeline_v1.cppipe (21.4 KB)

To be more precise, I am running ~50,000 image sets containing 4 images each. This experiment contained 3 replicates, 3 timepoints and 9 plates, i.e. 1 replicate contains 27 plates (3 timepoints times 9 plates = 27). When I run the pipeline with all the images, the estimated runtime is approx. 150-160 hours, but when I run only one plate, which contains 616 image sets, it takes only 30 minutes. The entire set should therefore take only 3 replicates * 27 plates * 0.5 h = 40.5 hours, not 150+ hours. It seems that loading all the images into CP at once somehow affects the wall_time in the ExportToDatabase module when writing to an SQLite database.
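As a quick check of that estimate, here is a small Python sketch using only the figures quoted in this post. It assumes runtime scales linearly with the number of plates, which is the assumption behind the 40.5 hour figure.

```python
# Figures taken from the post; linear scaling per plate is an assumption.
replicates = 3
plates_per_replicate = 27      # 3 timepoints x 9 plates
image_sets_per_plate = 616
hours_per_plate = 0.5          # observed when a single plate is run alone

total_plates = replicates * plates_per_replicate
total_image_sets = total_plates * image_sets_per_plate
expected_hours = total_plates * hours_per_plate

# Runtime estimated by CellProfiler when everything is loaded at once.
observed_hours_low, observed_hours_high = 150, 160
slowdown_low = observed_hours_low / expected_hours
slowdown_high = observed_hours_high / expected_hours

print(f"plates: {total_plates}, image sets: {total_image_sets}")
print(f"expected with linear scaling: {expected_hours} h")
print(f"estimated run is {slowdown_low:.1f}x to {slowdown_high:.1f}x slower")
```

So the full run is estimated to be roughly four times slower than the single-plate timing would predict.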

Hugo van Kessel

I can’t say that we’ve extensively tested running that many in a single instance of CellProfiler; typically we would break a set of that size up and run it on a cluster. In our group, we usually give each plate its own SQLite database.
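If each plate ends up in its own SQLite file, the per-plate tables can be combined afterwards. The following is only a minimal sketch using Python's standard `sqlite3` module, not something CellProfiler provides. The table name `Per_Image`, the two-column layout, and the file names are assumptions for the demo; check the table names and prefix that your own ExportToDatabase settings produce.

```python
import sqlite3
import tempfile
from pathlib import Path

TABLE = "Per_Image"  # assumed table name; check your ExportToDatabase settings


def merge_plate_databases(plate_dbs, merged_db, table=TABLE):
    """Append `table` from every per-plate SQLite file into `merged_db`."""
    # isolation_level=None puts the connection in autocommit mode, so no
    # transaction is left open when a source database is detached.
    out = sqlite3.connect(str(merged_db), isolation_level=None)
    try:
        for db in plate_dbs:
            out.execute("ATTACH DATABASE ? AS src", (str(db),))
            # Create an empty target table with the same columns if needed.
            out.execute(
                f'CREATE TABLE IF NOT EXISTS main."{table}" AS '
                f'SELECT * FROM src."{table}" WHERE 0'
            )
            out.execute(f'INSERT INTO main."{table}" SELECT * FROM src."{table}"')
            out.execute("DETACH DATABASE src")
        return out.execute(f'SELECT COUNT(*) FROM main."{table}"').fetchone()[0]
    finally:
        out.close()


# Demo with two tiny stand-in plate databases (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    plate_dbs = []
    for plate in ("plate01", "plate02"):
        path = Path(tmp) / f"{plate}.db"
        con = sqlite3.connect(str(path))
        con.execute(f'CREATE TABLE "{TABLE}" (ImageNumber INTEGER, Plate TEXT)')
        con.executemany(
            f'INSERT INTO "{TABLE}" VALUES (?, ?)', [(1, plate), (2, plate)]
        )
        con.commit()
        con.close()
        plate_dbs.append(path)
    merged_rows = merge_plate_databases(plate_dbs, Path(tmp) / "merged.db")

print(merged_rows)
```

Note that image numbers will likely repeat across plates when each plate is run separately, so a merged table needs a plate column (or similar) to keep rows distinguishable. `CREATE TABLE ... AS SELECT` also does not carry over keys or indexes.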

I’m not certain to what degree the slowdown is something we can avoid. Does the slowdown start right away when many files are loaded, as it seems from your screenshot, or does it only kick in during later image sets? If it’s slow right from the beginning, it’s more likely a CellProfiler-specific issue; if it’s not slow until later, it might or might not be CellProfiler-specific, as opposed to just “updating really big databases is slow”.

One other thing that can affect performance: can you confirm that you were running the same number of workers in the two cases you mentioned, the fast one and the slow one?

Hi @bcimini

I can confirm I was running the same number of workers. The slowdown starts immediately after starting the pipeline. What I have done now to work around it (image below) is use the CreateBatchFiles module to create batch commands per replicate and run those simultaneously in three separate terminals. I think the issue lies within the ExportToDatabase module: when writing to the SQLite database, there seems to be some sort of lock. My solution is of course somewhat temporary, because it does not use multiple workers per terminal. However, running the replicates separately on 3 individual workers is faster than running everything with multiple workers in one run.
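For anyone who wants to script this workaround instead of opening three terminals, here is a rough Python sketch. It assumes the batch file written by CreateBatchFiles is named `Batch_data.h5`, that `cellprofiler` is on the PATH, and that the image sets are ordered by replicate so that each replicate is a contiguous range. The ranges below are hypothetical and should be replaced with your own. The flags used are CellProfiler's headless options: `-c` (no GUI), `-r` (run the pipeline), `-p` (pipeline or batch file), `-f` and `-l` (first and last image set).

```python
import subprocess


def build_commands(batch_file, ranges, executable="cellprofiler"):
    """Build one headless CellProfiler command per (first, last) image-set range."""
    return [
        [executable, "-c", "-r", "-p", batch_file, "-f", str(first), "-l", str(last)]
        for first, last in ranges
    ]


def run_in_parallel(commands):
    """Start all commands at once and wait for each; returns their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Hypothetical ranges: 49,896 image sets split evenly into 3 replicates.
ranges = [(1, 16632), (16633, 33264), (33265, 49896)]
commands = build_commands("Batch_data.h5", ranges)

for cmd in commands:
    print(" ".join(cmd))

# Uncomment to actually launch the three runs side by side:
# exit_codes = run_in_parallel(commands)
```

SQLite allows only one writer at a time, so it is worth checking whether the separate runs write to the same database file or to separate ones.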