Improving CellProfiler Speed




I have been running a CellProfiler 3.0 pipeline locally on my laptop for some time. For a typical 96-well plate (4 fields, 3 channels), it takes around 3-5 hours to run. We recently upgraded to 384-well plates in hopes of starting screening, and it’s taking well over 20 hours/plate. The program always indicates that it is running 4 workers, and I have done as much as I can to shorten the processing time (hiding all windows on run, reducing measurements collected, etc.). I am wondering what would be the best way to improve processing time. We are open to getting a new desktop if necessary, or running on the cloud if that is feasible (my understanding of cloud computing is minimal). Thank you very much!


Without knowing anything about your setup or pipeline I can’t really give good advice other than:

  • Make sure you’re using ExportToDatabase (SQLite mode is a good place to start), not ExportToSpreadsheet.
  • Give CellProfiler a temp directory with lots of file space.

A dedicated image processing machine with plenty of RAM and disk space isn’t a bad idea if you’re planning to do screening. If you want to try running in the cloud, there are ways to do that, but whether that’s better than buying a new machine depends on a lot of factors: your comfort with executing things from the terminal, whether you’d rather spend a lot of money up front or smaller amounts over time, and exactly how much data you want to run and how complex your pipeline is.
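If you do go the server or cloud route, the usual approach is to run CellProfiler headless from the terminal rather than through the GUI. A rough sketch of what that looks like (the paths are placeholders; check `cellprofiler --help` for the exact flags available in your version):

```shell
# Run a pipeline without the GUI: -c suppresses the GUI, -r runs the
# pipeline immediately, -p points at the pipeline file exported from the
# GUI, -i sets the input image folder, and -o the output folder.
cellprofiler -c -r -p /path/to/pipeline.cppipe -i /path/to/images -o /path/to/output

# To split a plate across several machines or batch jobs, give each
# invocation a slice of the image sets with -f (first) and -l (last):
cellprofiler -c -r -p /path/to/pipeline.cppipe -i /path/to/images -o /path/to/output -f 1 -l 96
```

Splitting with -f/-l is how most cluster and cloud setups parallelize a plate: each job processes its own range of image sets and writes to its own output, and the results are merged afterwards.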


I can attach my pipeline if that could help identify the best solution. I am completely unfamiliar with running programs in the cloud, but if it is simple to learn, I am open to it.


@bcimini Why is exporting to a database faster than exporting to a CSV file? I did some preliminary testing with a small image set (only 6 images), and the pipeline took 4 min 10 sec with ExportToSpreadsheet and 5 min 30 sec with ExportToDatabase (SQLite). I realize that with such a small sample set there might be some overhead that confounds the results, and I will try with larger samples as soon as I have figured out how to see the total run time after completion of the pipeline, so that I don’t have to wait around to time the results. But it would be interesting to hear your experience and the reason why database export would be faster.


For large sets it is faster because the measurements are written to the database as you go, not held in a temporary file that then has to be written out all at once at the end. That’s my understanding anyway.
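A rough sketch of the difference, in plain Python rather than CellProfiler's actual code (the table name, column names, and measurement values here are all made up for illustration):

```python
import csv
import os
import sqlite3
import tempfile

# Hypothetical per-image-set measurements, standing in for what a
# pipeline produces as each image set finishes.
measurements = [{"ImageNumber": i, "Count_Nuclei": 10 * i} for i in range(1, 4)]

# Streaming style (like ExportToDatabase): each image set's row is
# committed to disk as soon as it is measured, so nothing accumulates.
db_path = os.path.join(tempfile.gettempdir(), "demo_measurements.sqlite")
if os.path.exists(db_path):
    os.remove(db_path)
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE Per_Image (ImageNumber INTEGER, Count_Nuclei INTEGER)")
for row in measurements:
    conn.execute("INSERT INTO Per_Image VALUES (?, ?)",
                 (row["ImageNumber"], row["Count_Nuclei"]))
    conn.commit()  # flushed to disk per image set
conn.close()

# Buffered style (like ExportToSpreadsheet): every row is held until the
# run ends, then the whole table is dumped in one pass.
buffered = []
for row in measurements:
    buffered.append(row)  # accumulates for the entire run
csv_path = os.path.join(tempfile.gettempdir(), "demo_measurements.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ImageNumber", "Count_Nuclei"])
    writer.writeheader()
    writer.writerows(buffered)
```

With 6 images the buffered dump at the end is trivially cheap, which would explain why you saw no benefit at that scale; the streaming approach pays off when the accumulated measurements for a full 384-well plate would otherwise have to be held and written out in one go.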