Improving CellProfiler Speed

Hello,

I have been running a CellProfiler 3.0 pipeline locally on my laptop for some time. For a typical 96-well plate (4 fields, 3 channels), it takes around 3-5 hours to run. We recently upgraded to 384-well plates in the hope of starting screening, and it's now taking well over 20 hours per plate. The program always indicates that it is running 4 workers, and I have done as much as I can to shorten the processing time (hiding all windows during the run, reducing the measurements collected, etc.). I am wondering what would be the best way to improve processing time. We are open to getting a new desktop if necessary, or to running on the cloud if that is feasible (my understanding of cloud computing is minimal). Thank you very much!

Without knowing anything about your setup or pipeline I can't give much specific advice beyond the following:

  • Make sure you’re using ExportToDatabase (SQLite mode is a good place to start) rather than ExportToSpreadsheet.
  • Give CellProfiler a temp directory with plenty of free disk space.

A dedicated image-processing machine with plenty of RAM and disk space isn’t a bad idea if you’re planning to do screening. There are also ways to run in the cloud, but whether that’s better than buying a new machine depends on a lot of factors: your comfort with executing things from the terminal, whether you’d rather spend a lot of money up front or smaller amounts over time, and exactly how much data you want to run and how complex your pipeline is.
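If you do end up experimenting with the terminal or the cloud, the basic building block is a headless run over a range of image sets. Below is a rough sketch, assuming the standard headless flags (-c, -r, -p, -o, -f, -l) and that `cellprofiler` is on your PATH; the pipeline path, output folder and image-set counts are placeholders.

```python
import subprocess

# Placeholder values for illustration only.
PIPELINE = "analysis.cppipe"
OUTPUT = "output"
TOTAL_IMAGE_SETS = 384 * 4   # e.g. a 384-well plate imaged at 4 fields
N_CHUNKS = 4                 # how many headless runs to launch in parallel

chunk_size = TOTAL_IMAGE_SETS // N_CHUNKS
jobs = []
for i in range(N_CHUNKS):
    first = i * chunk_size + 1
    last = TOTAL_IMAGE_SETS if i == N_CHUNKS - 1 else (i + 1) * chunk_size
    # -c: no GUI, -r: run on startup, -f/-l: first/last image set of this chunk
    jobs.append(subprocess.Popen([
        "cellprofiler", "-c", "-r",
        "-p", PIPELINE,
        "-o", f"{OUTPUT}/chunk_{i + 1}",
        "-f", str(first), "-l", str(last),
    ]))

for job in jobs:
    job.wait()
```

Each chunk writes into its own output folder, so the results have to be combined afterwards; cloud batch setups are essentially this same idea scaled out across many machines.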

I can attach my pipeline if that could help identify the best solution. I am completely unfamiliar with running programs in the cloud, but if it is simple to learn, I am open to it.

@bcimini Why is exporting to a database faster than exporting to a CSV file? I did some preliminary testing with a small image set (only 6 images), and the pipeline took 4 min 10 s with ExportToSpreadsheet and 5 min 30 s with ExportToDatabase (SQLite). I realize that with such a small sample set there might be some overhead that confounds the results, and I will try with larger sample sets as soon as I have figured out how to see the total run time after completion of the pipeline, so that I don’t have to wait around to time the results. But it would be interesting to hear your experience and the reason why database export would be faster.
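For timing, I am thinking of running the pipeline headless from the terminal and timing the whole process with something like the minimal sketch below (untested; the pipeline and output folder names are placeholders, and it assumes `cellprofiler` is on the PATH):

```python
import subprocess
import time

# Time one complete headless run, so two pipeline variants (ExportToSpreadsheet
# vs. ExportToDatabase) can be compared without watching the GUI.
start = time.perf_counter()
subprocess.run(
    ["cellprofiler", "-c", "-r",
     "-p", "analysis_with_ExportToDatabase.cppipe",   # placeholder pipeline
     "-o", "timing_test_output"],                      # placeholder output folder
    check=True,
)
print(f"Total run time: {time.perf_counter() - start:.1f} s")
```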

Hi,
For large sets it is faster because the measurements are written to the database as you go, rather than held in a temporary file that then has to be written out all at once at the end. That’s my understanding, anyway.
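Here is a toy sketch of the difference (not CellProfiler’s actual code; the measurement values, column names and file names are made up):

```python
import csv
import sqlite3

def fake_measurements(image_number, n_objects=100):
    """Pretend per-object measurements for one image set."""
    return [(image_number, obj, float(obj)) for obj in range(1, n_objects + 1)]

# Database-style export: each image set's rows go to disk as soon as they are
# produced, so nothing has to pile up over a long run.
conn = sqlite3.connect("toy_measurements.db")
conn.execute("CREATE TABLE IF NOT EXISTS per_object "
             "(ImageNumber INTEGER, ObjectNumber INTEGER, Intensity REAL)")
for image_number in range(1, 7):
    conn.executemany("INSERT INTO per_object VALUES (?, ?, ?)",
                     fake_measurements(image_number))
    conn.commit()   # written as-you-go
conn.close()

# Spreadsheet-style export: everything is accumulated first and written out in
# one big pass at the end, which is where large runs spend the extra time.
all_rows = []
for image_number in range(1, 7):
    all_rows.extend(fake_measurements(image_number))
with open("toy_measurements.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["ImageNumber", "ObjectNumber", "Intensity"])
    writer.writerows(all_rows)
```

With only 6 image sets the database set-up overhead dominates, which is probably why the CSV run came out faster in your test; the benefit shows up on larger runs.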

Just a follow-up question: when trying to use ExportToDatabase with SQLite, I get the error “MySQL Error: maximum columns reached…”. I am a newbie to CellProfiler and therefore do not know all the information needed to run MySQL. The speed does not really change using MySQL/CSV either, so I am wondering how else to speed things up, since I do not get any other error messages and, if I wait for the analysis to finish, I do get the desired outputs.

Hi @Cecile_Meier-Scherli,

Could you confirm whether you’re using MySQL or SQLite? If you have excessive numbers of columns you may want to try using the module in “One table per object” mode rather than “combined object table”.

In terms of speed, there may be some limitations in how fast the data can actually be written into the database. With large output files it may just be necessary to give it time to perform these operations.

I had the error using SQLite… I will try your recommendation, thank you!

When trying to use SQLite with “one table per object”, I get the attention sign next to the “eye” symbol. The warning says: “You will have to merge the separate object tables in order to use CellProfiler Analyst fully, or you will be restricted to only one object’s data at a time in CPA. Choose ‘Single object table’ or ‘Single object view’ to write a single object table…”

If I use “Single object view” I get the following message: “Do you want ExportToDatabase to drop the MyExpt2_Per_cells, MyExpt2_Per_Cytoplasm and MyExpt2_Per_Nuclei tables?”… It then explains that if I choose YES, it will create the tables again, discarding all existing data, and if I choose NO, it will keep the existing data and overwrite data as necessary…

Do you need to consider multiple object types at the same time in CellProfiler Analyst (e.g. Nuclei and Cells)? If not, the separate object tables aren’t a problem.

Regarding dropping tables, this happens because the database file you selected already exists. It’s asking whether you want to start a new table. It might be best to start a new database file.
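For what it’s worth, the “single object view” option essentially builds something like the sketch below on top of the separate per-object tables. The table and column names here are guesses based on the defaults (they won’t match your project exactly), and it assumes one Cell per Nucleus:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real .db file
conn.executescript("""
-- Toy versions of two per-object tables (names and columns are assumptions).
CREATE TABLE MyExpt2_Per_Nuclei (
    ImageNumber INTEGER,
    Nuclei_Number_Object_Number INTEGER,
    Nuclei_AreaShape_Area REAL);
CREATE TABLE MyExpt2_Per_Cells (
    ImageNumber INTEGER,
    Cells_Number_Object_Number INTEGER,
    Cells_AreaShape_Area REAL);

-- A single "object view" joins them on image number and object number so a
-- tool like CellProfiler Analyst can treat them as one table.
CREATE VIEW MyExpt2_Per_Object AS
SELECT n.ImageNumber,
       n.Nuclei_Number_Object_Number AS ObjectNumber,
       n.Nuclei_AreaShape_Area,
       c.Cells_AreaShape_Area
FROM MyExpt2_Per_Nuclei AS n
JOIN MyExpt2_Per_Cells AS c
     ON c.ImageNumber = n.ImageNumber
    AND c.Cells_Number_Object_Number = n.Nuclei_Number_Object_Number;
""")
print(conn.execute("SELECT * FROM MyExpt2_Per_Object").fetchall())  # empty but valid
```

That join only works when the objects correspond one-to-one, which is why the view option is generally only useful for related objects like Nuclei/Cells/Cytoplasm that share object numbers.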

Hi @DStirling

I tried to change the database name but had no success. I have attached my project below… It works successfully using ExportToSpreadsheet, but either way it takes forever: 1–3 hours for only 6 image sets…
Project_with_all_pipelines_version3.cpproj (194.5 KB)

Thanks @Cecile_Meier-Scherli,

Looking at your pipeline, you have a very large number of data columns, which is hitting the limit on how many columns an SQLite table can hold (2,000 by default). In your current pipeline this only seems to be a problem for the per-image table. A workaround would be to disable some or all of the ‘calculate per-object means/medians/standard deviations’ options; with all three enabled you’re effectively quadrupling the number of per-image columns you generate.
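To make that limit concrete, here is a quick self-contained demonstration of SQLite’s default column ceiling (nothing CellProfiler-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def try_table(n_columns):
    # Build a CREATE TABLE statement with the requested number of REAL columns.
    cols = ", ".join(f"c{i} REAL" for i in range(n_columns))
    try:
        conn.execute(f"CREATE TABLE t{n_columns} ({cols})")
        print(f"{n_columns} columns: OK")
    except sqlite3.OperationalError as err:
        print(f"{n_columns} columns: {err}")

try_table(2000)   # at the default limit  -> succeeds
try_table(2001)   # one past the limit    -> "too many columns"
```

Dropping the per-object mean/median/standard-deviation aggregates is the easiest way to get back under that ceiling.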

Sorry, I forgot. I have attached the analysis pipeline. I disabled all of the “calculate per-object means/medians/standard deviations” options. It now says that “NucleiOutlines is not one of OrigDNA,…”.

001001-1-001001001.tif (2.2 MB) 001001-1-001001002.tif (2.2 MB) 001001-1-001001003.tif (2.2 MB) 001001-1-001001004.tif (2.2 MB) 001001-1-001001005.tif (2.2 MB) 001001-2-001001001.tif (2.2 MB) 001001-2-001001002.tif (2.2 MB) 001001-2-001001003.tif (2.2 MB) 001001-2-001001004.tif (2.2 MB) 001001-2-001001005.tif (2.2 MB) 001001-3-001001001.tif (2.2 MB) 001001-3-001001002.tif (2.2 MB) 001001-3-001001003.tif (2.2 MB) 001001-3-001001004.tif (2.2 MB) 001001-3-001001005.tif (2.2 MB) 001001-4-001001001.tif (2.2 MB) 001001-4-001001002.tif (2.2 MB) 001001-4-001001003.tif (2.2 MB) 001001-4-001001004.tif (2.2 MB) 001001-4-001001005.tif (2.2 MB) 001001-5-001001001.tif (2.2 MB) 001001-5-001001002.tif (2.2 MB) 001001-5-001001003.tif (2.2 MB) 001001-5-001001004.tif (2.2 MB) 001001-5-001001005.tif (2.2 MB) 001001-6-001001001.tif (2.2 MB) 001001-6-001001002.tif (2.2 MB) 001001-6-001001003.tif (2.2 MB) 001001-6-001001004.tif (2.2 MB) 001001-6-001001005.tif (2.2 MB)

analysis_full.cppipe (55.7 KB)

Did you remove a module or rename an output? A ‘missing image’ error is unrelated to Export problems.

Hi @DStirling
Yes, I have changed a few things… I have now attached an unaltered analysis pipeline; the only thing I deleted is the last module, CreateBatchFiles. While running this in CellProfiler 2.2 I encountered an error (attached). It now says that it will take 27 h for only 144 image sets…

The images I used are the same as above, but the Plate_illum*.mat files come from a larger set of images.

1_IllumAGP.mat (4.4 MB) 1_IllumDNA.mat (4.4 MB) 1_IllumER.mat (4.4 MB) 1_IllumMito.mat (4.4 MB) 1_IllumRNA.mat (4.4 MB) analysis.cppipe (39.0 KB)

That error is common when running large groups of images in CellProfiler 2.2. You can try to avoid it by closing all the eyes next to the modules, but above a certain number of images it’s impossible to avoid entirely (the exact number depends on the number of modules). Moving to CellProfiler 3 should avoid the issue to some degree.