Adjusting/preparing In/Out folder while creating the pipeline for batch-processing

Dear all,

I’m building a CellProfiler pipeline that I will use on a cluster for batch processing. Since the pipeline has to be made with the GUI version, we can just drag and drop the folder and add the necessary modules to create it. But my input/output folders on the cluster will be different from the ones on my computer. So what changes do I have to make to the default input/output folders?
In the module CreateBatchFiles, there is an option to choose the output folder path. I gave one of the cluster folder paths for that purpose.

For convenience, I’m creating the above pipeline with only 3 images, but I will batch process around 8000 images. Will that cause any compatibility issues?

Hi @mmajumde,

On the server, you can try using the “LoadData” module, which would take care of the input module details that you set up in the GUI.

Regards,
Lakshmi
www.wakoautomation.com
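
As a rough illustration of the LoadData suggestion above, the sketch below builds a CSV listing every image in a folder. It assumes LoadData's usual column convention of Image_FileName_<name> and Image_PathName_<name>, which is worth checking against the LoadData help for your CellProfiler version. The function name write_loaddata_csv, the channel name "DNA", and the glob pattern are placeholders chosen for this example.

```python
import csv
from pathlib import Path


def write_loaddata_csv(image_dir, csv_path, channel="DNA", pattern="*.tif*"):
    """Write a CSV with one row per image and return the number of rows.

    The channel name and pattern are assumptions; change them to match
    the image name used in your pipeline and your file extensions.
    """
    image_dir = Path(image_dir).resolve()
    files = sorted(p for p in image_dir.glob(pattern) if p.is_file())
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Header: file name column and folder column for the chosen channel.
        writer.writerow([f"Image_FileName_{channel}", f"Image_PathName_{channel}"])
        for p in files:
            writer.writerow([p.name, str(p.parent)])
    return len(files)
```

The paths written here are the ones seen by the machine that runs the script, so the CSV would need to be generated on the cluster (or with the cluster's paths) to be usable there.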

Hello,

First of all, just want to make sure you’ve seen our documentation and our YouTube video about headless processing, in case those help!

Briefly:

  • It doesn’t matter what you set DefaultInput and DefaultOutput to on the GUI machine you’re using to set up the pipeline; you’ll need to set them at runtime on your cluster with -i and -o. Think of them as variables.
  • If you want to use CreateBatchFiles, it will only create a batch file for the files that are loaded into the GUI when the pipeline is executed there, so having 3 files in your GUI won’t work. The paths specified there are supposed to be find-and-replace boxes SPECIFICALLY for loaded files.
  • To get it to load in all of your files, you can load them all into your GUI and use CreateBatchFiles OR you can use LoadData as @lakshmi suggested if it’s straightforward for you to construct a CSV, OR the --file-list option (see the documentation I linked), OR just point the -i option at cluster-runtime to the folder where your images are and let CellProfiler figure it out (which takes the most processing time, since you need to let each copy of CellProfiler assess the whole folder, but doesn’t require you spending YOUR time doing anything).
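
For the --file-list route mentioned above, one possible way to build the list is sketched below: it writes one absolute image path per line to a text file. The helper name write_file_list, the set of extensions, and the output file name are assumptions made for this example, not anything CellProfiler provides.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg"}


def write_file_list(image_dir, list_path):
    """Write one absolute image path per line and return the number of paths."""
    image_dir = Path(image_dir).resolve()
    paths = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return len(paths)
```

The resulting file would then be passed to the headless run with --file-list in place of (or alongside) -i, as described in the documentation linked above.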

I’ve been running batch jobs on a local non-cluster server and have run into some issues getting the correct output path:

My normal assay development pipelines use “Default output sub-folder”, followed by a partial path specific to the project and ending in a couple of metadata subfolders for organization.
I don’t use LoadData or other modules to load my images. I just let each job on the server read the folder it was assigned and make its own list (this takes just a couple of minutes for 2000 images).
I’ve also had to be careful with the output file location, since the batch behavior seems to differ between CP3 and CP4.
In CP4.0.7:

  1. Use “Default Output Sub-folder” for the output file location in the GUI and provide a hardcoded path that ends in the folder metadata variables: SCFOO_PI\<Library>\<Date>\<Plate>.
  2. In the batch .cmd file that runs the pipeline, -o "some path" has given me mixed results. With CP4.0.7, I need to set -o X:, which then apparently gets used as the “Default output folder” equivalent and is prepended to my partial path above.

Because I need to be able to use my pipeline in both batch headless and GUI modes, I need that partial path information.
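
To make the behavior described above concrete, here is a small sketch that mimics it: metadata tags in the partial path are filled in, and the result is appended to whatever -o supplies. This is not CellProfiler's own code, only an illustration of the observed outcome; the function name resolve_output_path and the metadata values in the usage note are made up for the example.

```python
import re
from pathlib import PureWindowsPath


def resolve_output_path(default_output, subfolder_template, metadata):
    """Fill <Tag> placeholders from metadata and join onto the default output folder."""
    def substitute(match):
        # Raises KeyError if the template names a tag that has no value.
        return str(metadata[match.group(1)])

    relative = re.sub(r"<(\w+)>", substitute, subfolder_template)
    return PureWindowsPath(default_output) / relative
```

For instance, with a default output of X:\ and hypothetical values Library=LibA, Date=20210426, Plate=P1, the template SCFOO_PI\<Library>\<Date>\<Plate> resolves to X:\SCFOO_PI\LibA\20210426\P1.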

Hello all, thanks to all of you for your kind answers.

Hey Beth, I read the documentation and watched the YouTube video you shared. These links were really helpful. I ran the CellProfiler GUI on my ~2045 images. It ran without showing any errors, and data from all the modules were generated. But when I ran the same pipeline headlessly, nothing was generated apart from a DB file; I got an error and the run stopped.

Command:

cellprofiler -c -r -p DapiAll.cpproj -o output/ -i Dapi/

Error Message:

MyExpt_Per_Image and MyExpt_Per_Object tables already in the database and overwrite not allowed. Exiting

My pipeline is below:
DapiAll.cpproj (1.8 MB)

Thanks again for the help.

Hi @mmajumde,

I’m glad the documentation and video tutorials were helpful.

Based on your error message, it seems that there’s already a database in your output folder that contains data (a MyExpt_Per_Image table and a MyExpt_Per_Object table). The pipeline that you shared is configured to never overwrite the database, which is why CellProfiler exited:

[Screenshot: the “Overwrite without warning?” setting in the shared pipeline]

The corresponding help for the “Overwrite without warning?” parameter:
[Screenshot: help text for the “Overwrite without warning?” parameter]

You can either:

  • change this parameter to allow overwriting of the database or
  • create a new output folder in order to save the output from this run of CellProfiler in a different location and have access to both the original output from the CellProfiler GUI and the output from the headless run

Note that if you do change the “Overwrite without warning?” parameter, you’ll also need to change the “Overwrite existing files without warning?” parameter on the SaveImages module.

Hope this helps. Good luck!
Pearl
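
For the second option above (a new output folder per run), one way to avoid colliding with an earlier database is to create a timestamped folder before each headless run and pass it to -o. The sketch below is a minimal version of that idea; the function name fresh_output_dir and the "run" prefix are assumptions made for this example.

```python
from datetime import datetime
from pathlib import Path


def fresh_output_dir(base_dir, prefix="run"):
    """Create and return a new, previously non-existent folder under base_dir."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    candidate = Path(base_dir) / f"{prefix}_{stamp}"
    counter = 1
    # If two runs start within the same second, add a numeric suffix.
    while candidate.exists():
        candidate = Path(base_dir) / f"{prefix}_{stamp}_{counter}"
        counter += 1
    candidate.mkdir(parents=True)
    return candidate
```

The returned path can be used as the -o argument, so the output from the GUI run and from each headless run stays in separate folders.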

Hey @pearl-ryder, thanks for the answer. I made the changes accordingly and got the following errors. I don’t understand the reason for the final OSError, since we are providing the input and output folders at runtime. The directory F://Dapi All is the input directory on my local computer; I created the pipeline on my local machine using it as my input folder. Could you please give me an idea of what the reason might be?

Can you use the .cppipe rather than the .cpproj? .cpproj files include input image information and .cppipe files don’t, which makes .cppipe files safer to run on a different machine.

Hello @bcimini, thanks a lot for the reply. I followed your lead and changed the file type from .cpproj to .cppipe. Now I’m getting the following error.

The image files I’m using are .tiff files.

For this error, the first thing I would check is if the input modules in your pipeline.cppipe file are able to create image sets for the images in your input folder. For that, I’d return to the CellProfiler GUI, upload that specific pipeline, and add at least one image set from your input folder. Then I’d check to see if you’re able to run the first step of the module in test mode. You may find an error in the input modules that needs adjusting so that image sets are configured correctly.

Let us know how that goes and good luck!

Hello @pearl-ryder, thanks for the reply. I ran the pipeline step by step in the CellProfiler GUI in test mode. It doesn’t show any errors, and the image sets are configured correctly.

One more question: you will see in my uploaded image that the output DB file location is set to a location on my local machine. Could this be the reason for the problem? Then again, in headless mode we set the input/output folders at runtime.

Thanks a ton again.

From whatever terminal + location you were running cellprofiler -c -r -p pipeline.cppipe -i Dapi -o output/, can you run ls Dapi | head ?

But also, to directly answer your question, you’re correct that since you set input and output at runtime, ‘Default Output Folder’ showing a local location when you’re in your GUI is fine; Default Output Folder, which is the same thing as -o, is just a variable, so it’s fine for it to be set to different things in different places.

Running the above command shows 10 file names from the Dapi folder.

Yes, can you post a screenshot or copy in the text? I am hoping to see if there is any obvious reason why CellProfiler might not like those files.

The only thing I noticed in the project file you previously posted that might cause an issue is that you have the Metadata module turned on but not configured correctly (that is, the things it’s trying to extract from the file name are not being extracted). As far as I know that shouldn’t block headless execution, but it’s possible that it does and I just don’t know it. If you currently have Metadata on, can you try turning it off, resaving your pipeline file, and seeing if that helps? I’m assuming the project file is materially similar to your pipeline.cppipe, but perhaps not entirely, because it looks like you’ve removed a module in your most recent screenshot; if you could upload pipeline.cppipe, that would help us make sure.

  1. I got the following output from the ls Dapi | head command:

  2. I ran the pipeline again with “Extract metadata” set to No. I still get the same error as before.

  3. To address the question “The only thing I noticed that might potentially cause an issue in the project file you previously posted (which I assume is materially similar to your pipeline.cppipe, but perhaps not entirely because it looks like you’ve removed a module in your most recent screenshot?”: I only removed the SaveImages module, since it increases the run time on my laptop.

  4. I’m sharing the pipeline you asked for.

pipeline.cppipe (8.3 KB)

Thanks a lot again for the help.

Hmm, ok, the last thing I can think of is to take one of your images, create a copy in the same folder with a new name without spaces or brackets (ie Patient053-ID-M-Position0-2145_XY1606331728_Z0_T000_C00000.tif), and see if that copied image (and only that copied image) now runs.

In GUI mode, CellProfiler can typically handle “non-ideal-file-name-characters” like spaces and brackets, but it’s possible that there is a bug in headless mode where they aren’t being handled nicely. In that case, presuming you don’t want to rename all your files (which I presume you do not), what I’d say is that you should plan to run with one of the other input modes besides just passing in -i (--data-file or --file-list or with a batchfile; you should be able to make a file for --file-list that should work just with ls Dapi >> filelist.txt).
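
If renaming does turn out to be the simplest fix, a small script can make cleaned-up copies rather than touching the originals. The sketch below replaces runs of spaces and brackets with underscores and copies each file into a separate folder. The function names safe_name and copy_with_safe_names, and the choice of underscore as the replacement, are assumptions for this example.

```python
import re
import shutil
from pathlib import Path


def safe_name(filename):
    """Replace each run of whitespace or bracket characters with one underscore."""
    return re.sub(r"[\s\[\](){}]+", "_", filename)


def copy_with_safe_names(src_dir, dst_dir):
    """Copy every file from src_dir into dst_dir under a sanitized name."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(src_dir).iterdir()):
        if src.is_file():
            target = dst / safe_name(src.name)
            # Refuse to silently merge two files that sanitize to the same name.
            if target.exists():
                raise FileExistsError(f"{target} would be overwritten")
            shutil.copy2(src, target)
            copied.append(target)
    return copied
```

Copying doubles the disk usage, so for a large image set it may be preferable to adapt the same safe_name function to rename files in place once the approach has been confirmed on a few images.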

Hello @bcimini, I think the problem was with the file names. Once I changed my file names to ones without any spaces, it ran properly. But I’m now facing an issue with the ExportToSpreadsheet module.

I uploaded my new pipeline with the ExportToSpreadsheet module.

Dapi-Subset.cpproj (467.8 KB)

Again, Thanks a lot for the help.

Can you post your pip freeze? Wondering if there’s a numpy version incompatibility

In case you are only interested in the NumPy version, it is 1.20.2. The following is the output of the pip freeze command: