CellProfiler headless with multipage tif

We are trying to run CellProfiler in headless mode with a multipage TIFF that has 4 frames.

In the Metadata module, the metadata extraction method is “Extract from image file headers”.

In NamesAndTypes, we assign a name to “All images”.

I attached the cppipe and the image for testing purposes.

In the standalone CP GUI, we exported the image CSV file and found 4 lines in it (one per frame).

We exported the cppipe and ran it in headless mode; merged_image.tif is in the “input” folder:

cellprofiler -r -c -i input -p test2.cppipe -o output

but we only get one line in the Image.csv file, corresponding to the first frame.

Are we missing some option in headless mode that makes it behave differently?

test2.cppipe (12.4 KB) merged_image.tif (1.3 MB)


Hi @Yi_Sun,

The Image.CSV file generates 1 row per image, not per frame. I ran your pipeline with the one image you provided and the Image.CSV file generated 1 row.
I think the problem is coming from your NamesAndTypes module and the setting for “Image set matching method”. Instead of “Order”, you would want to use “Metadata” and something that could separate each image, yet keep the 4-frames-per-image rule.


I only have one image here, so I can’t say which setting works better for your whole image set. So, in your pipeline, with more images loaded, take a look at the extracted metadata table:

[screenshot of the Metadata table omitted]

and see what metadata info you can use to follow the 4-channels-per-image rule and still be able to separate each image.

I hope this is clear. If you have trouble, feel free to upload more images here and I can adjust your pipeline.

Thank you for your reply, @Nasim. When we update the tables in “Metadata” or “NamesAndTypes” we see that each frame in the file is treated separately, i.e., features from each image are extracted. Using the GUI everything works as we expect, but running that pipeline from the command line as @Yi_Sun indicated only outputs the features for the first frame. Our impression is that the “Update” buttons in those modules are relevant for the rest of the pipeline and we cannot simulate that from the command line… unless there’s an option that we are missing.

Another level of complexity is that we are using multi-page TIFFs with many frames, so it’s not an option for us to input the metadata manually.

Thank you!

Could you upload a file with even just two FOVs (complete in all channels), so we can validate and explore the issue on our end? More than 2 is fine, but the file uploaded previously in this thread seems to just have 4 DAPI images in it.

Our impression is that the “Update” buttons in those modules are relevant for the rest of the pipeline and we cannot simulate that from the command line… unless there’s an option that we are missing.

It is possible that the issue is precisely that, though I looked in our GitHub issues and if that is indeed the case it has not been reported before.

If that is the case, you essentially have two options until the issue is resolved (rough example invocations below):

  1. Load everything in the way you have said works for you in the GUI, then use the CreateBatchFiles module to create a batch file that you can pass into CellProfiler in lieu of an input directory, OR
  2. Load everything in as above, then export an image set listing, swap your pipeline over to using LoadData, and pass THAT into CellProfiler in lieu of an input directory.
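
For reference, once either route is set up, the headless invocations end up looking roughly like this (the paths here are placeholders, not your actual ones):

cellprofiler -c -r -p /path/to/Batch_data.h5 -o /path/to/output
cellprofiler -c -r -p /path/to/pipeline_with_loaddata.cppipe --data-file /path/to/load_data.csv -o /path/to/output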

Thanks for your suggestions, @bcimini.

Just to give you a bit of background: @Yi_Sun has wrapped several CellProfiler modules into Galaxy tools. For that reason, we need to:

  • Run CellProfiler in headless mode.
  • Avoid massive I/O operations in an HPC environment. The trick here is to use a multi-page TIFF as a package of unrelated images (each one in a frame), as suggested in the forum. That’s why the TIFF file that we uploaded before is not complete in all channels; the frames are just different images of the same channel.

We tried your suggestions:

  1. CreateBatchFiles module: we select No in “Store batch files in the default output folder?” and input the output folder path. We are generating the h5 file on a Windows machine, but later we will need to run it on Linux. If we specify the Linux path, we get “Error: Encountered unrecoverable error in CreateBatchFiles during startup: None”. So we kept the Windows local path, but it seems to be embedded into the h5, and it fails on Linux when we run (test_batch.cppipe and Batch_data.h5 attached):
cellprofiler -c -r --get-batch-commands Batch_data.h5 -o ./output -p test_batch.cppipe

The output is:

CellProfiler -c -r -p Batch_data.h5 -f 1 -l 1
CellProfiler -c -r -p Batch_data.h5 -f 2 -l 2
CellProfiler -c -r -p Batch_data.h5 -f 3 -l 3
CellProfiler -c -r -p Batch_data.h5 -f 4 -l 4

So we run:

cellprofiler -c -r -p Batch_data.h5 -f 1 -l 1

And get:

/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/utilities/hdf5_dict.py:539: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  np.issubdtype(hdf5_type, int) or
/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/utilities/hdf5_dict.py:541: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  hdf5_type_is_float = np.issubdtype(hdf5_type, float)
Batch file default output directory, "c://users//ysun", does not exist
Batch file default input directory "c://users//ysun", does not exist
Times reported are CPU and Wall-clock times for each module
Thu May  7 12:14:48 2020: Image # 1, module Images # 1: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 12:14:48 2020: Image # 1, module Metadata # 2: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Error detected during run of module NamesAndTypes
Traceback (most recent call last):
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/pipeline.py", line 1782, in run_with_yield
    self.run_module(module, workspace)
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/pipeline.py", line 2034, in run_module
    module.run(workspace)
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/namesandtypes.py", line 1737, in run
    rescale, image_set[0])
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/namesandtypes.py", line 1805, in add_image_provider
    series, index, channel)
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/namesandtypes.py", line 1845, in add_simple_image
    self.add_provider_measurements(provider, m, cellprofiler.measurement.IMAGE)
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/namesandtypes.py", line 1859, in add_provider_measurements
    img = provider.provide_image(m)
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/namesandtypes.py", line 2358, in provide_image
    image = loadimages.LoadImagesImageProviderURL.provide_image(self, image_set)
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/loadimages.py", line 3573, in provide_image
    self.__set_image()
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/loadimages.py", line 3514, in __set_image
    self.cache_file()
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/loadimages.py", line 3418, in cache_file
    raise IOError("Test for access to directory failed. Directory: %s" % path)
IOError: Test for access to directory failed. Directory: C:/Users/ysun
Thu May  7 12:14:48 2020: Image # 1, module NamesAndTypes # 3: CPU_time = 0.23 secs, Wall_time = 0.23 secs
  2. Image set listing: we exported the metadata with the image set listing and edited the CSV to have the correct path (test.csv attached). Then we ran:
cellprofiler -c -r --data-file test.csv -o ./output -p test2.cppipe

And the output:

/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/utilities/hdf5_dict.py:539: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  np.issubdtype(hdf5_type, int) or
/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/utilities/hdf5_dict.py:541: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  hdf5_type_is_float = np.issubdtype(hdf5_type, float)
CP-JAVA 11:22:49.119 [Thread-0] WARN  o.c.imageset.ChannelFilter - Empty image set list: no images passed the filtering criteria.
Failed to prepare run for module ExportToSpreadsheet
Traceback (most recent call last):
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/pipeline.py", line 2097, in prepare_run
    if ((not module.prepare_run(workspace)) or
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/exporttospreadsheet.py", line 55                                               9, in prepare_run
    return self.check_overwrite(workspace)
  File "/home/galaxy/miniconda2/envs/testmultitiff/lib/python2.7/site-packages/cellprofiler/modules/exporttospreadsheet.py", line 80                                               4, in check_overwrite
    image_number = metadata_group.image_numbers[0]
IndexError: index 0 is out of bounds for axis 0 with size 0

Both options worked in the GUI as expected. Any hints on how to proceed from here? Thanks!

data.zip (20.1 KB)


I’m really not sure, because all of these work nicely on our end. If I use your provided image and make a version of your pipeline that uses LoadData with a new image set CSV on my machine, it runs fine:

cellprofiler -c -r -p ~/Downloads/data/test_load_data.cppipe --data-file ~/Downloads/data/load_data.csv -o ~/Downloads/data/

/Users/bcimini/Documents/GitHub/CellProfiler/CellProfiler/cellprofiler/utilities/hdf5_dict.py:539: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
np.issubdtype(hdf5_type, int) or
/Users/bcimini/Documents/GitHub/CellProfiler/CellProfiler/cellprofiler/utilities/hdf5_dict.py:541: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
hdf5_type_is_float = np.issubdtype(hdf5_type, float)
Times reported are CPU and Wall-clock times for each module
Thu May 7 08:32:48 2020: Image # 1, module LoadData # 1: CPU_time = 1.01 secs, Wall_time = 0.67 secs
/Users/bcimini/Documents/GitHub/CellProfiler/centrosome/centrosome/cpmorphology.py:4209: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
big_labels[[slice(fe,-fe) for fe in footprint_extent]] = labels
/Users/bcimini/Documents/GitHub/CellProfiler/centrosome/centrosome/cpmorphology.py:416: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
index_i, index_j, image = prepare_for_index_lookup(image, False)
/usr/local/lib/python2.7/site-packages/skimage/util/arraycrop.py:175: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
cropped = np.array(ar[slices], order=order, copy=True)
Thu May 7 08:32:49 2020: Image # 1, module IdentifyPrimaryObjects # 2: CPU_time = 0.34 secs, Wall_time = 0.33 secs
Thu May 7 08:32:49 2020: Image # 1, module MeasureObjectIntensity # 3: CPU_time = 0.11 secs, Wall_time = 0.11 secs
Thu May 7 08:32:50 2020: Image # 1, module ExportToSpreadsheet # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May 7 08:32:50 2020: Image # 2, module LoadData # 1: CPU_time = 0.07 secs, Wall_time = 0.04 secs
Thu May 7 08:32:50 2020: Image # 2, module IdentifyPrimaryObjects # 2: CPU_time = 0.29 secs, Wall_time = 0.29 secs
Thu May 7 08:32:50 2020: Image # 2, module MeasureObjectIntensity # 3: CPU_time = 0.10 secs, Wall_time = 0.10 secs
Thu May 7 08:32:50 2020: Image # 2, module ExportToSpreadsheet # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May 7 08:32:50 2020: Image # 3, module LoadData # 1: CPU_time = 0.02 secs, Wall_time = 0.02 secs
Thu May 7 08:32:50 2020: Image # 3, module IdentifyPrimaryObjects # 2: CPU_time = 0.27 secs, Wall_time = 0.28 secs
Thu May 7 08:32:50 2020: Image # 3, module MeasureObjectIntensity # 3: CPU_time = 0.10 secs, Wall_time = 0.10 secs
Thu May 7 08:32:50 2020: Image # 3, module ExportToSpreadsheet # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May 7 08:32:50 2020: Image # 4, module LoadData # 1: CPU_time = 0.03 secs, Wall_time = 0.02 secs
Thu May 7 08:32:50 2020: Image # 4, module IdentifyPrimaryObjects # 2: CPU_time = 0.28 secs, Wall_time = 0.29 secs
Thu May 7 08:32:51 2020: Image # 4, module MeasureObjectIntensity # 3: CPU_time = 0.10 secs, Wall_time = 0.10 secs
Thu May 7 08:32:51 2020: Image # 4, module ExportToSpreadsheet # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs

If I make a batch file, EITHER by not path mapping at all or by just mapping a directory to itself, that also works fine:
cellprofiler -c -r -p ~/Downloads/data/Batch_data.h5 -o ~/Downloads/data/

/Users/bcimini/Documents/GitHub/CellProfiler/CellProfiler/cellprofiler/utilities/hdf5_dict.py:539: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  np.issubdtype(hdf5_type, int) or
/Users/bcimini/Documents/GitHub/CellProfiler/CellProfiler/cellprofiler/utilities/hdf5_dict.py:541: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  hdf5_type_is_float = np.issubdtype(hdf5_type, float)
Times reported are CPU and Wall-clock times for each module
Thu May  7 08:28:44 2020: Image # 1, module Images # 1: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:44 2020: Image # 1, module Metadata # 2: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:44 2020: Image # 1, module NamesAndTypes # 3: CPU_time = 1.13 secs, Wall_time = 0.71 secs
Thu May  7 08:28:45 2020: Image # 1, module Groups # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs
/Users/bcimini/Documents/GitHub/CellProfiler/centrosome/centrosome/cpmorphology.py:4209: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  big_labels[[slice(fe,-fe) for fe in footprint_extent]] = labels
/Users/bcimini/Documents/GitHub/CellProfiler/centrosome/centrosome/cpmorphology.py:416: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  index_i, index_j, image = prepare_for_index_lookup(image, False)
/usr/local/lib/python2.7/site-packages/skimage/util/arraycrop.py:175: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  cropped = np.array(ar[slices], order=order, copy=True)
Thu May  7 08:28:45 2020: Image # 1, module IdentifyPrimaryObjects # 5: CPU_time = 0.33 secs, Wall_time = 0.33 secs
Thu May  7 08:28:45 2020: Image # 1, module MeasureObjectIntensity # 6: CPU_time = 0.11 secs, Wall_time = 0.11 secs
Thu May  7 08:28:45 2020: Image # 1, module ExportToSpreadsheet # 7: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:45 2020: Image # 1, module CreateBatchFiles # 8: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:45 2020: Image # 2, module Images # 1: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:45 2020: Image # 2, module Metadata # 2: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:45 2020: Image # 2, module NamesAndTypes # 3: CPU_time = 0.08 secs, Wall_time = 0.03 secs
Thu May  7 08:28:45 2020: Image # 2, module Groups # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:45 2020: Image # 2, module IdentifyPrimaryObjects # 5: CPU_time = 0.39 secs, Wall_time = 0.29 secs
Thu May  7 08:28:45 2020: Image # 2, module MeasureObjectIntensity # 6: CPU_time = 0.10 secs, Wall_time = 0.10 secs
Thu May  7 08:28:46 2020: Image # 2, module ExportToSpreadsheet # 7: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 2, module CreateBatchFiles # 8: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 3, module Images # 1: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 3, module Metadata # 2: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 3, module NamesAndTypes # 3: CPU_time = 0.04 secs, Wall_time = 0.02 secs
Thu May  7 08:28:46 2020: Image # 3, module Groups # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 3, module IdentifyPrimaryObjects # 5: CPU_time = 0.28 secs, Wall_time = 0.27 secs
Thu May  7 08:28:46 2020: Image # 3, module MeasureObjectIntensity # 6: CPU_time = 0.09 secs, Wall_time = 0.10 secs
Thu May  7 08:28:46 2020: Image # 3, module ExportToSpreadsheet # 7: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 3, module CreateBatchFiles # 8: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 4, module Images # 1: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 4, module Metadata # 2: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 4, module NamesAndTypes # 3: CPU_time = 0.04 secs, Wall_time = 0.02 secs
Thu May  7 08:28:46 2020: Image # 4, module Groups # 4: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 4, module IdentifyPrimaryObjects # 5: CPU_time = 0.26 secs, Wall_time = 0.27 secs
Thu May  7 08:28:46 2020: Image # 4, module MeasureObjectIntensity # 6: CPU_time = 0.10 secs, Wall_time = 0.10 secs
Thu May  7 08:28:46 2020: Image # 4, module ExportToSpreadsheet # 7: CPU_time = 0.00 secs, Wall_time = 0.00 secs
Thu May  7 08:28:46 2020: Image # 4, module CreateBatchFiles # 8: CPU_time = 0.00 secs, Wall_time = 0.00 secs

I also tested (not shown) both of those commands with the -f and -l flags; they also returned the expected results.

We are generating the h5 file on a Windows machine, but later we will need to run it on Linux. If we specify the Linux path, we get “Error: Encountered unrecoverable error in CreateBatchFiles during startup: None”. So we kept the Windows local path, but it seems to be embedded into the h5, and it fails on Linux when we run (test_batch.cppipe and Batch_data.h5 attached)

Yes, the major point of the CreateBatchFiles module is to update the paths from a local machine (say, on Windows) to a remote Linux cluster, so if you use the Windows paths in both places it will indeed not work. I don’t know why you’re getting the stated error; the version of the pipeline you sent just has a self-mapping Linux-to-Linux, so maybe a screenshot of the version you’re having issues with would help. I do notice your commands were missing a -o, just FYI.
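
As an aside, if you want to sanity-check which directories actually got baked into a Batch_data.h5 before moving it to the cluster, a rough h5py sketch like the one below can help. I’m not making any claims about CellProfiler’s internal HDF5 layout here; it just walks the file and prints datasets whose names suggest they hold a path or folder setting, so the output may need some interpretation.

# Rough, read-only sketch (assumes only that the file is valid HDF5; requires h5py).
# Prints any dataset whose HDF5 path mentions "path", "folder" or "directory",
# which is usually enough to spot a Windows path that was baked in by mistake.
import h5py
import numpy as np

def dump_path_like_entries(h5_filename):
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and any(
            key in name.lower() for key in ("path", "folder", "directory")
        ):
            try:
                value = obj[()]
            except Exception:
                return  # skip anything that cannot be read generically
            if isinstance(value, bytes):
                value = value.decode("utf-8", errors="replace")
            elif isinstance(value, np.ndarray) and value.dtype.kind in ("u", "i"):
                # some HDF5 files store text as raw integer arrays
                try:
                    value = bytes(bytearray(value.astype("uint8"))).decode("utf-8", errors="replace")
                except Exception:
                    pass
            print(name, "->", value)

    with h5py.File(h5_filename, "r") as f:
        f.visititems(visit)

dump_path_like_entries("Batch_data.h5")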

There’s nothing obviously wrong with your CSV that I can see, but again, without seeing how you had LoadData set up, there’s possibly something there I can’t debug.

We do have a recent video tutorial for running in batch, if you think that might help you debug a bit more easily; it does include a section on setting up CreateBatchFiles from a local Mac to a remote Linux machine.

Archive.zip (6.9 KB)


Thank you very much for all the help and testing, @bcimini.

The test_load_data.cppipe approach works for us too.

We had initially included the four starting input modules for the metadata extraction instead of the LoadData module because we read that LoadData is a legacy module.

However, we are trying to run everything in Galaxy, and automatically creating load_data.csv is tricky because the headers include the names of the images (for example, DNA).

We are trying to find a workaround to get those names.


We had initially included the four starting input modules for the metadata extraction instead of the LoadData module because we read that LoadData is a legacy module.

It is legacy, and our original plan was to deprecate it, but we realized it’s used by too many folks and has too many uses. That deprecation warning is going to be removed in CP4.

However, we are trying to run everything in Galaxy, and automatically creating load_data.csv is tricky because the headers include the names of the images (for example, DNA).
We are trying to find a workaround to get those names.

Can you explain more specifically what is tricky here? We do this pretty routinely, with a couple of different script “flavors”.

We couldn’t export the image set listing in the Galaxy environment, so we were trying to build this load_data.csv programmatically.
We noticed that the headers in that file contain names from the other modules, for example Frame_DNA, Channel_DNA, etc.
We found that the “DNA” part is user input from the IdentifyPrimaryObjects module, so we are not sure how, in the Galaxy environment, we can read this value and build the CSV headers…

Any advice and help are much appreciated.

Thank you!

The “DNA” part shouldn’t be from IdentifyPrimaryObjects; it’s based on the channel names from the NamesAndTypes module. At a minimum, your CSV has to contain PathName_{Channel} and FileName_{Channel} for every {Channel} that had previously been identified in NamesAndTypes; since your files are multipage, it will also need some additional info, such as Series_{Channel} or Frame_{Channel}, depending. Exactly how all this is used depends on the exact makeup of your files: do you have 1 file per channel, with each file having N sites in it? 1 file per site, with each file having M channels in it? 1 file per experiment, with each file having N * M sites and channels in it? Here’s a snapshot of a CSV from the last case; in this case, Frame refers to channel and Series refers to site, but exactly how to break yours down depends on exactly how your metadata is coded.

[screenshot of an example LoadData CSV, with per-channel FileName_, PathName_, Series_ and Frame_ columns]

Your best bet is to download one COMPLETE representative set of files, load it into the GUI, export a CSV based on CellProfiler’s breakdown of it, and then use that as a template; there are too many possible cases for me to advise you how to do it a priori.
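
To make the template idea concrete, a rough sketch of the kind of script you could end up with for your single-channel multipage case is below. The channel name “DNA”, the frame count, and the exact set of required columns are all assumptions on my part; check them against the CSV you export from the GUI before relying on anything like this.

# Rough sketch: build a minimal LoadData-style CSV for one single-channel
# multipage TIFF, treating every frame as an independent image.
# The column names (FileName_/PathName_/Series_/Frame_) mirror what a
# GUI-exported image set listing typically contains; verify against your export.
import csv
import os

def write_load_data_csv(tiff_path, n_frames, channel="DNA", out_csv="load_data.csv"):
    directory, filename = os.path.split(os.path.abspath(tiff_path))
    header = [
        "FileName_%s" % channel,
        "PathName_%s" % channel,
        "Series_%s" % channel,
        "Frame_%s" % channel,
    ]
    with open(out_csv, "w") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for frame in range(n_frames):
            # one row per frame; series stays 0 for a plain multipage TIFF
            writer.writerow([filename, directory, 0, frame])

# e.g. the four-frame merged_image.tif discussed earlier in this thread
write_load_data_csv("input/merged_image.tif", n_frames=4)

You would then pass the resulting CSV to the LoadData pipeline with --data-file, exactly as in the commands above.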
