How to retrieve list of images in CellProfiler 2.0.11429

Hi,

I’m running CellProfiler 2.0.11429 on a cluster. To create the jobs, I use the CreateBatchFiles module and run the pipeline in the GUI.
I then want to extract the list of images from the batch-pipeline-file (Batch_file.h5) using something like this:

import cellprofiler
import cellprofiler.pipeline
import cellprofiler.workspace
import cellprofiler.measurements
import cellprofiler.cpimage
pipeline = cellprofiler.pipeline.Pipeline()
pipeline.load('Batch_file.h5')
image_set_list = cellprofiler.cpimage.ImageSetList()
measurements = cellprofiler.measurements.Measurements()
workspace = cellprofiler.workspace.Workspace(pipeline, None, None, None, measurements, image_set_list)
pipeline.prepare_run(workspace) # returns True
grouping_keys, groups = pipeline.get_groupings(workspace)
pipeline.prepare_group(grouping_argument, groups[0][0], groups[0][1]) # I would guess that something here is wrong??
num_sets = image_set_list.count()

num_sets gives 0

If I run this on a normal (non-batch) pipeline file, it works. Can you tell me what I’m doing wrong?

Apart from this I realized that the version value in the pipeline.py module is wrong (it is set to “Revision: 11424 ”). I would like to use this to distinguish between old (11047) and newer CellProfiler versions (the above code won’t work for 11047). Does this make sense? Or is the version value deprecated? At the moment I use argspec inspection which I think is not very nice.

Another point: I’m creating a batch file for about 300,000 images. This takes quite a lot of time (~6 hours). When I looked into the code, I realized that CreateBatchFiles fills an hdf5_dict with dummy measurement values for all 300,000 images when it creates the batch file. Is this necessary, or could it be dropped? Dropping it would speed up the whole process a lot.

Thanks a lot.

Cheers,
Benni

Your code looks like it should work. Maybe you could post a Batch_Data.h5 file for a small image set so I can see what’s wrong.

We’ve moved the repository to Git (github.com/CellProfiler/CellProfiler) and we are switching to a timestamp version to replace the SVN revision. There is a new module, “version.py”, that will be used to match a pipeline version against the version that was used to run it. But for 11429 and earlier, you can find the SVN revision in the pipeline header for .cp files:
CellProfiler Pipeline: cellprofiler.org
Version:1
SVNRevision:10609

or if you have loaded Batch_Data.h5 produced by CreateBatchFiles, you can use the following code (from NewBatch.py):

        for module in pipeline.modules():
            if isinstance(module,CreateBatchFiles):
                svn_revision = module.revision.value
                break

To get the current revision of CellProfiler (or at least the revision at the time of the last edit to one of the modules in cellprofiler/modules) using svn 11429, you can call cellprofiler.utilities.get_revision.get_revision().
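
For the .cp header case, a minimal sketch of reading the revision without importing CellProfiler at all. It assumes the pipeline is a plain-text .cp file containing an "SVNRevision:<number>" line as in the header shown above; the function name read_svn_revision is made up for illustration and is not part of CellProfiler.

```python
import re


def read_svn_revision(pipeline_text):
    """Return the SVNRevision value from .cp pipeline text as an int.

    Returns None if no "SVNRevision:<digits>" line is found.
    Hypothetical helper, not a CellProfiler API.
    """
    for line in pipeline_text.splitlines():
        match = re.match(r"SVNRevision:(\d+)\s*$", line.strip())
        if match:
            return int(match.group(1))
    return None


# Header text copied from the example above.
header = (
    "CellProfiler Pipeline: cellprofiler.org\n"
    "Version:1\n"
    "SVNRevision:10609\n"
)
revision = read_svn_revision(header)
is_old_version = revision is not None and revision <= 11047
```

A script could then branch on the revision number (for example 11047 versus 11429) instead of inspecting argspecs.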

If you have a CreateBatchFiles module, the pipeline will try to make the Batch_Data.h5 file (which has a pipeline in it whose CreateBatchFiles module has a switch set to tell it not to make Batch_Data.h5). A lot of people seem to go down this path of using CreateBatchFiles with a script to extract the results, and often there is a simpler solution. We have the LoadData module, which takes a .CSV file (see the manual or help for details). Each row of the .CSV file has the file name and, optionally, the path of each image file for one image set. For your case, you may want to make this .CSV programmatically. After that, you can run a pipeline to process only a part of the .CSV. A typical .CSV format for two images per image set, with channels “PI” and “GFP”, might be:

    "FileName_PI","PathName_PI","FileName_GFP","PathName_GFP","Treatment","Plate","Well","Site"
    "P-12345_A01_s1_w1_PI.tif","/images","P-12345_A01_s1_w2_GFP.tif","/images","DMSO","P-12345","A01","s1"
    "P-12345_A01_s2_w1_PI.tif","/images","P-12345_A01_s2_w2_GFP.tif","/images","DMSO","P-12345","A01","s2"
    "P-12345_A02_s1_w1_PI.tif","/images","P-12345_A02_s1_w2_GFP.tif","/images","Amoxycillin","P-12345","A02","s1"
    "P-12345_A02_s2_w1_PI.tif","/images","P-12345_A02_s2_w2_GFP.tif","/images","Amoxycillin","P-12345","A02","s2"

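One way such a .CSV could be generated, as a rough sketch using only Python's standard csv module. It assumes a standalone Python 3 script (separate from CellProfiler's own interpreter) and the column layout shown above; the function name write_loaddata_csv and the dictionary layout of each image set are made up for illustration.

```python
import csv
import io

CHANNELS = ("PI", "GFP")
METADATA = ("Treatment", "Plate", "Well", "Site")


def write_loaddata_csv(out, image_sets):
    """Write a header plus one fully quoted row per image set to `out`.

    Each image set is a dict with "files" (channel name -> file name),
    "path" (shared folder for all channels), and one key per METADATA column.
    Hypothetical helper, not a CellProfiler API.
    """
    header = []
    for channel in CHANNELS:
        header += ["FileName_" + channel, "PathName_" + channel]
    header += list(METADATA)

    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    for image_set in image_sets:
        row = []
        for channel in CHANNELS:
            row += [image_set["files"][channel], image_set["path"]]
        row += [image_set[name] for name in METADATA]
        writer.writerow(row)


example_sets = [{
    "files": {"PI": "P-12345_A01_s1_w1_PI.tif",
              "GFP": "P-12345_A01_s1_w2_GFP.tif"},
    "path": "/images",
    "Treatment": "DMSO", "Plate": "P-12345", "Well": "A01", "Site": "s1",
}]
buffer = io.StringIO()
write_loaddata_csv(buffer, example_sets)
```

To write to disk instead of a buffer, open the target with open(name, "w", newline="") and pass the file object as `out`.
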
There are several ways to run your experiment in batches without CreateBatchFiles:

  • Tell CellProfiler to run only a range of image sets per batch: cellprofiler -f 1 -l 10 -p <pipeline> -i <input-directory> -o <output-directory> runs image sets 1-10 from your 300,000 image set .CSV. This has the advantage of properly numbering the image sets so that ExportToDatabase can create rows with correct image numbers, but has the disadvantage that this large file must be parsed on startup by each instance and, as you say, written to the .h5.

  • Break your .CSV into parts and store each of them with the same name in different output directories. Specify the output directory on the command line. The difficulty here is reworking the image numbers in the database - you may have to create a script to fix them up. An alternative is to have your script add a batch number to the image table and object table(s) so that the combination of ImageNumber and batch number uniquely identifies an image set.

  • Similar to the above, create different .CSVs and edit your pipeline (.cp) file programmatically to change the file name:

  CellProfiler Pipeline: http://www.cellprofiler.org
  Version:2
  SVNRevision:11304

  LoadData:[module_num:1|svn_version:\'Unknown\'|variable_revision_number:6|show_window:True|notes:\x5B\'Load a .csv file containing additional image metadata. This file contains the file and path names of the illumination correction functions that are to be load and used, as well as dosages applied and identity of the controls.\'\x5D|batch_state:array(\x5B\x5D, dtype=uint8)]
      Input data file location:Default Input Folder\x7C.
      Name of the file:1049_FilenamesAndMetadata_short.csv
      Load images based on this data?:Yes
      Base image location:Default Input Folder\x7C.
      Process just a range of rows?:No
      Rows to process:1,12
      Group images by metadata?:No
      Select metadata fields for grouping:
      Rescale intensities?:Yes
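
A rough sketch of two pieces of the options above: computing the first/last image set numbers for the -f and -l flags, and rewriting the "Name of the file:" line of a .cp pipeline. Both function names are made up for illustration. The sketch assumes the pipeline has a single LoadData module and that the new file name is a plain name needing no escaping (the .cp format escapes some characters, e.g. "|" as \x7C).

```python
def batch_ranges(n_image_sets, batch_size):
    """Return 1-based, inclusive (first, last) pairs covering all image sets."""
    return [(first, min(first + batch_size - 1, n_image_sets))
            for first in range(1, n_image_sets + 1, batch_size)]


def set_loaddata_csv_name(pipeline_text, new_name):
    """Return pipeline text with the 'Name of the file:' value replaced.

    Indentation and line endings are preserved. Every line that starts
    with the setting label is rewritten, hence the single-LoadData
    assumption. Hypothetical helper, not a CellProfiler API.
    """
    label = "Name of the file:"
    rewritten = []
    for line in pipeline_text.splitlines(True):  # keep line endings
        stripped = line.lstrip()
        if stripped.startswith(label):
            indent = line[:len(line) - len(stripped)]
            ending = line[len(line.rstrip("\r\n")):]
            line = indent + label + new_name + ending
        rewritten.append(line)
    return "".join(rewritten)


# Image set ranges for 25 image sets in batches of 10.
ranges = batch_ranges(25, 10)

# Point a pipeline fragment at a different .CSV part.
fragment = (
    "    Input data file location:Default Input Folder\\x7C.\n"
    "    Name of the file:1049_FilenamesAndMetadata_short.csv\n"
    "    Load images based on this data?:Yes\n"
)
edited = set_loaddata_csv_name(fragment, "part_001.csv")
```

Each (first, last) pair can be substituted into the cellprofiler -f <first> -l <last> command line from the first option, one job per pair.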

I think you’re right about the cost of writing all 300,000 image sets to the .h5. The latest code base has a fix to make this faster, but still, you’re paying the cost many more times than you really have to and perhaps that’s something we have to address.

Hi,

thanks for this. I tried it with

measurements = cellprofiler.measurements.load_measurements('Batch_data.h5')
measurements.get_image_numbers()
as Ray suggested on the mailing list. That seems to work, though I’m not sure how to retrieve the grouping information in this way, but for now I don’t need that anyway.

I’m actually already loading the data from a csv file. Maybe splitting it up into multiple parts would have been the easier way, though less clean :smile: I’ll fall back to this if I can’t get it to work with CreateBatchFiles.

Cheers for the detailed answer!
Benni