Your code looks like it should work. Maybe you could post a Batch_Data.h5 file for a small image set so I can see what’s wrong.
We’ve moved the repository to Git (github.com/CellProfiler/CellProfiler) and we are switching to a timestamp-based version number to replace the SVN revision. There is a new module, “version.py”, that will be used to match a pipeline’s version against the version of CellProfiler that ran it. But for 11429 and earlier revisions, you can find the SVN revision in the pipeline header of .cp files:
CellProfiler Pipeline: cellprofiler.org
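For reference, a complete header from a .cp file saved by an SVN-era build looks roughly like this (the revision number here is just an illustration):

```
CellProfiler Pipeline: http://www.cellprofiler.org
Version:1
SVNRevision:11429
```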
or if you have loaded Batch_Data.h5 produced by CreateBatchFiles, you can use the following code (from NewBatch.py):
for module in pipeline.modules():
    svn_revision = module.revision.value
To get the current revision of CellProfiler (or at least the revision at the time of the last edit to one of the modules in cellprofiler/modules) under SVN revision 11429, you can call cellprofiler.utilities.get_revision.get_revision().
If you have a CreateBatchFiles module, the pipeline will try to make the Batch_Data.h5 file (which contains a pipeline whose CreateBatchFiles module has a switch set telling it not to make Batch_Data.h5 again). A lot of people go down this path of using CreateBatchFiles with a script to extract the results, and often there is a simpler solution. We have the LoadData module, which takes a .CSV file (see the manual or module help for details). Each row of the .CSV gives the file name, and optionally the path, of the image files for one image set. In your case, you may want to generate this .CSV programmatically; you can then run a pipeline that processes only part of the .CSV. A typical .CSV for two images per image set, with channels “PI” and “GFP”, has one file-name column (and optionally a path column) per channel.
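As a sketch, here is one way to generate such a .CSV with Python’s standard csv module. The Image_FileName_*/Image_PathName_* column names follow the convention described in the LoadData help, and the file names and paths here are invented for illustration:

```python
import csv

# Hypothetical image sets: one PI and one GFP image per set.
image_sets = [
    ("set01_PI.tif", "set01_GFP.tif"),
    ("set02_PI.tif", "set02_GFP.tif"),
]

with open("image_sets.csv", "w", newline="") as fd:
    writer = csv.writer(fd)
    # One FileName/PathName column pair per channel.
    writer.writerow(["Image_FileName_PI", "Image_PathName_PI",
                     "Image_FileName_GFP", "Image_PathName_GFP"])
    for pi_file, gfp_file in image_sets:
        writer.writerow([pi_file, "/images/batch1",
                         gfp_file, "/images/batch1"])
```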
There are several ways to run your experiment in batches without CreateBatchFiles:
Tell CellProfiler to run only a range of image sets per batch:
cellprofiler -f 1 -l 10 -p <pipeline> -i <input-directory> -o <output-directory> runs image sets 1-10 from your 300,000-image-set .CSV. This has the advantage of properly numbering the image sets, so that ExportToDatabase can create rows with correct image numbers, but the disadvantage that this large file must be parsed at startup by each instance and, as you say, written to the .h5.
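A quick sketch of generating one such command line per batch, mirroring the flags above; the pipeline, input, and output paths are placeholders:

```python
# Sketch: split n_image_sets into batches of batch_size and build one
# CellProfiler command line per batch using -f/-l ranges.
def batch_commands(n_image_sets, batch_size):
    commands = []
    for first in range(1, n_image_sets + 1, batch_size):
        last = min(first + batch_size - 1, n_image_sets)
        commands.append(
            "cellprofiler -f %d -l %d -p pipeline.cp "
            "-i /images -o /output/batch_%06d" % (first, last, first))
    return commands

for cmd in batch_commands(35, 10):
    print(cmd)
```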
Break your .CSV into parts and store each of them with the same name in different output directories, specifying the output directory on the command line. The difficulty here is reworking the image numbers in the database - you may have to create a script to fix them up. An alternative is to have your script add a batch number to the image table and object table(s), so that the combination of ImageNumber and batch number uniquely identifies an image set.
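As an illustration of that alternative, here is a minimal sketch using Python’s built-in sqlite3; the table and column names are invented, not the ones ExportToDatabase actually writes:

```python
import sqlite3

# Sketch: tag the rows produced by one batch with a batch number so
# that (Batch, ImageNumber) is unique across batches.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Per_Image (ImageNumber INTEGER, FileName TEXT)")
con.executemany("INSERT INTO Per_Image VALUES (?, ?)",
                [(1, "a.tif"), (2, "b.tif")])

batch_number = 7
con.execute("ALTER TABLE Per_Image ADD COLUMN Batch INTEGER")
con.execute("UPDATE Per_Image SET Batch = ?", (batch_number,))
con.commit()

rows = con.execute(
    "SELECT Batch, ImageNumber FROM Per_Image ORDER BY ImageNumber"
).fetchall()
print(rows)
```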
Similar to the above: create different .CSVs and edit your pipeline (.cp) file programmatically to change the file name:
CellProfiler Pipeline: http://www.cellprofiler.org
LoadData:[module_num:1|svn_version:\'Unknown\'|variable_revision_number:6|show_window:True|notes:\x5B\'Load a .csv file containing additional image metadata. This file contains the file and path names of the illumination correction functions that are to be load and used, as well as dosages applied and identity of the controls.\'\x5D|batch_state:array(\x5B\x5D, dtype=uint8)]
Input data file location:Default Input Folder\x7C.
Name of the file:1049_FilenamesAndMetadata_short.csv
Load images based on this data?:Yes
Base image location:Default Input Folder\x7C.
Process just a range of rows?:No
Rows to process:1,12
Group images by metadata?:No
Select metadata fields for grouping:
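One way to make that per-batch edit, sketched with plain string handling; the setting text matches the LoadData snippet above, and the replacement file name is a placeholder:

```python
# Sketch: point a LoadData module at a different .csv by rewriting the
# "Name of the file" setting line in the .cp text.
def set_csv_name(pipeline_text, new_csv):
    lines = []
    for line in pipeline_text.splitlines():
        if line.strip().startswith("Name of the file:"):
            # Keep the indentation and setting name, swap the value.
            prefix = line[:line.index(":") + 1]
            line = prefix + new_csv
        lines.append(line)
    return "\n".join(lines)

original = """LoadData:[module_num:1]
    Name of the file:1049_FilenamesAndMetadata_short.csv
    Load images based on this data?:Yes"""
edited = set_csv_name(original, "batch_0001.csv")
print(edited)
```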
I think you’re right about the cost of writing all 300,000 image sets to the .h5. The latest code base has a fix that makes this faster, but you’re still paying that cost many more times than you really have to, and that’s probably something we need to address.