Discrepancy in image count objects

Dear CellProfiler developers,

After careful manual inspection, I believe I have encountered a bug/error with possibly large consequences.

In my pipeline I do (among other things) the following:

  1. I create a binary image of my original image
  2. I feed this image to the “IdentifyPrimaryObjects” module using the “manual” thresholding method, with threshold set to 0.1
  3. I save the objects myobject as png image
  4. I save the data in hdf5.

In the hdf5 file, under foo/Image/Count_myobject, every now and then the number of myobject objects is far lower than expected, or zero, while this is not the case in the original image or in the saved myobject png images.

I use the latest stable 2.1 CP release on Windows 7.
The hdf5 files are too large to upload here (several GB per hdf5 file). The pipeline uses some home-made modules, which I could send after getting permission from the license holder. However, the output of these home-made modules is a binary image, so I don’t think the problem lies there.

I attached one of the pipelines where this is happening, along with the corresponding images. The image where the count is suddenly low is D13_2, the 11th image/time point, channel 1: 20130922 tp53 tiffsxy112c1t11.tif

My concern is that the objects are lost, along with all derived objects (e.g. secondary objects seeded from these primary objects) and the corresponding measurements. I also have examples where the image object count was zero, but nothing seemed wrong with the primary object images or the original images.

Best regards,
Steven Wink
images and cell_counts txt file.zip (536 KB)
2013_09_22_TP53.cpproj (88.8 KB)

After some more digging I can safely say the objects are not lost within the hdf5 object paths: the individual objects are there, along with the corresponding measurements. It seems the fault is only in the foo/Image/Count_object hdf5 path.
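
One way to confirm this kind of discrepancy is to recount the objects per image from the per-object image numbers and compare the result with the stored count. Below is a minimal sketch, assuming both have already been read from the hdf5 file into plain Python containers; the function name and the sample values are hypothetical and not taken from the dataset.

```python
from collections import Counter

def find_count_mismatches(stored_counts, object_image_numbers):
    """Return {image_number: (stored, recounted)} for images where the two disagree.

    stored_counts: dict of image number -> value read from Image/Count_<object>
    object_image_numbers: one image number per object, read from <object>/ImageNumber
    """
    recounted = Counter(object_image_numbers)
    mismatches = {}
    for image_number in sorted(set(stored_counts) | set(recounted)):
        stored = stored_counts.get(image_number, 0)
        actual = recounted.get(image_number, 0)
        if stored != actual:
            mismatches[image_number] = (stored, actual)
    return mismatches

# Hypothetical example: image 2 reports 0 objects, but 3 objects carry image number 2.
stored = {1: 2, 2: 0, 3: 1}
per_object = [1, 1, 2, 2, 2, 3]
print(find_count_mismatches(stored, per_object))  # {2: (0, 3)}
```

An empty result would mean the stored counts agree with the per-object data for every image.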

To replicate this issue, I really would need the custom modules. Alternatively, would you be able to save and post the binary images made by the custom modules (i.e., nc_segmented) so I can modify your project to load them in and use them as input into IdentifyPrimary?
-Mark

Hi Mark,

I tried to save the binary images but saving as “binary” gave a java error (object “bit” not known). I will talk to the license holder if you still think you need it after the following:

So I did a run on the relevant images, and I think I might have found something. The Image/Metadata annotation seems to be shifted. In the attached image you can see 4 hdf5 windows. From left to right: original data set “metadata_Well”, original data set “foo/Image/Count_Nuclei”, re-run D13_2 “metadata_Well”, re-run D13_2 “foo/Image/Count_Nuclei”.

I will re-run the entire data set to check if this is reproducible.

This made me re-think the “bug” - it might be that everything is shifted in (some/all?) of my datasets and I only noticed it in cases where I checked very low cell counts - I will have to verify this.

best regards
Steven


Hi Mark,

It now seems to me that I cannot rely on the row index of the hdf5 data rows and have to use the index tables instead, as the row order sometimes suddenly differs between data sets.
Is the multi-core threading causing this? I never noticed this before.
In that case, sorry about labeling this as a “bug”!

Could you please check an assumption of mine:
When reading hdf5 data that belongs to an object defined by the user in CellProfiler (so not the Image object but e.g. “Nuclei” or “Cytosol”), the index tables are not needed, since I can use foo/object/imageNumber to link an object to the correct image.
However, when reading object-related data inside foo/Image/… (e.g. foo/Image/Count_Nuclei), I do need to use the index tables.

It would feel safer to always use the index tables, but this is going to slow down the whole process a lot, since each image index has a range of objects which will have to be looped through.

Best regards,

Steven

That is correct; according to the h5py documentation (docs.h5py.org/en/latest/high/dataset.html):

An HDF5 dataset created with the default settings will be contiguous; in other words, laid out on disk in traditional C order. Datasets may also be created using HDF5’s chunked storage layout. This means the dataset is divided up into regularly-sized pieces which are stored haphazardly on disk, and indexed using a B-tree.

CellProfiler does indeed use a chunked storage format.
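
Reading through the index tables need not mean looping over every object. Below is a minimal sketch, assuming each feature consists of a data array plus an index table whose rows are (image number, start, stop) offsets into that array. That row layout is an assumption and should be checked against the actual file; the function names and sample values are made up for illustration.

```python
def values_by_image(index_rows, data):
    """Map image number -> list of values, using (image_number, start, stop) rows."""
    return {int(img): list(data[int(start):int(stop)])
            for img, start, stop in index_rows}

def join_image_features(index_a, data_a, index_b, data_b):
    """Pair two image-level features by image number instead of by row position."""
    a = values_by_image(index_a, data_a)
    b = values_by_image(index_b, data_b)
    return {img: (a[img], b[img]) for img in sorted(a.keys() & b.keys())}

# Hypothetical data: the two features were written in a different row order.
well_index = [(1, 0, 1), (2, 1, 2), (3, 2, 3)]
well_data = ["D12", "D13", "D14"]
count_index = [(2, 0, 1), (1, 1, 2), (3, 2, 3)]
count_data = [40, 57, 61]

joined = join_image_features(well_index, well_data, count_index, count_data)
print(joined[2])  # (['D13'], [40])
```

Because each index row already gives the slice for one image, the per-image lookup is a single slice rather than a scan over all objects, and pairing by image number stays correct even when the row order differs between features.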

HDF5 is intended as an optimized format for data storage and management, not necessarily for analysis and exploration. Is there any reason you’ve chosen not to use spreadsheets or the like for viewing your data?
-Mark

We create massive amounts of single-cell data, and are interested in single-cell tracking, population dynamics, etc.
I want chunks of the data loaded in R. For now I use HDF5 as a makeshift “database”, which works out OK-ish using the HDF5 interface package rhdf5. Once I have some more time available (basically after writing my defense) I will set up a proper MySQL database and use your ExportToDatabase module.

Could you use ExportToDatabase to write to an SQLite database (a locally stored, single-file SQL database that needs no server) and then use RSQLite to access it?
-Mark
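
As a rough illustration of loading one chunk of the data from such a database, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory stand-in for the exported file. The table names, column names, and values are hypothetical and would need to be replaced with whatever the export actually produces; the same query could be issued from R through RSQLite.

```python
import sqlite3

# In-memory stand-in for the exported database file. All table and column
# names here are hypothetical placeholders, not the guaranteed export schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Per_Image (ImageNumber INTEGER PRIMARY KEY, Image_Metadata_Well TEXT);
    CREATE TABLE Per_Object (ImageNumber INTEGER, ObjectNumber INTEGER, Nuclei_AreaShape_Area REAL);
    INSERT INTO Per_Image VALUES (1, 'D12'), (2, 'D13');
    INSERT INTO Per_Object VALUES (1, 1, 110.0), (2, 1, 95.0), (2, 2, 130.0);
""")

def objects_for_well(connection, well):
    """Return all per-object rows belonging to images of the given well."""
    query = """
        SELECT o.ImageNumber, o.ObjectNumber, o.Nuclei_AreaShape_Area
        FROM Per_Object AS o
        JOIN Per_Image AS i ON i.ImageNumber = o.ImageNumber
        WHERE i.Image_Metadata_Well = ?
        ORDER BY o.ImageNumber, o.ObjectNumber
    """
    return connection.execute(query, (well,)).fetchall()

rows = objects_for_well(conn, "D13")
print(rows)  # [(2, 1, 95.0), (2, 2, 130.0)]
conn.close()
```

Here objects are linked to their image by an explicit image number key, so the result does not depend on the physical row order in either table.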