I am currently attempting to run a fairly basic pipeline in CellProfiler.
I need to process ~50,000 files.
I am extracting metadata from the file headers, and it seems I need to do this before running the analysis.
At the moment the prediction is that this will take 12 hours, yet PC resource usage is very low.
I am wondering if there is a better way to do this, as this is likely to become a standard workflow and this is obviously super inefficient.
I am extracting metadata from the file headers, and it seems I need to do this before running the analysis. […]
I am wondering if there is a better way to do this.
Can you explain more about what you’re doing to see if there’s a way around it? Do you have multiple fields in one large file, or is it a simpler setup, i.e. could we load these as color images and then split the channels out later? It’s hard to say what to improve without knowing what your source files are, what you’re currently doing, and what your end goal is.
I’m not sure there’s a code way to speed this up; reading the file header is one of the most complicated things because CellProfiler actually has to call from Python out to Java to do it.
Also, are you planning to run all 50K on one machine rather than some sort of cluster?
I am analysing a high content screen performed on a PE Opera Phenix.
There are 7 images per field; these have been stitched into a single TIFF, and ilastik has been used to generate probability images for segmentation, resulting in 2 files per field.
Objects are then generated from the probability files, and the relevant intensities are measured from the images.
All files will be run on one machine.
The extraction is really just getting the channel count…
Would it work if I just added this to the filename?
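It should, as long as the Metadata module’s “Extract from file/folder names” regex can find it. As a sketch, assuming a hypothetical naming scheme like `B07_f0012_ch7.tiff` (well, field, channel count — not the Phenix default names):

```python
import re

# Hypothetical naming scheme: well, field, and channel count encoded in
# the filename so the Metadata module's "Extract from file/folder names"
# regex can pick them up without touching the file headers.
pattern = re.compile(
    r"^(?P<Well>[A-Z]\d{2})_f(?P<Field>\d+)_ch(?P<ChannelCount>\d+)\.tiff?$"
)

m = pattern.match("B07_f0012_ch7.tiff")
print(m.group("Well"), m.group("Field"), m.group("ChannelCount"))  # prints: B07 0012 7
```

The same named-group regex (everything inside the quotes) can be pasted into the Metadata module, which is essentially free compared to opening 50K headers.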
It should work just to load your probability images (and your original images, too, if you have multiple channels) as “ColorImage” type in NamesAndTypes, then use ColorToGray as the first 1-2 modules in your downstream pipeline to split them out into their component channels.
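Numerically, the “Split” option boils down to slicing one multi-channel array into per-channel grayscale planes — a minimal NumPy sketch, with illustrative shapes rather than your actual Phenix dimensions:

```python
import numpy as np

# What splitting a color image amounts to: one array loaded as a color
# image becomes one grayscale image per channel. Shapes are illustrative.
color = np.random.rand(512, 512, 7)                      # H x W x 7 channels
channels = [color[:, :, i] for i in range(color.shape[-1])]
print(len(channels), channels[0].shape)  # prints: 7 (512, 512)
```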
50K images on a single workstation can be done, but depending on your workstation’s specs vs file size, number of features generated, etc., it can be risky WRT something locking up and crashing on you. At the very minimum, I’d make sure to use ExportToDatabase (SQLite mode is easiest if you don’t have a MySQL host set up) rather than ExportToSpreadsheet, so that if it does crash you don’t lose everything you’ve done up to the crash point.
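One nice side effect: after a crash you can query the partially written SQLite file to see how far the run got. A sketch, assuming the default “MyExpt” table prefix (adjust to whatever your ExportToDatabase settings use); the demo database here is a throwaway stand-in for the real `.db` file:

```python
import os
import sqlite3
import tempfile

# Sketch: after a crash mid-run, count how many image sets actually made
# it into the ExportToDatabase SQLite file. The table name assumes the
# default "MyExpt" experiment prefix -- adjust to your pipeline's setting.
def images_written(db_path, table="MyExpt_Per_Image"):
    with sqlite3.connect(db_path) as con:
        (n,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n

# Demo against a throwaway database standing in for the real .db file.
db = os.path.join(tempfile.mkdtemp(), "DefaultDB.db")
con = sqlite3.connect(db)
con.execute("CREATE TABLE MyExpt_Per_Image (ImageNumber INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO MyExpt_Per_Image VALUES (?)", [(i,) for i in range(1, 4)])
con.commit()
con.close()
print(images_written(db))  # prints 3
```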
OK, I’ll give that a stab; after the 12-hour extraction it didn’t add the channel data, and it crashed when I tried to test.
I was going to try the legacy LoadImages module.
The other option I was thinking of was to generate a text file of metadata using ImageJ and load that.
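That kind of CSV could also be generated outside ImageJ. A hedged sketch of writing a per-file metadata CSV that the Metadata module could then import; the `channel_count_fn` reader is injectable because the right way to count channels depends on how your files are written (with tifffile, for example, `lambda p: len(tifffile.TiffFile(p).pages)` — assuming one page per channel, which you’d want to verify against your exports):

```python
import csv
import os
import tempfile

def write_metadata_csv(tiff_paths, out_csv, channel_count_fn):
    """Write a per-file metadata CSV for import into the Metadata module.

    channel_count_fn(path) -> int is whatever header reader you prefer;
    with tifffile, for example:  lambda p: len(tifffile.TiffFile(p).pages)
    (that assumes one page per channel -- verify against your exports).
    """
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["FileName", "ChannelCount"])
        for path in tiff_paths:
            writer.writerow([os.path.basename(path), channel_count_fn(path)])

# Demo with stand-in paths and a fixed channel count.
out = os.path.join(tempfile.mkdtemp(), "metadata.csv")
write_metadata_csv(["/data/f001.tiff", "/data/f002.tiff"], out, lambda p: 7)
print(open(out).read())
```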
PC specs are 24 cores and 128 GB RAM.
Splitting channels has done a great job on the images, but fails on the probability files with an index error.
My guess is that the channels are being interpreted as time points (channels are seen as time when “extract metadata from file headers” is used).
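One way to check that guess is to look at the axis labels the TIFF library reports for a probability file. A small sketch, assuming tifffile is installed; it writes a tiny ImageJ-style two-channel stack as a stand-in, then reads the axes back — the same read you could run on one of the stitched probability files to see whether the planes come back as channels (`C`) or time (`T`):

```python
import numpy as np
import tifffile

# Write a tiny 2-channel ImageJ-style stack (a stand-in for a stitched
# probability file), then read its axis labels back to see whether the
# stacked planes are reported as channels (C) or time (T).
data = np.zeros((2, 8, 8), dtype=np.uint8)
tifffile.imwrite("demo_stack.tiff", data, imagej=True, metadata={"axes": "CYX"})

with tifffile.TiffFile("demo_stack.tiff") as tf:
    axes = tf.series[0].axes
    shape = tf.series[0].shape
print(axes, shape)
```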
Could you provide your pipeline (or at least the first few modules of your pipeline) and a single set of images, just so it’s possible to test?