Theoretical questions

cellprofiler-analyst

#1

Hi guys,

I’m using CP and CPA to differentiate between dead and alive cells. I’m really happy with the way your software works. I’m using the Classifier in CPA to train the machine to automatically fill the two/three classes I have… but I have a question about the right way to use the CPA Classifier; or rather, I’m not sure I’m using it in the right way.

I understand that I can train the machine to recognise my cells as dead or alive (and it works very well). At the moment I’m doing this separately on the images from every single sample in my experiment, because staining issues and sample variability keep me from building one general rule to identify dead/alive cells. I would have thought the best approach is a general training set applied to every sample, so I’m not sure I’m doing this the right way.

Another question relates to the theoretical principles behind the way machine learning algorithms are applied in quantitative microscopy. I know that if you have a dataset you want to classify, you usually divide it into a fraction used to train the machine (usually above 60%), while the remaining part of the data is used to validate the rules the machine learned. Here we are dealing with a different approach, because we present only a few examples relative to the total number of images or objects we want to recognise. I know I should really ask this kind of question of someone in my lab/institution, but it’s actually hard for me to find anyone working on this topic… (I’m not in the medical field.)

Anyway… thank you very much for this useful software.


#2

Hi,

Someone else may answer this better. In the meantime, here’s my humble opinion on your two questions:

  • If you have sample variability (a batch effect) and you still want to use machine learning (ML), for example when the number of samples is large (> 100), you may have to find a way to normalize all the samples first, e.g. with CorrectIlluminationCalculate and CorrectIlluminationApply across all the images.
    Otherwise, yes, it’s better to train a rule per sample.
    Batch effects are certainly a huge pain in ML.

  • For the 2nd question: given access to a full set of original data, we often split it into, say, 80% for training and 20% for validation. Both parts have to be annotated, i.e. carry a ground truth saying which cells are dead and which are alive. Training establishes a set of rules for classification, and that set of rules is then used to predict new, un-annotated data.
    (I think) in CPA, when you hand-pick a few cells, e.g. 20 dead + 20 alive cells, you are in fact doing the annotation, and those 40 cells will likewise be split, say 32 cells for training and 8 for validation. It may also involve augmentation and/or k-fold cross-validation under the hood to improve training accuracy.
    After this training, the established set of rules is used to predict the rest of the unseen data (generalization).
    In fact, the more objects you annotate in the starting batch, the more accurate the generalization will be. (A toy sketch of this split/cross-validation idea follows this list.)
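
Purely as an illustration of the general workflow described above (this is generic scikit-learn, not what CPA actually does under the hood; `features`, `labels` and `sample_ids` are made-up stand-ins for per-object measurements, dead/alive annotations, and the sample each object came from):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))       # 200 objects x 5 made-up measurements
labels = rng.integers(0, 2, size=200)      # 0 = alive, 1 = dead (toy annotations)
sample_ids = rng.integers(0, 4, size=200)  # which sample each object came from

# Crude batch-effect softening: z-score each measurement within its own sample,
# so per-sample staining differences do not dominate what the classifier learns.
for s in np.unique(sample_ids):
    mask = sample_ids == s
    mu = features[mask].mean(axis=0)
    sigma = features[mask].std(axis=0)
    features[mask] = (features[mask] - mu) / (sigma + 1e-9)

# Classic 80/20 split: learn rules on one part of the annotated objects,
# then check them on the held-out part.
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_val, y_val))

# 5-fold cross-validation: reuse a small annotated set more efficiently by
# rotating which fifth of it plays the role of the validation set.
print("5-fold CV accuracy:", cross_val_score(clf, features, labels, cv=5).mean())
```

With random toy labels the accuracies will hover around chance; the point is only the shape of the workflow: normalize per sample, split, train, validate.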

Hope that helps.


#3

Hi Minh,

Thank you very much for your kind reply. I think it’s very clear what you mean:

  • I already subtract the background from my images to highlight the foreground, and in my case (sparse objects) it actually works much better than CorrectIllumination (a minimal sketch of this kind of background subtraction follows this list). But I still see small differences from one sample to another, probably due to experimental variation in the staining procedure. So I suppose I have to stick to a per-sample approach…

  • Now I understand the philosophy! Thank you!
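
For anyone reading along, here is a minimal sketch of this kind of background subtraction, assuming a morphological white top-hat; both the method and the 15-pixel radius are placeholders (the radius just has to be larger than the objects you want to keep):

```python
import numpy as np
from skimage.morphology import disk, white_tophat

rng = np.random.default_rng(0)
image = rng.random((256, 256))  # stand-in for a real fluorescence image

# A white top-hat keeps only bright structures smaller than the structuring
# element: the uneven background is wider than any cell, so it is removed,
# while small bright objects (the sparse cells) survive.
foreground = white_tophat(image, disk(15))
```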

Is there any way to learn more about this kind of topic? Do you know if there are any courses online? I know there are very good online courses on ML, but I wonder if there are also some on quantitative microscopy and machine learning.

Thank you for your time and help.