 # Normalizing Population Has 0 Standard Deviation

Hello all,

I am new to cell image processing. I have a dataset consisting of plates of images. I used an existing ML-based model to extract image features, and then found batch effects among those features.

After reading a very nice paper, I decided to try Relative Normalization to remove batch effects. Fortunately, there are untreated control cells in each plate, so I used them as my normalizing population and computed their feature mean and standard deviation for each plate.
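For readers following along, the per-plate step described above can be sketched roughly as below. This is only a minimal pandas sketch, not the paper's implementation: the column names `plate` and `is_control` and the function name `normalize_to_controls` are hypothetical, and using the population standard deviation (`ddof=0`) is an assumption.

```python
import numpy as np
import pandas as pd


def normalize_to_controls(df, feature_cols, plate_col="plate", control_col="is_control"):
    """Z-score every row against the untreated controls on the same plate.

    `plate_col` and `control_col` are hypothetical column names.
    """
    controls = df[df[control_col]]
    grouped = controls.groupby(plate_col)[feature_cols]
    mean = grouped.mean()
    std = grouped.std(ddof=0)  # population std; ddof=0 is an assumption

    # Look up the statistics of each row's own plate.
    plates = df[plate_col].to_numpy()
    mu = mean.reindex(plates).to_numpy()
    sd = std.reindex(plates).to_numpy()

    out = df.copy()
    # A feature with zero control std on a plate yields inf or NaN here.
    with np.errstate(divide="ignore", invalid="ignore"):
        out[feature_cols] = (df[feature_cols].to_numpy(dtype=float) - mu) / sd
    return out
```

After this step the controls on every plate have mean 0 and standard deviation 1 for each feature, and treated cells are expressed in units of control standard deviations.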

Interestingly, in one plate the standard deviation of some features in the normalizing population is 0, whereas the treated cells have non-zero values for those features. I am not sure how to carry on with the normalization. Perhaps, for those features, I could use the mean and standard deviation of all cells in that plate instead of only the untreated cells?

Any suggestions will be appreciated.

Assuming the variance is being calculated correctly (i.e. it is not a result of NAs), those features can be removed. Zero-variance features carry no information, at least in the context you are using them. See https://cytomining.github.io/cytominer/reference/variance_threshold.html for how to remove them.
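If you are working in Python rather than R, a quick way to list such features is sketched below. This is not the cytominer function linked above, just a small pandas check; the function name `zero_variance_features` is hypothetical. It also flags all-NaN columns, which is one way the "result of NAs" case can show up.

```python
import pandas as pd


def zero_variance_features(df, feature_cols):
    """Return the features that take at most one distinct non-NaN value in df."""
    # nunique() ignores NaN by default, so an all-NaN column counts as 0 values.
    n_unique = df[feature_cols].nunique(dropna=True)
    return n_unique[n_unique <= 1].index.tolist()
```

Running it once on the control rows of a plate and once on all rows of the same plate shows whether a feature is constant only in the controls or across the whole plate.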

Thanks for the reply. The variance should be calculated correctly.

Since I am only computing variance on my negative control images (for later normalization), the variance for treated images is not necessarily 0. In this case, should I also remove these features?

Good question. I’d check the variance in the treated images. I highly doubt it would be non-zero if the variance in the negative controls is exactly zero.

Lmk what you get.

I would have expected those features to have 0 variance in the treated images too.

However, some plates do have non-zero variance in treated images while having zero variance in the negative control.

One solution I can think of is to use the standard deviation of all images (control+treated) to normalize those special features in those very plates.

That’s surprising.

But I like your approach! If you can assume a random distribution of perturbations across the plates, this is perfectly sound. If not, you’d have some bias. In practice, using all wells on a plate to calculate summary stats is usually OK.

I am using the correlation matrix of the negative controls’ feature medians to assess batch effects. The visualizations are kind of hard to interpret. I am comparing two different normalization strategies:

| Normalization Method | Description |
| --- | --- |
| Negative Control Within-plate Normalization | Use the within-plate mean and standard deviation of the negative controls to normalize each plate. If the variance of a feature is 0, I use the statistics of all images instead. If the variance over all images is still 0, I give up normalizing this feature in this plate (this happens very rarely). |
| All Feature Within-plate Normalization | Use the within-plate mean and standard deviation of all images (control + treated) to normalize each plate. If the variance of a feature is 0, I give up normalizing this feature in this plate (this happens very rarely). |
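The two strategies in the table can be sketched in one function, shown below as a rough illustration rather than the exact code used in this thread. The column names `plate` and `is_control`, the `use_controls` switch, and the `min_std` threshold are all hypothetical, and "give up normalizing" is interpreted here as leaving the raw values unchanged, which is an assumption. The sketch also assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd


def normalize_with_fallback(df, feature_cols, plate_col="plate",
                            control_col="is_control", use_controls=True,
                            min_std=0.0):
    """Per-plate z-scoring with the fallbacks described in the table.

    use_controls=True  -> negative-control statistics, falling back to all images
    use_controls=False -> statistics of all images on the plate
    """
    out = df.copy()
    out[feature_cols] = out[feature_cols].astype(float)
    for _, idx in df.groupby(plate_col).groups.items():
        block = df.loc[idx, feature_cols].astype(float)
        is_ctrl = df.loc[idx, control_col].to_numpy(dtype=bool)
        ref = block[is_ctrl] if use_controls else block
        mean, std = ref.mean(), ref.std(ddof=0)

        # Fallback 1: degenerate (zero or NaN) std -> statistics of all images.
        degenerate = ~(std > min_std)
        mean[degenerate] = block.mean()[degenerate]
        std[degenerate] = block.std(ddof=0)[degenerate]

        # Fallback 2: still degenerate -> leave the feature as-is (assumption).
        still = ~(std > min_std)
        mean[still] = 0.0
        std[still] = 1.0

        out.loc[idx, feature_cols] = (block - mean) / std
    return out
```

With floating-point features, a constant column can produce a tiny non-zero standard deviation, so a small positive `min_std` may be safer than testing for exactly 0.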

### After All Feature Within-plate Normalization

Because I am only visualizing the negative control features across all plates (with sorted plate number), I would expect the ideal correlation matrix to be:

1. Diagonal entries have high values
2. Off-diagonal entries have relatively low values
3. Off-diagonal entries have random and uniform value distribution
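For anyone wanting to reproduce this kind of check, a correlation matrix of negative-control median profiles could be computed roughly as follows. This is a sketch under assumptions: it takes one median profile per (plate, well) group, which may not be exactly how the matrices in this thread were built, and the column names `plate`, `well` and `is_control` are hypothetical.

```python
import numpy as np
import pandas as pd


def control_median_correlation(df, feature_cols, group_cols=("plate", "well"),
                               control_col="is_control"):
    """Pearson correlation between median feature profiles of control groups."""
    controls = df[df[control_col]]
    # One median profile per group, sorted so that plates appear in order.
    medians = controls.groupby(list(group_cols))[feature_cols].median().sort_index()
    # Rows of `medians` are groups; transpose so corr() compares groups.
    return medians.T.corr()
```

The result is a square DataFrame that can be passed to any heat map function; with groups sorted by plate, batch effects tend to appear as blocks of high correlation along the diagonal.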

Based on the heat maps above, does this mean the second normalization strategy is better, or that neither of them is actually working? I find it quite interesting that normalizing with more than just the negative control features yields better results.