This is a really useful discussion on multiple levels, but particularly for me as I prepare our preprint for official submission.
tl;dr - I think everything needs to be used with extreme caution and validated for each specific application, at least until we come up with more generalized models for microscope systems as opposed to models for specific sample types. That said, I also believe that even if a prediction isn't 100% accurate, that doesn't mean it isn't 100% useful. Sometimes the cost of 100% accuracy far exceeds the benefits of what you can do with lower accuracy - in extreme cases it's an all-or-nothing situation!
I think the current state of the technology is such that we must have some way to validate the output of our models for each specific application we want to use them for. This is in contrast to deconvolution or other denoising methods, which are based on fixed, well-established rules and assumptions.
I do hope to someday generate a model for each microscope instead of each sample type, or some other way of creating a more generalized model that can work for whatever unknown sample we put onto the scope. But we are not there yet.
As for why superresolution can work on such low-information-content input, I think of our output as a prediction based on past knowledge (i.e. the "content-aware" in "CARE" is not just a cool-sounding acronym - it has meaning!). So if the model has seen enough training data relevant to the test data, it can make a pretty damn good prediction about what is really going on in the image. Whether that prediction is sufficient for drawing meaningful, useful scientific conclusions is up to the researchers (and reviewers) and whether they can validate their findings.
As for the utility of a prediction that is less than 100% accurate compared to ground truth (i.e. why not just get ground truth data?), I think we explained this fairly well in our paper: you can image faster and at lower doses than is physically possible with "ground truth" settings. That opens up new possibilities for imaging experiments.
How do we validate our model when we can't take simultaneous high- vs. low-resolution movies on our system? What we did was apply the same semi-synthetic "crappifier" method to some ground truth data, then test the accuracy of the prediction - FOR OUR FEATURES OF INTEREST. I've gotta emphasize that last part, because it is entirely possible that if we tested it on some other feature, it could very well fail and we'd need to recalibrate our model, training data, etc. accordingly. But for our live imaging, we were interested in two major feature types: mitochondrial dynamics (i.e. fission/fusion) in cultured cell lines, and mitochondrial motility in neurons.
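For anyone curious what a crappifier looks like in practice, here's a minimal sketch in Python. To be clear, this is not our exact pipeline - the function name, scale factor, and noise parameters are all illustrative - but downsampling plus injected noise is the core idea:

```python
import numpy as np
from skimage.transform import rescale

def crappify(hr_img, scale=4, gauss_sigma=0.01, poisson_scale=1.0):
    """Semi-synthetically degrade a ground truth image so that
    (degraded, ground truth) pairs can be used for training and
    validation without simultaneous low/high quality acquisitions."""
    img = hr_img.astype(np.float32)
    img = img / img.max()                               # normalize to [0, 1]
    lr = rescale(img, 1.0 / scale, anti_aliasing=True)  # simulate a low-res acquisition
    rng = np.random.default_rng()
    lr = rng.poisson(lr * 255 * poisson_scale) / (255 * poisson_scale)  # shot noise
    lr = lr + rng.normal(0.0, gauss_sigma, lr.shape)    # additive read noise
    return np.clip(lr, 0.0, 1.0)
```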
For the mito dynamics, we validated our model by looking at its ability to detect (generate?) breaks vs. continuities in mitochondria. Those two feature types are ultimately what we need to be able to see if we want to detect fission events, for example.
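One simple (hypothetical) way to score that kind of break-vs-continuity agreement - not necessarily exactly how we implemented it - is to threshold each image and compare connected-component counts between prediction and ground truth:

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label

def count_segments(img):
    """Count distinct objects after Otsu thresholding; a break in a
    mitochondrion turns one labeled object into two."""
    binary = img > threshold_otsu(img)
    return int(label(binary).max())

# Fraction of matched image pairs where prediction and ground truth
# agree on the number of mitochondrial segments (predictions and
# ground_truths are placeholder lists of paired images):
# agreement = np.mean([count_segments(p) == count_segments(g)
#                      for p, g in zip(predictions, ground_truths)])
```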
For the neuronal motility, we focused on a key problem: when mitochondria move past one another, they "blend" into a single structure, which makes tracking difficult. So we checked whether our model could reliably "resolve" individual mitos as they moved past one another.
Finally, we used image-quality metrics (PSNR/SSIM/FRC) to measure the ability of our models to at least reasonably outperform naive interpolation-based upsampling for superresolution (e.g. bilinear), and found that our model always did.
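For reference, that baseline comparison only takes a few lines with scikit-image. Function and variable names here are illustrative, and `data_range=1.0` assumes images normalized to [0, 1]:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from skimage.transform import resize

def compare_to_bilinear(lr_img, pred, gt):
    """Score the model prediction against a naive bilinear upsample
    of the same low-res input, relative to ground truth."""
    baseline = resize(lr_img, gt.shape, order=1)  # order=1 -> bilinear interpolation
    return {
        "psnr_model": peak_signal_noise_ratio(gt, pred, data_range=1.0),
        "psnr_bilinear": peak_signal_noise_ratio(gt, baseline, data_range=1.0),
        "ssim_model": structural_similarity(gt, pred, data_range=1.0),
        "ssim_bilinear": structural_similarity(gt, baseline, data_range=1.0),
    }
```

The model "wins" on a given image when psnr_model > psnr_bilinear (and likewise for SSIM).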
One interesting thing we played with but did not include in our manuscript (or in my Twitter thread) is a control I've not seen elsewhere: we measured the PSNR/SSIM between individual ground truth acquisitions, which I was thinking could provide a "ceiling" for what we might expect our model to be able to do. The problem is that the ground truth data is extremely sensitive to movement, noise, etc. in the sample, and so our PSNR/SSIM values for this control were actually lower than those of our model output. The other confounding issue is that our loss function is MSE, and since PSNR is just a log transform of MSE (PSNR = 10·log10(MAX²/MSE)), minimizing MSE directly maximizes PSNR - so our model was already trained to optimize the very metric we were scoring it with, which I think is kind of "cheating". Long story short, I think PSNR/SSIM are useful but should be interpreted with caution.
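If anyone wants to try that control themselves, the idea is just to treat one ground truth acquisition as the "prediction" and a second, back-to-back acquisition of the same field as the reference (gt_a/gt_b below are placeholders for two such normalized images):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# gt_a, gt_b: two sequential "ground truth" acquisitions of the same
# field of view, normalized to [0, 1]. Any drift, photobleaching, or
# sample movement between the two acquisitions drags these values down,
# which is exactly the confound described above.
ceiling_psnr = peak_signal_noise_ratio(gt_a, gt_b, data_range=1.0)
ceiling_ssim = structural_similarity(gt_a, gt_b, data_range=1.0)
```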