Analyst does not keep track of fetched objects


I am using the developer’s version of CPAnalyst on Linux, so I don’t know if this is fixed in the Mac/Windows versions. I noticed that CPAnalyst does not keep track of the objects you have already fetched or classified. So, let’s say you have already fetched and classified some objects and then fetch new ones: some of the objects you already classified can show up again, and you may add the same object twice. I don’t think that is good for the classifier, since it will give more weight to that object. What I found more critical is that you can add the same object to different bins with no error message.

Since it’s difficult to keep visual track of all the objects you have classified in a large dataset, I think this may be a problem when classifying objects on the boundary between two classes, which may fall into either bin depending on the subjectivity of the person classifying. Have you thought about this? Is there a way to solve this problem?

Thank you very much!

P.S: I didn’t know if I had to post this in the Help or Bugs section. Please move the post if necessary.

Thanks, Juan, for bringing up this interesting topic. It is one we’ve spent a good deal of time thinking about here. You may be surprised, but the short answer is that this is a feature, not a bug.

To elaborate: if a user drops the same object into separate bins, then it’s not clear where that object belongs, nor is it clear that the user would be more accurate if we alerted them to it… so we let the training data fall where the user places it. We have written quite a bit about training philosophies and techniques in the manual, and encourage everyone to read it. I believe that classifying an object as both + and - has a very similar (if not identical) effect to not classifying the object at all. However, I did not write the classification code, so the mathematical explanation is left to my coworker Ray, who will probably point you to the manual.
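The intuition that a +/- pair roughly cancels can be illustrated with a toy weighted-vote classifier (this is only a sketch for intuition, not the actual algorithm CPAnalyst uses): each training example casts a vote for its label, weighted by similarity to the query, so an object placed in both bins contributes equal and opposite votes.

```python
# Toy weighted-vote classifier: each training example (value, label)
# votes label * similarity. An object labelled both + and - casts
# opposite votes of equal weight, so its net contribution is ~zero.

def similarity(a, b):
    return 1.0 / (1.0 + abs(a - b))

def vote_score(x, training):
    """Sum of label * similarity over all training examples."""
    return sum(label * similarity(x, t) for t, label in training)

base = [(0.0, +1), (1.0, -1)]
contradicted = base + [(0.5, +1), (0.5, -1)]  # same object in both bins

# The contradictory pair cancels (up to floating-point rounding):
assert abs(vote_score(0.3, base) - vote_score(0.3, contradicted)) < 1e-9
```

Under this kind of scheme, dropping an object into both bins is indeed nearly equivalent to never classifying it at all.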

On a related note, we recently learned of some research showing somewhat alarming inconsistencies between manual cell-phenotype classifications performed by different biologists. I can’t remember the numbers, but the take-home message was that two biologists phenotyping cells together will be significantly more accurate than either one doing the phenotyping alone. So… allowing a single cell to be put in a bin twice, or put in both bins, actually gives us more confidence in the classifier, since the training data then reflects the accuracy (or inaccuracy) of the user’s manual classifications.

On a side note, the software has been written to cater to high-throughput screens that typically contain between 1 and 100 million cells, so a user creating a training set with ~1000 cells in each bin is extremely unlikely to overtrain on any one particular cell.

Hope this has been helpful,

One more note: I do have one related item on my todo list which is:

“The N objects fetched should probably be unique.” That is, while we will not ensure that objects in each bin are unique, we’ll make sure that if you ask for 20 positive objects, you get 20 unique positive objects – or, in the case of extremely low-penetrance phenotypes, as many as the classifier can find before you get sick of waiting.
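The unique-fetch behaviour described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual CPAnalyst code: the names (`fetch_unique`, `candidates`, `already_fetched`) are made up, and the real fetcher would score objects with the classifier rather than draw from a precomputed pool.

```python
import random

def fetch_unique(candidates, n, already_fetched):
    """Return up to n object IDs from `candidates` that have not been
    fetched before. May return fewer than n when the pool runs dry,
    e.g. for very low-penetrance phenotypes."""
    fresh = [obj for obj in candidates if obj not in already_fetched]
    random.shuffle(fresh)           # avoid always surfacing the same objects
    picked = fresh[:n]
    already_fetched.update(picked)  # remember them for the next fetch
    return picked

# A second fetch from the same pool never repeats the first fetch.
seen = set()
pool = list(range(100))
first = fetch_unique(pool, 20, seen)
second = fetch_unique(pool, 20, seen)
```

Tracking fetched IDs in a set like this gives uniqueness across fetches without preventing the user from deliberately placing the same object in multiple bins.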