Calculating Enrichment score in python when cell count for each category is known

Hi,

I am working on a python script to process output data from a CellProfiler pipeline. The pipeline filters cells into 3 different categories based on rules decided by the FastGentleBoosting algorithm in CPA. The python script will then (among other things) collect the number of cells in each category from the output spreadsheet. I also want to find the Enrichment score for each sample, to use it as a hit marker. I have not found a way to get this score from the CellProfiler pipeline, so I have tried to implement the Enrichment score calculations from CPA (found in github) into my python script. I have included all the necessary functions from all the necessary .py files.

My issue is that when I try to fit the beta-binomial, I don’t get an alpha value (only ‘nan’). I am quite sure that the reason for the issue is the data in the counts variable. In classifier.py, counts is:

counts = groupedKeysAndCounts[:, -nClasses:]

where

groupedKeysAndCounts = np.array([list(k) + vals.tolist() for k, vals
                       in dm.SumToGroup(imData, group).items()], dtype=object)

I am not sure how the data in groupedKeysAndCounts is organized, so I suspect that the beta-binomial fit function requires the data in the counts variable to be organized in some specific way.

In my python script, I have collected the number of cells in each category for each sample. As an example, I could in one sample have 178 cells in Category 1, 82 in Category 2 and 102 in Category 3. I would then expect that I could organize the counts variable like this:

counts = [178, 82, 102]

However using this list does not work, as there will be issues with calculating the value for the s variable in the dirichlet_moment_match function (polyafit.py), because m2ok - aok*aok is 0 and s becomes ‘inf’.

Could you help me with any suggestions on what data should be included in the counts variable for me to get the Enrichment score of each category? Also, if you think that the issue is coming from somewhere else, please let me know. Any help is appreciated!

Cheers,
Krister

EDIT: After some debugging and reading Minka’s paper, I figured that that the dirichlet_moment_match function requires the counts for each category of all samples in a single array to estimate alpha. When calculating the scores however, the counts variable should only contain the category counts for one sample, as I assumed previously. Thus, the alpha values are not independent between each sample, but dependent on the entire sample set.

I’ll leave the solution here, in case anyone else ever need to figure this out.

1 Like