I was reading a really helpful blog post written by @Joshua_Dempster on “gene confidence scores” to help assess the reliability / utility of the DepMap data for a gene of interest:
Assessing Confidence in Achilles Gene Profiles
The approach made a lot of sense, and I would like to replicate it to gain insight into a few genes of interest. However, I noticed that two of the data sources are not publicly available and were instead pulled from internal files via taiga:
(1) NormLRT scores
gene_summary = tc.get(name="summary-table-0720", file='Target-Discovery-20Q2-internal')
(2) Gene “predictability”
predictions_full = tc.get(name='predictability-d5b9', file='ensemble-regression-complete')
In a previous post, I was able to figure out how to calculate (1) - LRT scores - which agreed with some internal code posted by a DepMap team member (NormLRT code availability).
However, I’m not sure how to compute (2), gene “predictability”. I’m assuming it corresponds to the performance of an unbiased model that uses multiple feature types (RNA expression, mutations, CNA, GE scores, etc.) to predict a given gene’s GE score, implemented either with a Random Forest as in Josh’s preprint or perhaps by running ATLANTIS.
I will try generating a model using ATLANTIS or RF (via scikit-learn), but I am curious how the underlying data are structured and how they are generated. Can you say whether you use “predictability” scores generated from ATLANTIS or RF?
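For concreteness, here is a minimal sketch of what I have in mind for the RF route. It is only my guess at a reasonable definition, not the method behind the internal file: I am assuming "predictability" means the Pearson correlation between out-of-fold Random Forest predictions and the observed gene effect. The function name `predictability_score`, the hyperparameters, and the synthetic stand-in data are all my own placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def predictability_score(features, gene_effect, n_splits=5, seed=0):
    """Pearson r between out-of-fold RF predictions and observed gene effect.

    ASSUMPTION: this metric is my guess, not a confirmed DepMap definition.
    """
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each cell line is predicted by a model that never saw it during training.
    predicted = cross_val_predict(model, features, gene_effect, cv=folds)
    r, _ = pearsonr(predicted, gene_effect)
    return float(r)


# Synthetic stand-in data (hypothetical): rows are cell lines, columns are
# features such as expression or copy number.
rng = np.random.default_rng(0)
n_lines, n_features = 200, 20
features = rng.normal(size=(n_lines, n_features))

# A "predictable" gene whose effect depends on the first feature, plus noise.
gene_effect = -1.0 * features[:, 0] + rng.normal(scale=0.3, size=n_lines)
# An "unpredictable" gene whose effect is unrelated to every feature.
noise_only = rng.normal(size=n_lines)

score_signal = predictability_score(features, gene_effect)
score_noise = predictability_score(features, noise_only)
print(f"predictable gene r = {score_signal:.2f}")
print(f"noise-only gene r = {score_noise:.2f}")
```

With real data, `features` would be the combined feature matrix and `gene_effect` the GE column for one gene, restricted to the cell lines present in both. Feature importances could be read from a model fit on all lines via `feature_importances_`, although I don't know whether that matches what the internal pipeline reports.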
Of course, if you plan to make either of these files publicly available, it would make life a lot easier. Are there plans to release scores for either “gene predictability” (along with feature importance scores) or “gene confidence scores” any time soon?