Random forest code

Hi Phil, Simone, DepMap team,

I’m interested in running the code underlying the Predictability tabs, which predicts dependency scores from baseline omics. I found the methodology preprint via the readme (https://www.biorxiv.org/content/10.1101/2020.02.21.959627v3) and its apparently-associated GitHub repo (broadinstitute/depmap-crispr-vs-rnai, under src/ensemble_prediction_pipeline).

Can you confirm if this github repo still matches the pipeline you use to generate the predictability tab data? If not, can you please point me to it?

Thank you very much - Brendan

The code you linked to is an older version of what the portal uses today, but I don’t think there are any substantive differences between that version and the current one. Most of the changes since the paper was written were packaging of the code, replacing R code with Python, etc.

Regardless, you can find the code that the portal is currently using in this repo: broadinstitute/cds-ensemble on GitHub.

Thanks
Phil


Hi Phil,

Thank you for your help! We have partially reproduced the 25Q2 portal results for TP53 Core Omics:

Us:

| Rank | Gene Symbol | Feature Type | Importance Score |
|------|-------------|--------------|------------------|
| 1 | EDA2R | Expression | 39.6% |
| 2 | TP53 | Damaging Mutations | 14.8% |
| 3 | CHD9NB | Expression | 1.45% |
| 4 | ScreenMADNonessentials | Confounders | 1.10% |
| 5 | STON2 | Expression | 0.91% |
| 6 | CHST3 | Expression | 0.66% |
| 7 | ARHGEF19 | Expression | 0.64% |
| 8 | CDKN1A | Expression | 0.61% |
| 9 | ATOSB | Expression | 0.58% |
| 10 | ASS1 | Expression | 0.58% |

Portal:

Clearly we’re close but not quite reproducing the results. Further questions:

  1. The documentation states that the top 1,000 features are selected by Pearson correlation, but the pipeline instead appears to rank by f_regression (run_ensemble.py, line 479). Can you clarify which is used to generate the portal results?
  2. Noting that the cross-validation is permutation based, do you set an RNG seed? If not, do you know which seed produced the portal results, and how can we match it? We set Python’s seed to 42 for the results above.
  3. Can you confirm that the pipeline’s input data matches the “Custom Downloads” interface? (Accessed via the “Download Input data” link on a gene’s Predictability tab, and filtered to those listed in the Model’s “Feature Sets” popup.)
    a. We noticed that the mutation files (both damaging and hotspot) use mutation counts per cell line (instead of Boolean) – is this consistent with the pipeline?
    b. Does lineage_1 from the “Add cell line metadata to download” checkbox match the lineage the pipeline uses?
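On question 1, for what it’s worth: if f_regression is the ranker, the selected top-1,000 set should coincide with ranking by absolute Pearson correlation, since the univariate F statistic is a monotone function of r². A quick numpy-only sanity check on synthetic data (this is our own illustration, not the pipeline’s code):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 0.8 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(size=n)

# Pearson correlation of each feature with y
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)

# The univariate F statistic that f_regression computes: F = r^2 (n-2) / (1 - r^2)
F = r ** 2 * (n - 2) / (1 - r ** 2)

# Ranking by F is identical to ranking by |r|, since F is monotone in r^2
assert np.array_equal(np.argsort(-F), np.argsort(-np.abs(r)))
```

So if this is the only difference, the two criteria should select the same feature set, and the discrepancy would lie elsewhere.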
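To make question 2 concrete, here is the kind of thing we mean (a hypothetical sketch of permutation-based fold assignment, not your pipeline): the split only reproduces across runs if the generator is seeded explicitly.

```python
import numpy as np

def cv_folds(n_samples, n_folds, seed=None):
    """Hypothetical permutation-based fold assignment (illustration only)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)  # shuffle sample indices
    return np.array_split(order, n_folds)

# With an explicit seed, the fold assignment is reproducible run-to-run
a = cv_folds(10, 3, seed=42)
b = cv_folds(10, 3, seed=42)
assert all(np.array_equal(x, y) for x, y in zip(a, b))
```

Without a fixed seed (or with a different seeding mechanism than ours), we would expect small run-to-run differences in the reported importances, which could explain part of the gap.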

Thank you again for your help! Best regards,
Brendan

edit: Attaching our model’s metrics in case they’re helpful
TP53_CoreOmics_Results.csv (1.5 KB)