Random forest code

bav · June 6, 2025, 4:07pm

Hi Phil, Simone, DepMap team,

I’m interested in running the code underlying the Predictability tabs, which predicts dependency scores from baseline omics. I found the methodology preprint via the readme (https://www.biorxiv.org/content/10.1101/2020.02.21.959627v3), and its apparently-associated github repo (depmap-crispr-vs-rnai/src/ensemble_prediction_pipeline at main · broadinstitute/depmap-crispr-vs-rnai · GitHub).

Can you confirm if this github repo still matches the pipeline you use to generate the predictability tab data? If not, can you please point me to it?

Thank you very much - Brendan

pmontgom · June 10, 2025, 7:50pm

The code you linked to is an older version than what the portal is using today. Realistically, I don’t think there are any substantive differences between the version you link and the version we use today. (I think most of the differences since that paper was written were just packaging of the code, replacing R code with python, etc)

Regardless, you can find the code that the portal is currently using in this repo: GitHub - broadinstitute/cds-ensemble

Thanks
Phil

bav · July 1, 2025, 2:12pm

Hi Phil,

Thank you for your help! We have partially reproduced the 25Q2 portal results for TP53 Core Omics:

Us:

Rank	Gene_Symbol	Feature_Type	Importance_Score
1	EDA2R	Expression	39.6%
2	TP53	Damaging Mutations	14.8%
3	CHD9NB	Expression	1.45%
4	ScreenMADNonessentials	Confounders	1.10%
5	STON2	Expression	0.91%
6	CHST3	Expression	0.66%
7	ARHGEF19	Expression	0.64%
8	CDKN1A	Expression	0.61%
9	ATOSB	Expression	0.58%
10	ASS1	Expression	0.58%

Portal:

Clearly we’re close but not quite reproducing the results. Further questions:

The documentation states that the top 1,000 features are selected by Pearson correlation, but the pipeline appears to instead be ranking by f_regression (run_ensemble.py, line 479). Can you clarify which is being used to generate the portal results?
Noting that cross-validation is permutation based, do you set the RNG seed? If not, do you know what seed was used for the portal results? How can we match it? We set python’s seed to 42 above.
Can you confirm that the pipeline’s input data matches the “Custom Downloads” interface? (Accessed via the “Download Input data” link on a gene’s Predictability tab, and filtered to those listed in the Model’s “Feature Sets” popup.)
a. We noticed that the mutation files (both damaging and hotspot) use mutation counts per cell line (instead of Boolean) – is this consistent with the pipeline?
b. Does lineage_1 from the “Add cell line metadata to download” checkbox match the lineage the pipeline uses?

Thank you again for your help! Best regards,
Brendan

edit: Attaching our model’s metrics in case they’re helpful
TP53_CoreOmics_Results.csv (1.5 KB)

Topic		Replies	Views
Question on unreleased metrics: "Gene Confidence" and "Predictability" Q&A	2	586	March 25, 2021
A question about differences between Top Co-dependencies and Predictability data on CRISPR Q&A	2	354	June 27, 2023
Announcing the 22Q4 Release! Announcements	1	3932	February 8, 2023
Announcing the 24Q2 Release Announcements	1	2748	June 3, 2024
Announcing the 25Q2 Release Announcements	2	135	June 30, 2025

Random forest code

Related topics