The code you linked to is an older version than what the portal is using today, but realistically I don't think there are any substantive differences between the version you linked and the version we use now. (I think most of the changes since that paper was written were just packaging of the code, replacing R code with Python, etc.)
Clearly we're close, but not quite reproducing the portal's results. A few further questions:
1. The documentation states that the top 1,000 features are selected by Pearson correlation, but the pipeline appears instead to rank them by f_regression (run_ensemble.py, line 479). Can you clarify which criterion is used to generate the portal results? (See the first sketch below this list.)
2. Since the cross-validation folds are permutation-based, do you set the RNG seed? If not, do you know which seed was used for the portal results, and how we could match it? As noted above, we set Python's seed to 42. (Second sketch below.)
3. Can you confirm that the pipeline's input data match the "Custom Downloads" interface? (Accessed via the "Download Input data" link on a gene's Predictability tab, filtered to the datasets listed in the model's "Feature Sets" popup.)
   a. We noticed that the mutation files (both damaging and hotspot) contain mutation counts per cell line rather than Booleans. Is that consistent with what the pipeline expects? (Third sketch below.)
   b. Does lineage_1 from the "Add cell line metadata to download" checkbox match the lineage feature the pipeline uses?
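
On question 1, part of why we flagged it: as far as we can tell, f_regression's F statistic is a monotone function of squared Pearson correlation, so on fully observed data the two criteria should pick the same top 1,000 features, and any divergence would have to come from missing-value handling or ties. A minimal sketch on synthetic data (our own variable names, not the pipeline's code):

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5000))          # 500 cell lines x 5000 candidate features
y = 0.5 * X[:, 0] + rng.normal(size=500)  # synthetic gene-effect vector

# Ranking by f_regression's F statistic...
f_stat, _ = f_regression(X, y)
top_f = np.argsort(f_stat)[::-1][:1000]

# ...versus ranking by absolute Pearson correlation.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top_r = np.argsort(np.abs(r))[::-1][:1000]

# F = r^2 / (1 - r^2) * (n - 2) is monotone in r^2, so the two
# top-1000 sets agree here (ties and NaNs aside).
assert set(top_f) == set(top_r)
```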
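
On question 2, this is how we seeded things on our end. It's only a sketch: KFold and n_splits=5 are our assumptions, not taken from run_ensemble.py. If the pipeline permutes samples through numpy's RNG or a per-estimator random_state, seeding Python's random module alone wouldn't reproduce the portal's folds, which is what prompted the question.

```python
import random
import numpy as np
from sklearn.model_selection import KFold

SEED = 42
random.seed(SEED)     # Python's stdlib RNG -- the seed we set above
np.random.seed(SEED)  # numpy's global RNG, in case the pipeline draws from it

# A shuffled K-fold split is reproducible only with a fixed random_state;
# the fold count here is a placeholder, not necessarily the pipeline's.
kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
```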
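
And to make question 3a concrete, this is the conversion we experimented with on our side. The file name and layout are placeholders for the files we pulled from Custom Downloads; we'd like to confirm whether the pipeline consumes the raw counts or a Boolean matrix like this:

```python
import pandas as pd

# Placeholder file name for our Custom Downloads export (cell lines x genes).
counts = pd.read_csv("mutations_damaging.csv", index_col=0)

# Collapse per-cell-line mutation counts to presence/absence.
booleans = (counts > 0).astype(int)
```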
Thank you again for your help!

Best regards,
Brendan