How comparable are the Achilles and CRISPR "gene dependency" files?

Similar to this question: What's the difference between Achilles_gene_effect.csv and CRISPR_gene_effect.csv?

The current README file says:

Achilles_gene_dependency.csv
Pipeline: Achilles
*Post-Chronos* Probability that knocking out the gene has a real depletion effect using gene_effect. - Columns: genes in the format “HUGO (Entrez)” - Rows: cell lines (Broad IDs)

and

CRISPR_gene_dependency.csv

Pipeline: Achilles

Gene Dependency Probabilities represent the likelihood that knocking out the gene has a cell growth inhibition or death effect. These probabilities are derived from the scores in CRISPR_gene_effect.csv as described here: https://doi.org/10.1101/720243 - Columns: genes in the format “HUGO (Entrez)” - Rows: cell lines (Broad IDs)

So the Achilles data was processed with Chronos, but the CRISPR data wasn’t? At least the README doesn’t say it was.

What is usually considered the “best” gene dependency data set?

Hi abalter,

CRISPR_gene_dependency is created from CRISPR_gene_effect in the same way that Achilles_gene_dependency is created from Achilles_gene_effect. CRISPR_gene_effect is generated from separate Chronos matrices using Harmonia, as described in the README:

CRISPR_gene_effect.csv

Pipeline: Achilles

Gene Effect scores derived from CRISPR knockout screens published by Broad’s Achilles and Sanger’s SCORE projects.

Negative scores imply cell growth inhibition and/or death following gene knockout. Scores are normalized such that nonessential genes have a median score of 0 and independently identified common essentials have a median score of -1.

Gene Effect scores were inferenced by Chronos ( https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02540-7 )

Integration of the Broad and Sanger datasets was performed as described in https://doi.org/10.1038/s41467-021-21898-7, except that quantile normalization was not performed.
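To make the normalization quoted above concrete, here is a minimal sketch of rescaling a gene effect matrix so that nonessential genes have a median of 0 and common essentials have a median of -1. It is not the Chronos implementation: the function name, the toy values, and the choice to take one global median across all cell lines are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def rescale_gene_effect(effect, nonessential, essential):
    """Shift and scale scores so control gene sets land on 0 and -1.

    Assumption: medians are taken over all cell lines at once, not per line.
    """
    # Shift so the nonessential genes have a median score of 0.
    nonessential_median = np.nanmedian(effect[nonessential].to_numpy())
    shifted = effect - nonessential_median
    # Scale so the common essential genes have a median score of -1.
    essential_median = np.nanmedian(shifted[essential].to_numpy())
    return shifted / abs(essential_median)


# Toy matrix (hypothetical values): rows are cell lines, columns are genes.
raw = pd.DataFrame(
    {
        "A (1)": [0.1, 0.3],
        "B (2)": [0.2, 0.2],
        "C (3)": [-1.8, -2.2],
        "D (4)": [-0.5, -0.9],
    },
    index=["ACH-000001", "ACH-000002"],
)
nonessential_genes = ["A (1)", "B (2)"]
essential_genes = ["C (3)"]

scaled = rescale_gene_effect(raw, nonessential_genes, essential_genes)
nonessential_check = np.nanmedian(scaled[nonessential_genes].to_numpy())
essential_check = np.nanmedian(scaled[essential_genes].to_numpy())
```

Because both files are described as using this scale, a score near -1 reads as "about as strong as a typical common essential" in either one.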

The two datasets almost completely overlap in terms of both cell lines and genes. So if I’m doing a naive statistical analysis and want a single number to represent gene essentiality for a given cell line, what is the right approach?

My analysis involves many other datasets, such as GDSC drug sensitivity and TCGA expression. So I’m less concerned with the details of each pipeline than with what experts consider the most definitive measure of essentiality (or dependency).

I would use CRISPR_gene_effect

Hi @Joshua_Dempster. I don’t think I noticed this page on the integrated datasets before, or maybe it’s new. It has a link to the integrated datasets. This dataset contains two data files: CERES_FC.txt and CRISPRcleanR_FC.txt.

The stated goal of the second paper is

Here, we investigate the integrability of the full Broad/Sanger gene-dependency datasets, yielding the most comprehensive cancer dependency resource to date, encompassing dependency profiles of 17,486 genes across 908 different cell lines that span 26 tissues and 42 different cancer types.

Also, the pipeline diagram appears to have three inputs and one output, so I was expecting a single integrated dataset. Instead, there are still separate Broad (CERES) and Sanger (CRISPR) versions.

Would I be correct to assume that 1) the files on the DepMap download site represent the most up-to-date versions, and that 2) the CRISPR essentiality scores would be the ones to use if I only pick one?

Yes, the portal contains the up-to-date versions of the data, and that is what you should use. The dataset generated in support of the publication includes many forms of the data, not just the ones we recommend. If you were only going to use one, the integrated (CRISPR) files have the benefit of spanning more cell lines.
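For anyone who wants one number per gene and cell line, a minimal pandas sketch of reading a file laid out like CRISPR_gene_effect.csv might look like the following. It relies on the README's description (columns as “HUGO (Entrez)”, rows as Broad cell line IDs). The helper name, the `DepMap_ID` header of the first column, and the toy values are assumptions for illustration, not taken from the real file.

```python
import io
import re

import pandas as pd

# Matches column labels of the form "HUGO (Entrez)", e.g. "KRAS (3845)".
GENE_LABEL = re.compile(r"^(?P<symbol>.+) \((?P<entrez>\d+)\)$")


def load_gene_effect(source):
    """Read a gene effect table and relabel columns with the HUGO symbol only."""
    # First column holds the cell line IDs and becomes the row index.
    effect = pd.read_csv(source, index_col=0)
    symbols = []
    for label in effect.columns:
        match = GENE_LABEL.match(label)
        # Leave any label that does not fit the expected pattern unchanged.
        symbols.append(match.group("symbol") if match else label)
    effect.columns = symbols
    return effect


# Toy stand-in for the real CSV (hypothetical scores); pass a file path instead.
toy_csv = io.StringIO(
    "DepMap_ID,KRAS (3845),TP53 (7157)\n"
    "ACH-000001,-1.2,0.1\n"
    "ACH-000002,-0.3,0.4\n"
)
effect = load_gene_effect(toy_csv)

# Gene effect of KRAS knockout in one cell line.
kras_score = effect.loc["ACH-000001", "KRAS"]
```

Dropping the Entrez ID makes it easier to join against other resources keyed by gene symbol, but it can collide if two columns share a symbol, so keeping the full label is the safer choice when in doubt.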

Thanks again. That’s really helpful.