How comparable are the Achilles and CRISPR "gene dependency" files?

Similar to this question: What's the difference between Achilles_gene_effect.csv and CRISPR_gene_effect.csv?

The current README file says:

Achilles_gene_dependency.csv
Pipeline: Achilles
*Post-Chronos* Probability that knocking out the gene has a real depletion effect using gene_effect.
- Columns: genes in the format “HUGO (Entrez)”
- Rows: cell lines (Broad IDs)

and

CRISPR_gene_dependency.csv

Pipeline: Achilles

Gene Dependency Probabilities represent the likelihood that knocking out the gene has a cell growth inhibition or death effect. These probabilities are derived from the scores in CRISPR_gene_effect.csv as described here: https://doi.org/10.1101/720243
- Columns: genes in the format “HUGO (Entrez)”
- Rows: cell lines (Broad IDs)

So the Achilles data was processed with Chronos, but the CRISPR data wasn’t? At least the README doesn’t say it was.

What is usually considered the “best” gene dependency data set?

Hi abalter,

CRISPR_gene_dependency is created from CRISPR_gene_effect in the same way that Achilles_gene_dependency is created from Achilles_gene_effect. CRISPR_gene_effect is generated from separate Chronos matrices using Harmonia, as described in the README:

CRISPR_gene_effect.csv

Pipeline: Achilles

Gene Effect scores derived from CRISPR knockout screens published by Broad’s Achilles and Sanger’s SCORE projects.

Negative scores imply cell growth inhibition and/or death following gene knockout. Scores are normalized such that nonessential genes have a median score of 0 and independently identified common essentials have a median score of -1.

Gene Effect scores were inferenced by Chronos ( https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02540-7 )

Integration of the Broad and Sanger datasets was performed as described in https://doi.org/10.1038/s41467-021-21898-7, except that quantile normalization was not performed.
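To make the normalization in the README excerpt concrete (nonessential genes at a median of 0, common essentials at a median of -1), here is a minimal sketch of a shift-and-scale that produces that property. It assumes the medians are pooled over all cell lines at once, which the README does not spell out, and the function name and control gene lists are hypothetical. It is not the Chronos implementation.

```python
from statistics import median


def rescale_gene_effect(scores, nonessential, common_essential):
    """Shift and scale raw scores so the control medians land on 0 and -1.

    scores: dict mapping gene label -> list of raw scores, one per cell line.
    nonessential, common_essential: hypothetical lists of control gene labels.
    Assumption: medians are pooled over every cell line, not taken per line.
    """
    # Pool all scores of the nonessential controls and take their median.
    ne_median = median(v for g in nonessential for v in scores[g])
    # Same for the common-essential controls.
    ce_median = median(v for g in common_essential for v in scores[g])
    # The gap between the two control medians becomes one unit of the new scale.
    span = ne_median - ce_median
    if span <= 0:
        raise ValueError("common essentials must score below nonessentials")
    # Affine transform: nonessential median -> 0, common-essential median -> -1.
    return {g: [(v - ne_median) / span for v in vals] for g, vals in scores.items()}
```

On this scale a score near -1 means the knockout behaves like a typical common-essential gene, and a score near 0 means it behaves like a nonessential one.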


The two datasets almost completely overlap in terms of both cell lines and genes. So if I’m doing a naive statistical analysis and want a single number to represent gene essentiality for a given cell line, what is the right approach?

My analysis involves many other datasets such as GDSC drug sensitivity, TCGA expression, etc. So I’m less concerned with the details of each pipeline than with what experts consider the most definitive measure of essentiality (or dependency).
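One way to check how comparable the two files are is to restrict both to the cell lines and genes they share and correlate the values. The sketch below does that with the standard library only. It assumes each CSV has the cell line ID in the first column and gene labels in the header row, with empty cells for missing values; the function names are illustrative and the file paths in the usage comment are assumptions about where the files live.

```python
import csv
from math import sqrt


def read_matrix(path):
    """Read a cell-line-by-gene CSV into {cell_line: {gene_label: value}}.

    Assumes the first column is the cell line ID and empty cells are missing.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        genes = next(reader)[1:]  # header row minus the ID column
        return {
            row[0]: {g: float(v) for g, v in zip(genes, row[1:]) if v != ""}
            for row in reader
            if row  # skip blank lines
        }


def pearson(pairs):
    """Pearson correlation of (x, y) pairs; NaN if it is undefined."""
    n = len(pairs)
    if n < 2:
        return float("nan")
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def compare_on_overlap(a, b):
    """Return (n shared cell lines, n shared genes, Pearson r) for two matrices."""
    pairs, lines, genes = [], set(), set()
    for line in set(a) & set(b):
        # Only entries present in both files for this cell line are compared.
        for gene in set(a[line]) & set(b[line]):
            pairs.append((a[line][gene], b[line][gene]))
            lines.add(line)
            genes.add(gene)
    return len(lines), len(genes), pearson(pairs)


# Usage (paths are assumptions):
# achilles = read_matrix("Achilles_gene_dependency.csv")
# crispr = read_matrix("CRISPR_gene_dependency.csv")
# print(compare_on_overlap(achilles, crispr))
```

A single pooled correlation is a coarse summary; per-gene or per-cell-line correlations over the same overlap would show where the two files actually disagree.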

I would use CRISPR_gene_effect
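For the "single number per gene per cell line" use case, the lookup itself is simple once the file is loaded. Because columns are labeled “HUGO (Entrez)”, a lookup by HUGO symbol alone has to match against the full label. A minimal sketch, assuming the matrix is already in memory as a nested dict keyed by cell line and then by column label; the function names are hypothetical and the scores in the example are made up.

```python
import re

# Column labels follow the README's "HUGO (Entrez)" format, e.g. "KRAS (3845)".
LABEL = re.compile(r"^(\S+) \((\d+)\)$")


def find_gene_column(labels, hugo_symbol):
    """Return the full 'HUGO (Entrez)' label whose symbol matches exactly."""
    matches = []
    for label in labels:
        parsed = LABEL.match(label)
        if parsed and parsed.group(1) == hugo_symbol:
            matches.append(label)
    if len(matches) != 1:
        raise KeyError(f"expected one column for {hugo_symbol!r}, found {len(matches)}")
    return matches[0]


def lookup_gene_effect(matrix, cell_line, hugo_symbol):
    """Return the gene effect score for one gene in one cell line.

    matrix: {cell_line_id: {"HUGO (Entrez)": score}}
    """
    row = matrix[cell_line]
    return row[find_gene_column(row, hugo_symbol)]


# Illustrative values only, not real DepMap scores.
example = {"ACH-000001": {"KRAS (3845)": -1.2, "TP53 (7157)": 0.1}}
kras_score = lookup_gene_effect(example, "ACH-000001", "KRAS")
```

Matching the symbol exactly, rather than by prefix, avoids picking up a different gene whose name merely starts with the same letters.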
