How is the probability of dependency different from the gene effect score?
The “gene effect” file contains the corrected CERES scores, which measure the effect size of knocking out a gene, normalized against the distributions of non-essential and pan-essential genes. The probabilities assess, given a gene score, how likely that gene is to be a member of the non-essential distribution or the common essential distribution in that cell line. The key difference between using a fixed threshold on CERES score and a threshold on the probabilities is that the probabilities take into account the screening quality, which varies from line to line.
So which one should I use?
Which measure to use depends on the question you are asking. If you are interested in potentially subtle variation in the strength of killing, such as when computing co-dependency correlations, using the CERES scores makes sense. However, if you are only interested in binary relationships (which lines are killed and which are not), for example when looking for biomarkers that classify lines as sensitive or insensitive, then the dependency probabilities may make more sense to use.
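As a rough illustration of the co-dependency case, the sketch below correlates one gene's effect scores with every other gene's across cell lines. The table layout (cell lines as rows, genes as columns), the toy values, and the helper name codependency are assumptions made for this example, not part of the released files.

```python
import pandas as pd

# Toy gene effect table (assumed layout: rows = cell lines, columns = genes).
# All values are made up for illustration.
gene_effect = pd.DataFrame(
    {
        "A": [-1.0, -0.5, 0.0, -0.2],
        "B": [-1.9, -0.9, 0.1, -0.3],   # tracks A closely
        "C": [0.0, -0.5, -1.0, -0.8],   # moves opposite to A
    },
    index=["L1", "L2", "L3", "L4"],
)


def codependency(effect: pd.DataFrame, gene: str) -> pd.Series:
    """Pearson correlation of `gene` with every other gene (hypothetical helper)."""
    others = effect.drop(columns=gene)
    # corrwith aligns on the row index and correlates each column with the query gene.
    return others.corrwith(effect[gene]).sort_values(ascending=False)


ranked = codependency(gene_effect, "A")
```

Because the continuous scores are used directly, small differences in effect size contribute to the correlation, which is the kind of information a binary call would discard.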
I’m a computationalist and I want gene scores with no copy number corrections or other fancy processing. How can I get them?
Starting with the matrix logfold_change, you can use guide_gene_map to group rows (guides) by gene and summarize by median, mean, or another function. Then, group the columns (replicates) by cell line using replicate_map and summarize by mean or median again.
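The two grouping steps can be sketched in pandas as follows. This is a minimal example on toy data: the column names sgrna, gene, replicate_ID, and cell_line are assumptions, as is the helper name naive_gene_scores, so check them against the headers of the files you actually downloaded.

```python
import pandas as pd

# Toy stand-ins for the real files. Values and column names are assumptions.
logfold_change = pd.DataFrame(
    {
        "rep1": [-1.0, -3.0, 0.2, 0.4],
        "rep2": [-2.0, -4.0, 0.0, 0.2],
        "rep3": [-0.5, -1.5, 0.1, 0.3],
    },
    index=["g1", "g2", "g3", "g4"],  # rows = guides, columns = replicates
)
guide_gene_map = pd.DataFrame(
    {"sgrna": ["g1", "g2", "g3", "g4"], "gene": ["A", "A", "B", "B"]}
)
replicate_map = pd.DataFrame(
    {"replicate_ID": ["rep1", "rep2", "rep3"], "cell_line": ["L1", "L1", "L2"]}
)


def naive_gene_scores(lfc, guide_map, rep_map, guide_agg="median", rep_agg="mean"):
    """Collapse guides to genes, then replicates to cell lines (hypothetical helper)."""
    # Step 1: attach a gene label to each guide row, then summarize per gene.
    # A merge keeps guides that map to more than one gene under each of those genes.
    pairs = guide_map[["sgrna", "gene"]].drop_duplicates()
    labeled = pairs.merge(lfc, left_on="sgrna", right_index=True)
    gene_by_rep = labeled.drop(columns="sgrna").groupby("gene").agg(guide_agg)

    # Step 2: group replicate columns by cell line and summarize again.
    rep_to_line = rep_map.set_index("replicate_ID")["cell_line"]
    gene_by_line = gene_by_rep.T.groupby(rep_to_line).agg(rep_agg).T
    return gene_by_line  # rows = genes, columns = cell lines


scores = naive_gene_scores(logfold_change, guide_gene_map, replicate_map)
```

The summary functions are passed by name so that median and mean can be swapped at either step. Guides or replicates missing from the maps are silently dropped here, which may or may not be what you want.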
What thresholds should I use to decide if a gene is really having a significant effect on a cell line?
Although it depends on the risk of false positives you’re willing to tolerate, for most applications it makes sense to call a gene a dependency in a cell line when its gene dependency probability is 0.5 or greater. For gene effect, a score less than -0.5 represents depletion in most cell lines, while a score less than -1 represents strong killing.
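Applied to a single gene, these cutoffs might look like the sketch below. The toy values, the constant names, and the helper summarize_gene are assumptions for illustration; adjust the cutoffs to the false positive rate you can tolerate.

```python
import pandas as pd

# Cutoffs taken from the answer above; tune them for your own application.
PROB_CUTOFF = 0.5
EFFECT_DEPLETED = -0.5
EFFECT_STRONG = -1.0

# Toy scores for one gene across three cell lines (made-up values).
effect = pd.Series([-1.3, -0.7, -0.1], index=["L1", "L2", "L3"])
prob = pd.Series([0.98, 0.62, 0.03], index=["L1", "L2", "L3"])


def summarize_gene(effect: pd.Series, prob: pd.Series) -> pd.DataFrame:
    """Binary calls per cell line for one gene (hypothetical helper)."""
    # Comparisons against NaN evaluate to False, so missing scores are never called.
    return pd.DataFrame(
        {
            "dependent": prob >= PROB_CUTOFF,
            "depleted": effect < EFFECT_DEPLETED,
            "strong_killing": effect < EFFECT_STRONG,
        }
    )


calls = summarize_gene(effect, prob)
```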
What does a positive CERES score mean?
It indicates that when you knock out the gene, the cell line grows faster. For example, TP53 has a positive score in most p53-wt cell lines. However, considerable caution should be used when interpreting positive scores. We’ve found that many outgrowths in CRISPR data appear to be random. For example, in some cases outgrowth occurs for only one guide in one replicate, or occurs for unexpressed genes. Any event that grants a fitness advantage can cause clonal outgrowth and may have nothing to do with the targeted gene.
Why are some cell lines not showing up in the results when certain genes are searched in the combined RNAi dataset?
Different cell lines were screened with different shRNA libraries, so the set of genes with available scores differs between lines. Most notably, cell lines screened using only the Novartis DRIVE libraries will only have gene scores for around half of genes. Additional constraints on the set of shRNAs targeting a given gene can also influence whether a score is available for that gene. See the DEMETER2 paper for more details.