Thank you for this valuable platform. I am trying to understand a biologically interesting observation I made using CRISPR_gene_dependency.csv from the 22Q1 release.
Briefly, I binarized the probability matrix using a threshold of 0.8 and then calculated drop-out frequencies for genes in colorectal cancer, lung cancer, and lymphoma cell lines. I observe that the drop-out frequencies across cell lines from different tissues correlate very well, which is unexpected. I can’t explain biologically why a gene that drops out in 20% of the colorectal cancer cell lines is also likely to drop out in 20% of the lung cancer cell lines. Could this be an artifact of the model used in the probability calculations? Thanks.
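For reference, the calculation described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the exact script used: it assumes the probability matrix has cell lines as rows and genes as columns, and that a lineage label is available per cell line (for example from a sample annotation table). The function name `dropout_frequency`, the cell line IDs, and the gene names are all hypothetical.

```python
import pandas as pd

def dropout_frequency(prob, lineage, threshold=0.8):
    """Per-lineage fraction of cell lines in which each gene exceeds `threshold`.

    prob:    DataFrame, rows = cell lines, columns = genes (dependency probabilities)
    lineage: Series indexed by cell line, giving a lineage label (assumed input)
    """
    # Binarize, but keep missing probabilities as NaN so they do not count as "no drop-out".
    called = (prob > threshold).astype(float).where(prob.notna())
    # Align the labels to the matrix rows, then average the 0/1 calls within each lineage.
    labels = lineage.reindex(prob.index)
    return called.groupby(labels).mean().T  # rows = genes, columns = lineages

# Toy data with made-up IDs and values, for illustration only.
prob = pd.DataFrame(
    {"GENE_A": [0.95, 0.85, 0.99, 0.50], "GENE_B": [0.10, 0.90, 0.20, 0.81]},
    index=["LINE_1", "LINE_2", "LINE_3", "LINE_4"],
)
lineage = pd.Series(
    ["colorectal", "colorectal", "lung", "lung"],
    index=["LINE_1", "LINE_2", "LINE_3", "LINE_4"],
    name="lineage",
)

freq = dropout_frequency(prob, lineage)
print(freq)
```

With the real file, `prob` would come from something like `pd.read_csv("CRISPR_gene_dependency.csv", index_col=0)`, assuming the first column holds the cell line identifiers.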
Drop-out frequency plots are below:
This is a good question that I think raises a few important things to keep in mind in CRISPR analysis:
- The great majority of dependencies in any one cell line are common to most or all cell lines
- There’s a wide range in mean strength for these common dependencies
- Probability of dependency (i.e., the confidence with which you can call a knockout depleting) is inevitably related to the strength of the dependency, along with the quality of the screen.
On that last point, if you plotted mean gene_effect against the fraction of cell lines with gene_dependency > 0.8, you would see a very strong relationship. So we shouldn’t interpret your plot axes as literally saying “this gene is a dependency in X% of cell lines.” Rather, they mean “this gene dependency is strong enough to be identified with 80% confidence in X% of cell lines.” Most of these genes with dropout fraction > 0.25 probably have some genuine viability phenotype in most or all cell lines regardless of lineage. But the weaker the phenotype, the fewer the lines in which we will be able to call the dependency. Which specific lines we can detect weak dependencies in is more a function of random noise and screen quality than of tissue biology. This is why I generally advocate against thresholding and binarizing. Many apparent differences are just the effect of values falling just under or just over the cutoff.
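To make this concrete, here is a small simulation sketch of the argument. It is a toy model, not DepMap data: every gene gets one true effect shared by all lines (so there is no lineage biology at all), each line adds independent noise, and a fixed cutoff on the observed effect stands in for "probability > 0.8". The noise level, the cutoff, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_lines = 500, 40

# One true mean effect per gene, identical in every line: no lineage-specific biology.
true_effect = rng.uniform(-1.5, 0.0, size=n_genes)

def simulate_lineage(n_lines, noise_sd=0.3):
    """Observed effects (genes x lines) = shared true effect + independent per-line noise."""
    noise = rng.normal(0.0, noise_sd, size=(n_genes, n_lines))
    return true_effect[:, None] + noise

# Hypothetical cutoff on observed effect, used as a stand-in for a confident call.
cutoff = -0.5

# Two independent "lineages" that differ only in their random noise.
frac_a = (simulate_lineage(n_lines) < cutoff).mean(axis=1)
frac_b = (simulate_lineage(n_lines) < cutoff).mean(axis=1)

# Drop-out fractions agree across the two lineages ...
r_between = np.corrcoef(frac_a, frac_b)[0, 1]
# ... because both are driven by the shared strength of each gene's effect.
r_strength = np.corrcoef(true_effect, frac_a)[0, 1]

print(f"correlation between lineages: {r_between:.2f}")
print(f"correlation of true effect with drop-out fraction: {r_strength:.2f}")
```

In this setup the two lineages' drop-out fractions are strongly correlated even though nothing tissue-specific was simulated, which is the pattern in the question.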
A side note: the dependency probability is called individually in each cell line using the distribution of unexpressed genes and prior common essential gene effects in that line; no information is shared across lines.
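As a rough illustration of what calling a probability within a single line can look like, here is a deliberately simplified sketch. It is not the DepMap implementation: it assumes Gaussian fits to the two reference distributions and a fixed prior, and the function names, the prior of 0.1, and the simulated reference sets are all hypothetical.

```python
import math
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def dependency_probability(effect, null_effects, essential_effects, prior_essential=0.1):
    """Posterior probability that `effect` comes from the essential-like distribution.

    Both reference distributions are taken from the SAME cell line, so nothing is
    shared across lines. Gaussian fits and the prior are simplifying assumptions.
    """
    null_mean, null_sd = np.mean(null_effects), np.std(null_effects, ddof=1)
    ess_mean, ess_sd = np.mean(essential_effects), np.std(essential_effects, ddof=1)
    w_ess = prior_essential * normal_pdf(effect, ess_mean, ess_sd)
    w_null = (1.0 - prior_essential) * normal_pdf(effect, null_mean, null_sd)
    return w_ess / (w_ess + w_null)

# Simulated reference sets for a cleaner and a noisier screen (made-up parameters).
rng = np.random.default_rng(1)
clean_null, clean_ess = rng.normal(0.0, 0.2, 2000), rng.normal(-1.0, 0.3, 2000)
noisy_null, noisy_ess = rng.normal(0.0, 0.4, 2000), rng.normal(-1.0, 0.45, 2000)

effect = -0.7  # the same moderate gene effect observed in both screens
p_clean = dependency_probability(effect, clean_null, clean_ess)
p_noisy = dependency_probability(effect, noisy_null, noisy_ess)
print(f"clean screen: {p_clean:.2f}, noisy screen: {p_noisy:.2f}")
```

In this toy setup the same moderate effect clears a 0.8 threshold in the cleaner screen but not in the noisier one, which is how screen quality alone can move a gene across the cutoff.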
Hope that helps,
Thank you for your answer and for pointing out that larger gene effects are easier to detect in more cell lines.