CRISPR co-dependency top hits obscured by newly added screens

I find that a lot of the CRISPR co-dependency top hits that I’ve been looking at for a while have changed in the new update(?). The top gene CRISPR co-dependencies are now often different, and I find that the new genes are from a set of 45 newly added screens - perhaps genes that were left out of libraries before. However, these co-dependencies make little sense to me compared to the ones derived from >1000 datasets, and so far they only give me more ‘noise’. Is it possible to retrieve gene-specific co-dependency data from an earlier version to exclude these 45 newly added screens, or otherwise work around this?

Best,

Peter

I found the same thing. Just posted a new topic but same question. Which release should I trust?

I’ve found the same thing as well. The gene co-dependencies up until 25Q3 made sense for known co-functions and suggested plausible hypotheses for previously undocumented links between genes that correlated with orthogonal data. Now many of the co-dependencies, even many of the top few, seem completely random.

I know that 2024Q4 was good for me. In one of the following updates, 45 screens were added with genes that had not been screened before. Somehow these genes very, very often come up as the most co-dependent but make little sense.

Pretty sure the problem started with the 25Q3 update – everything was looking consistent through 25Q2 for me.

Hello all,

Apologies for not chiming in on this thread earlier. I had initially misclassified it when I did my original triaging.

Regardless, just wanted to give folks a heads up that this is something that I’m now investigating and will update the thread once I have more information.

Thanks,

Phil

Also, if people have any specific genes they could share as examples, that would be very helpful.

Every release, we look at various global metrics to make sure that the quality of the data release is improving, and we compare the global correlation with the previous release, so I’m fairly confident that, on average, there shouldn’t be a large systematic change between the gene effect scores for 25Q2 and 25Q3. However, every release will have changes, and our global measures of “good” are always at risk of missing negative changes which affect specific sets of genes.

Any specific example genes where this issue arises, especially cases where you have prior biological knowledge that correlates were meaningful before, would be very helpful for our investigation.

Thanks,

Phil

Thanks for looking into this, Phil.

Most genes I query have the issue, I would say. Search for a gene and look at the top 5 co-dependent genes: chances are that if you click a few or all of them, they will be genes that have only been scored in 46-50 screens (which you can see when you go to their gene pages).

I use a browser plug-in (Gene info) that quickly gives me DepMap correlations but uses an older data release, I think (unfortunately I don’t know which version). If I query the gene “INTS6”, it correlates with these other INTS genes:

However, if you go to the updated DepMap (INTS6 DepMap Gene Summary) you’ll see different genes with higher correlation scores. They make no sense, and all of the top 5 have only been assessed in around 50 screens.

Example: INTS6

I see this over and over for many of the genes I look at: the top hits are mainly genes that have been screened only about 50 times, unless there are genes with very high correlation scores (>0.50).

I hope this makes sense

Best,

Peter

Thanks for providing this info.

Our current suspicion is that this is a result of including genes which were only present in one CRISPR library and, as a result, only have values for a smaller number of cell lines. (Previously these were not included in the gene effect score files at all.) Given these genes were measured in fewer cell lines, they’re likely to have more spurious correlations, which would pollute the top correlation lists.

If that’s the case, we’ll likely filter these genes out of these correlation calculations as a short-term solution. If adding such a filter results in the top correlate lists looking more similar to 25Q2’s, I’ll proceed to get that change deployed.
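To illustrate why fewer cell lines would be expected to produce more spurious top hits, here is a minimal simulation sketch. It is not DepMap code: the function name, the 500 noise genes, and the cell line counts (50 and 1000) are assumptions chosen only for illustration. It correlates one random "target" gene against pure noise genes and reports the largest absolute Pearson correlation that arises by chance.

```python
import numpy as np


def max_noise_correlation(n_cell_lines, n_noise_genes=500, seed=0):
    """Largest |Pearson r| between a random target and pure-noise genes.

    Hypothetical helper for illustration only; all data are simulated.
    """
    rng = np.random.default_rng(seed)
    target = rng.standard_normal(n_cell_lines)
    noise = rng.standard_normal((n_cell_lines, n_noise_genes))

    # Standardize, then average the products to get Pearson r per noise gene.
    t = (target - target.mean()) / target.std()
    z = (noise - noise.mean(axis=0)) / noise.std(axis=0)
    r = (t[:, None] * z).mean(axis=0)
    return float(np.abs(r).max())


# Genes measured in few cell lines vs. genes measured in many.
max_r_sparse = max_noise_correlation(n_cell_lines=50)
max_r_dense = max_noise_correlation(n_cell_lines=1000)
print(f"max |r| from noise, 50 cell lines:   {max_r_sparse:.2f}")
print(f"max |r| from noise, 1000 cell lines: {max_r_dense:.2f}")
```

Because the spread of a chance correlation shrinks roughly as one over the square root of the number of cell lines, the noise genes measured in only 50 lines can reach correlation values that would outrank genuine but moderate co-dependencies.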

Thanks,

Phil

I’ve confirmed that there’s little overlap between top correlates as reported by the 25Q2 data and the sample correlation analysis performed on the 25Q3 data. So as you said, this is a very widespread issue.

We’ve decided that, as a short-term solution, we’ll update the pre-computed correlations to exclude genes which have data for < 10% of the cell lines. This results in the removal of ~500 genes which were newly introduced in the 25Q3 dataset when we decided to include gene effect scores for genes that were only assayed by a single library.

I confirmed that the overlap in top correlates between 25Q3 and 25Q2 is much higher after this filtering. After filtering genes, the overlap in top correlates is comparable to what I see if I look at the overlap between the 24Q4 and 25Q2 releases, so it sounds like it should get us back in the vicinity of where we were pre-25Q3. (I also confirmed that the specific example of integrator subunit genes successfully finds other integrator genes as top correlates.)
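For anyone who wants to apply a similar filter to a downloaded gene effect matrix, here is a rough sketch in pandas. It is not the portal's actual pipeline: the function names, the toy data, and the layout (rows are cell lines, columns are genes, missing values are NaN) are all assumptions. The toy data are contrived so that a sparsely measured gene tops the unfiltered list.

```python
import numpy as np
import pandas as pd


def filter_by_coverage(gene_effect, min_fraction=0.10):
    """Keep genes with values in at least min_fraction of the cell lines."""
    coverage = gene_effect.notna().mean(axis=0)  # fraction non-missing per gene
    return gene_effect.loc[:, coverage >= min_fraction]


def top_correlates(gene_effect, gene, n=5):
    """Top n genes by Pearson correlation with `gene` (pairwise-complete)."""
    corr = gene_effect.corrwith(gene_effect[gene])
    return corr.drop(labels=gene).sort_values(ascending=False).head(n)


# Toy matrix: 200 simulated cell lines, 4 hypothetical genes.
rng = np.random.default_rng(0)
n_lines = 200
a = rng.standard_normal(n_lines)
toy = pd.DataFrame(
    {
        "GENE_A": a,
        "GENE_B": a + 0.5 * rng.standard_normal(n_lines),  # real partner
        "GENE_C": rng.standard_normal(n_lines),            # unrelated
        "SPARSE": np.full(n_lines, np.nan),                 # mostly missing
    },
    index=[f"line_{i}" for i in range(n_lines)],
)
# SPARSE has data in only 10 of 200 lines (5%), contrived to match GENE_A.
toy.loc[toy.index[:10], "SPARSE"] = a[:10]

filtered = filter_by_coverage(toy, min_fraction=0.10)
print(top_correlates(toy, "GENE_A", n=2))       # SPARSE ranks first
print(top_correlates(filtered, "GENE_A", n=2))  # GENE_B ranks first
```

The 10% default mirrors the threshold mentioned above; a stricter value can be passed as `min_fraction` if the remaining lists still look noisy.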
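The kind of overlap check described above could look roughly like the sketch below. This is only an illustration under assumptions: the helper names, the choice of ranking by signed correlation, and the small hand-made correlation matrices are hypothetical and do not come from any DepMap release.

```python
import pandas as pd


def top_k(corr, gene, k=5):
    """Set of the k genes most correlated with `gene` in a gene x gene matrix."""
    row = corr.loc[gene].drop(labels=gene).dropna()
    return set(row.nlargest(k).index)


def top_k_overlap(corr_old, corr_new, gene, k=5):
    """Fraction of the old top-k correlates that are still in the new top-k."""
    return len(top_k(corr_old, gene, k) & top_k(corr_new, gene, k)) / k


# Made-up correlation matrix standing in for an older release.
genes = ["G1", "G2", "G3", "G4"]
old = pd.DataFrame(
    [
        [1.0, 0.6, 0.5, 0.1],
        [0.6, 1.0, 0.4, 0.0],
        [0.5, 0.4, 1.0, 0.2],
        [0.1, 0.0, 0.2, 1.0],
    ],
    index=genes,
    columns=genes,
)

# "Newer release": same values plus one sparsely measured gene with
# inflated correlations (values invented for the example).
new = old.copy()
new["SPARSE"] = [0.8, 0.7, 0.1, 0.0]
new.loc["SPARSE"] = [0.8, 0.7, 0.1, 0.0, 1.0]

# Same newer release after removing the sparse gene.
new_filtered = new.drop(index="SPARSE", columns="SPARSE")

overlap_before = top_k_overlap(old, new, "G1", k=2)
overlap_after = top_k_overlap(old, new_filtered, "G1", k=2)
print(overlap_before, overlap_after)
```

Averaging this fraction over many query genes gives a single number per pair of releases, which makes it easy to compare, say, one release pair against another.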

I’ll be proceeding to deploy this change this morning.

Thanks for reporting this and aiding in our understanding of the issue, and again sorry for the delay in getting it looked at.

Thanks,

Phil

I think this problem has reoccurred, albeit less severely, in the 26Q1 release. For genes whose function and prior pre-computed correlations I’m familiar with, there are some new top correlated genes with known but apparently non-related function. All these new correlated genes were analyzed in just a few cell types.

I agree with Ruth that this problem has reappeared. For example, for the gene SYS1 the genes that correlate are all part of a pathway, apart from the fourth-highest rank (PRB3), which is a salivary gland protein of no conceivable relevance. It has only been analysed in 74 cell lines, and it also gives a very strong but implausible correlation with several other genes. Please could these genes from <~500 cell lines be removed, or at least a filter option provided to set the number of cell lines a gene has to have been tested in before a correlation is shown? DepMap is incredibly powerful at predicting functional links, and these genes from small numbers of cell lines obscure this and so make it less valuable. Many thanks.

In the most recent release, we switched to using a p-value threshold for filtering out cases where a low number of cell lines resulted in spurious correlations. However, it sounds like the current cutoff is not stringent enough.

The pre-computed correlations are also used for other parts of the portal where we are showing correlations to drug responses, and some of the drug screens have smaller sample sizes. In the past, we were a little hamstrung about how to filter things because we didn’t have a convenient place where we could change the rules for different datasets. However, we’re in a better place with this present release.

I’m planning to carve out some time today or tomorrow to see if there’s a quick way to raise the threshold for the specific places where we report co-dependency correlations, because it’s clear that people find this information useful but we have too many false positives in what we’re reporting presently.

(I’m also thinking I should surface the sample size/p-values for the co-dependencies reported. We didn’t used to bother storing that information, but I believe we do now, so I suspect reporting that could also help.)
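As a rough sketch of what reporting sample size and p-value alongside each correlation could look like, the snippet below computes both for one gene pair. Everything here is an assumption rather than the portal's method: the function names and the `min_n` and `max_p` thresholds are hypothetical, and the p-value uses the Fisher z-transformation as a normal approximation so that only NumPy, pandas and the standard library are needed (`scipy.stats.pearsonr` would be the usual choice for an exact value).

```python
import math

import numpy as np
import pandas as pd


def correlation_with_support(x, y):
    """Return (r, n, approx_p) over cell lines where both genes have values."""
    paired = pd.concat([x, y], axis=1).dropna()
    n = len(paired)
    if n < 4:  # Fisher z needs n > 3
        return float("nan"), n, float("nan")
    r = float(np.corrcoef(paired.iloc[:, 0], paired.iloc[:, 1])[0, 1])
    r_safe = min(max(r, -0.999999), 0.999999)  # keep atanh finite
    z = math.atanh(r_safe) * math.sqrt(n - 3)
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided, approximate
    return r, n, p


def passes_filter(r, n, p, min_n=500, max_p=1e-6):
    """Hypothetical rule: require enough cell lines AND a small p-value."""
    return n >= min_n and p <= max_p


# Simulated data: 1000 cell lines, one densely and one sparsely measured gene.
rng = np.random.default_rng(1)
lines = [f"line_{i}" for i in range(1000)]
query = pd.Series(rng.standard_normal(1000), index=lines)
dense = query + pd.Series(rng.standard_normal(1000), index=lines)
sparse = pd.Series(np.nan, index=lines)
sparse.iloc[:74] = rng.standard_normal(74)  # values in only 74 lines

r_dense, n_dense, p_dense = correlation_with_support(query, dense)
r_sparse, n_sparse, p_sparse = correlation_with_support(query, sparse)
print(f"dense:  r={r_dense:.2f} n={n_dense} p={p_dense:.2g}")
print(f"sparse: r={r_sparse:.2f} n={n_sparse} p={p_sparse:.2g}")
```

Showing `n` next to each reported correlation would let users judge for themselves whether a top hit rests on 74 cell lines or on more than a thousand, whatever threshold is applied.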

Thanks,

Phil

Many thanks for offering to address this. It will be very useful indeed. Sean