CRISPR co-dependency top hits obscured by newly added screens

Many of the top CRISPR co-dependency hits that I’ve been following for a while appear to have changed in the new update. The top co-dependent genes are now often different, and the new genes come from a set of 45 newly added screens, perhaps genes that were left out of earlier libraries. However, these co-dependencies make little sense to me compared with the ones derived from >1000 datasets, and so far they only give me more ‘noise’. Is it possible to retrieve gene-specific co-dependency data from an earlier version, to exclude these 45 newly added screens, or to otherwise work around this?

Best,

Peter

I found the same thing. Just posted a new topic but same question. Which release should I trust?

I’ve found the same thing as well. The gene co-dependencies up until 25Q3 made sense for known co-functions and suggested plausible hypotheses for previously undocumented links between genes that correlated with orthogonal data. Now many of the co-dependencies, even many of the top few, seem completely random.

I know that 2024Q4 was good for me. In one of the following updates, 45 screens were added with genes that had not been screened before. Somehow these genes very often come up as the most co-dependent but make little sense.

Pretty sure the problem started with the 25Q3 update – everything was looking consistent through 25Q2 for me.

Hello all,

Apologies for not chiming into this thread earlier. I had initially misclassified it when I did my original triaging.

Regardless, just wanted to give folks a heads up that this is something that I’m now investigating and will update the thread once I have more information.

Thanks,

Phil

Also, if people have any specific genes they could share as examples, that would be very helpful.

Every release, we look at various global metrics to make sure that the quality of the data is improving, and we compare the global correlation with the previous release, so I’m fairly confident that, on average, there shouldn’t be a large systematic change between the gene effect scores for 25Q2 and 25Q3. However, every release will have changes, and our global measures of “good” are always at risk of missing negative changes which affect specific sets of genes.

Any specific example genes where this issue arises, especially cases where you have prior biological knowledge that the correlates were meaningful before, would be very helpful for our investigation.

Thanks,

Phil

Thanks for looking into this, Phil.

Most genes I query have the issue, I would say. Search for a gene and look at the top-5 co-dependent genes: chances are that if you click on a few or all of them, they will be genes that have only been scored in 46-50 screens (which you can see when you go to their gene pages).

I use a browser plug-in (Gene info) that quickly gives me DepMap correlations but uses an older data release, I think (unfortunately I don’t know which version). If I query the gene “INTS6”, it correlates with these other INTS genes:

However, if I go to the updated DepMap (INTS6 DepMap Gene Summary), you’ll see different genes with higher correlation scores. They make no sense, though, and all of the top 5 have only been assessed in around 50 screens.

Example: INTS6

I see this over and over for many of the genes I look at: the top hits are mainly genes that have been screened 50 times, unless you have genes with very high correlation scores (>0.50).

I hope this makes sense.

Best,

Peter

Thanks for providing this info.

Our current suspicion is that this is a result of including genes which were only present in one CRISPR library and, as a result, only have values for a smaller number of cell lines. (Previously these genes were not included in the gene effect score files at all.) Given these genes were measured in fewer cell lines, they’re likely to have more spurious correlations, which would pollute the top correlation lists.
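A minimal sketch of why fewer cell lines tends to produce larger spurious correlations, assuming purely random, independent gene effect profiles (simulated values, not real DepMap data). The sizes (50 vs 1000 cell lines, 500 unrelated genes) are illustrative choices loosely based on the numbers mentioned in this thread, and `pearson` and `max_null_correlation` are hypothetical helper names.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_null_correlation(n_lines, n_genes, rng):
    """Largest |r| between one random profile and n_genes unrelated random profiles."""
    target = [rng.gauss(0, 1) for _ in range(n_lines)]
    best = 0.0
    for _ in range(n_genes):
        other = [rng.gauss(0, 1) for _ in range(n_lines)]
        best = max(best, abs(pearson(target, other)))
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumption: a sparsely screened gene is compared on ~50 shared cell lines,
# a broadly screened gene on ~1000.
sparse_max = max_null_correlation(n_lines=50, n_genes=500, rng=rng)
dense_max = max_null_correlation(n_lines=1000, n_genes=500, rng=rng)
print(f"max |r| by chance, 50 lines:   {sparse_max:.2f}")
print(f"max |r| by chance, 1000 lines: {dense_max:.2f}")
```

For independent, roughly normal data, the chance-level spread of a Pearson correlation is about 1/sqrt(n - 1), so genes scored in about 50 lines can reach noticeably higher correlations by chance alone than genes scored in about 1000 lines.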

If that’s the case, we’ll likely filter these genes out of these correlation calculations as a short-term solution. If adding such a filter results in the top correlate lists looking more similar to 25Q2’s, I’ll proceed to get that change deployed.

Thanks,

Phil

I’ve confirmed that there’s little overlap between top correlates as reported by the 25Q2 data and the sample correlation analysis performed on the 25Q3 data. So, as you said, this is a very widespread issue.

We’ve decided that, as a short-term solution, we’ll update the pre-computed correlations to exclude genes which have data for < 10% of the cell lines. This results in the removal of ~500 genes which were newly introduced in the 25Q3 dataset when we decided to include gene effect scores for genes that were only assayed by a single library.
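As a rough illustration of the 10% coverage filter described above, here is a minimal sketch on toy data. It assumes gene effect scores are held as a mapping from gene to a list of per-cell-line values with `None` for missing entries; the gene names, values, and the helpers `coverage` and `filter_by_coverage` are hypothetical and not the actual DepMap pipeline.

```python
def coverage(values):
    """Fraction of cell lines with a non-missing gene effect score."""
    return sum(v is not None for v in values) / len(values)


def filter_by_coverage(gene_effect, min_fraction=0.10):
    """Keep only genes scored in at least min_fraction of the cell lines."""
    return {
        gene: values
        for gene, values in gene_effect.items()
        if coverage(values) >= min_fraction
    }


# Toy matrix with 20 cell lines (made-up values).
gene_effect = {
    "GENE_A": [-0.1 * i for i in range(20)],   # scored in every line
    "GENE_B": [0.05 * i for i in range(20)],   # scored in every line
    "GENE_SPARSE": [0.3] + [None] * 19,        # scored in 1 of 20 lines (5%)
}

kept = filter_by_coverage(gene_effect)
print(sorted(kept))  # ['GENE_A', 'GENE_B']
```

If the gene effect matrix is loaded as a pandas DataFrame with cell lines as rows and genes as columns (an assumption about layout), the same filter could be written as `df.loc[:, df.notna().mean() >= 0.10]`.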

I confirmed that the overlap in top correlates between 25Q3 and 25Q2 is much higher after this filtering. After filtering genes, the overlap in top correlates is comparable to what I see if I look at the overlap between the 24Q4 and 25Q2 releases, so it sounds like it should get us back in the vicinity of where we were pre-25Q3. (I also confirmed that the specific example of integrator subunit genes successfully finds other integrator genes as top correlates.)
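One way such an overlap check between releases could be sketched is shown below. The exact metric used for the comparison above is not stated in this thread, so the top-k set overlap here is an assumption, and the gene names (`G1`, `S1`, ...), correlation values, and helper names are made up for illustration.

```python
def top_k(correlations, k=5):
    """Return the k genes with the highest correlation for one query gene."""
    ranked = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


def top_k_overlap(old, new, k=5):
    """Fraction of the old release's top-k correlates that are also in the new top-k."""
    old_top, new_top = set(top_k(old, k)), set(top_k(new, k))
    return len(old_top & new_top) / k


# Made-up correlations for a single query gene in an older release.
old_release = {"G1": 0.60, "G2": 0.50, "G3": 0.40, "G4": 0.10}

# Newer release: same values plus sparsely screened genes (S1..S3) with high chance correlations.
new_release = dict(old_release, S1=0.70, S2=0.65, S3=0.62)

# Same release after dropping the sparsely screened genes.
sparse_genes = {"S1", "S2", "S3"}
new_filtered = {g: r for g, r in new_release.items() if g not in sparse_genes}

print(top_k_overlap(old_release, new_release, k=3))   # 0.0
print(top_k_overlap(old_release, new_filtered, k=3))  # 1.0
```

Averaging this fraction over many query genes would give a single number per pair of releases that can be compared before and after filtering.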

I’ll be proceeding to deploy this change this morning.

Thanks for reporting this and aiding in our understanding of the issue, and again sorry for the delay in getting it looked at.

Thanks,

Phil