First of all, I’d like to thank everyone involved in this project for the incredibly useful resource you have provided to the community.
I am interested in carrying out some bespoke analyses of Common Essential genes (e.g. by organelle, by cell lineage etc.) I am however finding it difficult to filter Common Essential genes, as I can’t seem to identify lists of genes that match up. For example, a subset of genes from the Common Essential list (from any of the studies) don’t appear on the Gene Effect list (“CRISPR (DepMap 21Q3 Public+Score, Chronos)”).
I appreciate that much of the numerical data is derived, and the effect scores are dependent on many factors, including post-hoc corrections for various parameters. However, as a general rule to simplify my analysis of whichever gene list I download, would it be sensible to:
Calculate a mean effect score across all cell lines
Use -0.5 as a cut off for common essential/non common essential
Apologies if the above isn’t very clear, and many thanks in advance!
Hello! Apologies for the confusion, there are multiple common essentials lists in our dataset.
The common_essentials file contains the set of genes used as positive controls, aggregated from two orthogonal knockout studies (Blomen et al. 2015 and Hart et al. 2015). There are a few (~10) genes in this list which do not appear in the gene effect matrices.
Separate from this list, we provide two other common essentials lists: Achilles_common_essentials and CRISPR_common_essentials. These lists are derived from the corresponding Achilles and CRISPR gene effect matrices (ie. without the use of the control essentials list), as described in the Identifying common dependencies section of this preprint: https://www.biorxiv.org/content/10.1101/720243v1.full.
Note that the CRISPR_gene_effect matrix (equivalent to “CRISPR (DepMap 21Q3 Public+Score, Chronos)”) represents data integrated from Achilles and SCORE, meaning it contains only genes which overlap between the two projects. If you were using the common_essentials or Achilles_common_essentials files as your gene lists, that would explain why some genes didn’t match up. Instead, I would suggest using the CRISPR_common_essentials file from the downloads page as your list of common essentials.
Please let me know if I can further clarify anything.
Hello @iboyle , thank you for your detailed answer.
About the method used to identify the common essential genes in the publication you have posted, is the code for implementing this method publicly available?
Hi Rohan, here is the code we use in our pipeline to call essentials.
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
def find_essentials(gene_effect, percentile=.9, max_thresh=.3):
'''
Find a list of essential genes based on consistently high dependency score in large number of cell lines.
Parameters:
gene_effect (matrix) - genes as columns
percentile (float) - fraction of cell lines a given gene must exceed dependency threshold
for in order to be considered panessential
max_thresh (float) - threshold for calling essentials
Returns:
dependency threshold value,
list of essential genes
'''
#rank genes by percentile in each line
ranks = gene_effect.rank(axis=1, pct=True)
#rank gene ranks by percentile in each gene
ranks_lines = ranks.rank(axis=0, pct=True)
#find min rank for lines with rank above percentile threshold
gene_scores = ranks[ranks_lines > percentile].min(axis=0)
#fit a KDE
kde = gaussian_kde(gene_scores.dropna(), bw_method=.1)
test_points = np.arange(.1, .8, .001)
#get natural cutoff suggested by the data
threshold = test_points[np.argmin(kde(test_points))]
if threshold > max_thresh:
raise Exception(
'found threshold for calling essential genes too high at {}'.format(threshold)
)
#select essentials
essentials = pd.Series(gene_scores.index[gene_scores < threshold], name="essentials")
#check mean
ess_mean = gene_effect[essentials].mean().mean()
if ess_mean > -.5:
raise Exception(
'the mean of all {} found essential genes'.format(len(essentials)) +
'({}) is higher than expected'.format(ess_mean)
)
return threshold, essentials
Hello again @iboyle, I was able to obtain the common essentials. Thank you very much for sharing the code.
There were a couple of things in the code that I was not very sure about. Both of them are related to the max_thresh parameter. Clarification on these would only help me in making sure that I am executing the code correctly.
Could you please let me know (1) why the default value for the max_thresh parameter is set at 0.3 and (2) what should the user do in the case of ‘found threshold for calling essential genes too high at {max_thresh}’ error?
Thank you.
Hi Rohan, the preprint I shared describes setting the threshold for essentials in a data-driven manner. If you look at Fig 12b, there is a clear bimodality in the distribution of gene ranks in Achilles, with the threshold dynamically selected as the minimum density between these peaks. If the threshold is greater than 0.30, our pipeline raises an exception as this indicates more essentials are being called than we typically expect to see in the Achilles dataset. If you are getting this error, I would suggest generating a plot similar to Fig12b and visually examining your distribution. Perhaps the dataset you are using contains only a small number of lines or is biased towards a specific cancer type, meaning more genes appear as commonly essential across your dataset?
Hi @iboyle ,
Thank you very much for explaining the rationale behind the default value of the max_thresh parameter. That’s basically what I wanted to know. Also thank you for the suggestion to check how the distribution looks like. That was very also very helpful.
Here’s few lines of code I used to plot the distribution:
## Distribution of gene scores in their 90-percentile least depleted lines. (Fig. 12b)
fig,ax=plt.subplots()
xs = np.arange(0, 1, .001)
ax.plot(xs,kde(xs))
ax.axvline(threshold,linestyle=":",color='red')
ax.text(s='threshold\nfor\ncalling\nessentials',x=threshold,y=1,color='red')
ax.set(xlabel='rank of gene in 90th percentile\nleast depleted cell line',
ylabel='density')
If inserted above the return line of the find_essentials, it should make a distribution plot similar to the Fig 12b of the preprint. I hope it could be of help for somebody.