CRISPR Pipeline and Analysis

The DepMap CRISPR Knockout pipeline, Achilles, generates large-scale, high-quality screening data for identifying genetic dependencies across a wide range of human cancer cell lines. These screens allow researchers to pinpoint genes whose knockout leads to growth inhibition or cell death, providing insight into cancer vulnerabilities.

DepMap screening data can be explored using Data Explorer 2, or downloaded from the following datasets:

Screen Types

  • Standard genome-wide knockout (KO)
  • RNAi

Screening Libraries

Pooled screens are completed in technical replicates at 350x-1000x coverage of the library. Additional information about other metrics can be found in ScreenSequenceMap.csv.

Experimental quality controls

  • Cell line identity, or the fingerprint, is verified before, during, and after screening using STR analysis.
  • Loss-of-function negative selection screens are run for 14-21 days.
  • Library DNA quality and evenness are validated by guide fractional rank curves.
  • Screened cell lines generally have >70% Cas9 or Cas12a activity, except in rare cases.*
  • Library transduction is performed at an infection efficiency of ~45%. This keeps the multiplicity of infection at ~1 while allowing the screens to remain size-efficient.

*Exceptions are made only when the Cas9 or Cas12a activity is slightly below 70% and the model is a rare cancer cell line with a good doubling time.
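The relationship between the ~45% infection efficiency quoted above and the number of integrants per cell can be illustrated with a short calculation. This is a minimal sketch that assumes viral integration follows a simple Poisson model; that model, and the function names poisson_moi and single_integrant_fraction, are illustrative assumptions and are not taken from the DepMap protocol.

```python
import math


def poisson_moi(infection_efficiency: float) -> float:
    """Mean integrants per cell implied by the infected fraction.

    Assumes (hypothetically) Poisson-distributed integration, so that
    P(cell infected) = 1 - exp(-MOI).
    """
    if not 0.0 < infection_efficiency < 1.0:
        raise ValueError("infection_efficiency must be strictly between 0 and 1")
    return -math.log(1.0 - infection_efficiency)


def single_integrant_fraction(moi: float) -> float:
    """Among infected cells, the fraction carrying exactly one integrant."""
    p_exactly_one = moi * math.exp(-moi)
    p_at_least_one = 1.0 - math.exp(-moi)
    return p_exactly_one / p_at_least_one


moi = poisson_moi(0.45)
single = single_integrant_fraction(moi)
print(f"Poisson MOI at 45% infection: {moi:.2f}")
print(f"Infected cells with a single integrant: {single:.0%}")
```

Under this assumption, roughly three quarters of transduced cells carry a single guide, which is one way to read the statement that most cells receive about one integrant at this infection efficiency.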

Chronos

Below are some resources for Chronos modeling and training details as well as benchmarking against similar tools such as MAGeCK and BAGEL:

Chronos Algorithm Methods
Methods | Slidedeck | Blog Post

Pipeline

This Achilles dataset contains the results of genome-scale CRISPR knockout screens for Achilles (using Avana Cas9 and Humagne-CD Cas12 libraries) and Achilles combined with Sanger’s Project SCORE (KY Cas9 library) screens.

The dataset was processed using the following steps:

  • Sum raw readcounts by SequenceID and collapse pDNAs by median (removing sequences with fingerprinting issues, total reads below 500,000, or fewer than 25 mean reads per guide)
  • Normalize readcounts such that pDNA sequences have the same mode, and all other sequences have median of nonessentials that matches the median of nonessentials in the corresponding pDNA batch
  • Remove intergenic/non-targeting controls, Cas9 guides with multiple alignments to a gene, guides with >5 genomic alignments, and sgRNAs with inconsistent signal (see DropReason of guide maps).
  • Set to NA any sgRNAs with pDNA counts less than one millionth of the pDNA pool
  • Calculate the mean reads per guide for each sequence. Sequences with more than 185 mean reads per guide are considered passing.
  • Compute the log2-fold-change (LFC) from pDNA counts for each sequence. Collapse to a gene-level by taking the median of guides.
  • Calculate the Null-Normalized Median Difference (NNMD) from the LFC using the following equation: (median(essentials) - median(nonessentials)) / MAD(nonessentials). Sequences must be below a threshold of -1.25 to be considered passing.
  • Calculate the fraction of reads originating from the other library than what is annotated. The fraction of reads must be below a threshold of 0.1 to be considered passing.
  • Calculate residual LFC for each library and replicate by creating a linear model of the replicate’s LFC as a function of the mean LFC across replicates. Only genes that show high variance in gene effect across screens and libraries are considered when calculating residual LFC.
  • Using the residual LFC, remove sequences that do not have a Pearson coefficient > 0.19 with at least one other replicate sequence from the same screen. Also remove sequences that have a Pearson coefficient > 0.8 with any sequence that is not part of the same screen.
  • Calculate the NNMD for each screen after averaging passing sequences. Screen must be below a threshold of -1.25 to be considered passing.
  • NA out readcounts for pDNAs which correspond with a SNP in any guide in a vector for a given cell line (to correct ancestry bias), then recompute LFC.
  • Compute the naive gene score by collapsing the passing LFC data to a screen x gene matrix.
  • Prior to running Chronos, NA apparent outgrowths in readcounts.
  • Run Chronos per library-screen type to generate screen-level gene effect scores, apply copy number correction, then scale such that the median of common essentials is at -1.0 and the median of nonessentials is at 0, and correct for screen quality. To correct proximity bias, the median gene effect of each chromosome arm is aligned to be the same across all screens. See Vinceti et al. 2024, "A benchmark of computational methods for correcting biases of established and unknown origin in CRISPR-Cas9 screening data", Genome Biology, for more details on this correction. This produces the integrated CRISPRGeneEffect matrix using Chronos’s innate batch correction. Concatenate gene effects from all libraries into a single ScreenGeneEffect matrix.
  • Run Chronos jointly on all library-screen types to generate model-level gene effect scores, apply copy number correction, scale, and correct for screen quality as described above. To correct proximity bias, the median gene effect of each chromosome arm is aligned to be the same across all screens. See Vinceti et al. 2024, "A benchmark of computational methods for correcting biases of established and unknown origin in CRISPR-Cas9 screening data", Genome Biology, for more details on this correction. This produces the integrated CRISPRGeneEffect matrix using Chronos’s innate batch correction.
  • Using the CRISPRGeneEffect, identify pan-dependent genes as those for which 90% of cell lines rank the gene above a given dependency cutoff. The cutoff is determined from the central minimum in a histogram of gene ranks in their 90th percentile least dependent line.
  • For each Chronos gene effect score from both the ScreenGeneEffect and CRISPRGeneEffect, infer the probability that the score represents a true dependency. This is done using an EM step until convergence independently in each screen/model. The dependent distribution is given by the list of essential genes. The null distribution is determined from unexpressed gene scores in those cell lines that have expression data available, and from the nonessential gene list in the remainder.
    The essential and nonessential controls used throughout the analysis are the Hart reference nonessentials and the intersection of the Hart and Blomen essentials. See Hart et al., Mol. Syst. Biol, 2014 and Blomen et al., Science, 2015. They are provided with this release as AchillesCommonEssentialControls.csv and AchillesNonessentialControls.csv.
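The LFC and NNMD steps in the list above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the Achilles implementation: the reads-per-million normalization and the pseudocount of 1 stand in for the mode and nonessential-median normalization described above, MAD is taken as the unscaled median absolute deviation, and the function names and toy gene names are hypothetical. Only the -1.25 threshold and the NNMD formula come from the text.

```python
import numpy as np
import pandas as pd

NNMD_THRESHOLD = -1.25  # passing threshold quoted in the pipeline description


def log2_fold_change(counts: pd.Series, pdna: pd.Series,
                     pseudocount: float = 1.0) -> pd.Series:
    """Guide-level log2 fold change of a sequence relative to its pDNA.

    Assumption: simple reads-per-million scaling plus a pseudocount,
    used here in place of the pipeline's own normalization.
    """
    counts_rpm = counts / counts.sum() * 1e6
    pdna_rpm = pdna / pdna.sum() * 1e6
    return np.log2((counts_rpm + pseudocount) / (pdna_rpm + pseudocount))


def gene_level_lfc(guide_lfc: pd.Series, guide_to_gene: pd.Series) -> pd.Series:
    """Collapse guide LFC to gene level by taking the median of guides."""
    return guide_lfc.groupby(guide_to_gene).median()


def nnmd(gene_lfc: pd.Series, essentials, nonessentials) -> float:
    """(median(essentials) - median(nonessentials)) / MAD(nonessentials)."""
    ess = gene_lfc.reindex(essentials).dropna()
    non = gene_lfc.reindex(nonessentials).dropna()
    mad = (non - non.median()).abs().median()  # unscaled MAD (assumption)
    return float((ess.median() - non.median()) / mad)


# Toy example with made-up gene-level LFC values.
toy_lfc = pd.Series(
    {"ESS1": -2.0, "ESS2": -3.0, "NON1": -0.1, "NON2": 0.0, "NON3": 0.2}
)
toy_nnmd = nnmd(toy_lfc, ["ESS1", "ESS2"], ["NON1", "NON2", "NON3"])
toy_passes = toy_nnmd < NNMD_THRESHOLD
print(f"NNMD = {toy_nnmd:.2f}, passing = {toy_passes}")
```

A more negative NNMD means the essential controls are better separated from the nonessential controls, which is why passing sequences and screens must fall below the threshold rather than above it.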