DepMap is excited to announce several important updates and pipeline improvements in this release.
DepMap has begun moving towards more complex experiments, new libraries, and nontraditional models. To facilitate these exciting new data types, we have had to update our data design. Previously, all DepMap data assumed that a given cell line only appeared once in a given file. However, we realize that DepMap users may want to compare the same cell model screened in different conditions - for example, as a 2D line versus an organoid, or in one library versus another. Because this has led to a large number of significant changes in this release, we strongly recommend that anyone who has previously used DepMap data consult the README for this dataset before attempting to load the data.
First, to accommodate richer types of data, we are introducing a multi-level data hierarchy:
- At the top of the hierarchy is the Patient
- Each patient can have one or more derived Models: a collection of cells derived from a single biopsy of the patient
- Each model can be cultured in one or more conditions; the unique combination of a model with a condition is a ModelCondition
- Each ModelCondition can undergo a CRISPR Screen or be sequenced to produce an Omics Profile
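The hierarchy above can be pictured as nested records. The following is only an illustrative sketch: the class names mirror the terms used above, but every field name and the helper screens_for_model are hypothetical and do not describe the actual DepMap schema or file layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Screen:
    screen_id: str   # hypothetical field name
    library: str     # e.g. "Avana", "KY" or "Humagne"


@dataclass
class OmicsProfile:
    profile_id: str  # hypothetical field name
    datatype: str    # hypothetical label, e.g. "rna" or "wgs"


@dataclass
class ModelCondition:
    condition: str   # e.g. "2D" or "organoid"
    screens: List[Screen] = field(default_factory=list)
    profiles: List[OmicsProfile] = field(default_factory=list)


@dataclass
class Model:
    model_id: str    # ACH-XXXXXX style identifier
    conditions: List[ModelCondition] = field(default_factory=list)


@dataclass
class Patient:
    patient_id: str
    models: List[Model] = field(default_factory=list)


def screens_for_model(model: Model) -> List[Screen]:
    """Collect every screen across all conditions of one model."""
    return [s for mc in model.conditions for s in mc.screens]


# Made-up example: one model screened as a 2D line and as an organoid.
example_model = Model(
    "ACH-XXXXXX",
    conditions=[
        ModelCondition("2D", screens=[Screen("screen-1", "Avana")]),
        ModelCondition("organoid", screens=[Screen("screen-2", "KY")]),
    ],
)
example_patient = Patient("patient-1", models=[example_model])
example_screens = screens_for_model(example_model)
```

Grouping screens under a ModelCondition rather than directly under a Model is what allows the same model to appear more than once in Screen-level files.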
We will be providing data principally at two levels: Models and Screens/Profiles. The Model level data will be very similar to what you have seen in the past, albeit with different file names in some cases. For example, CRISPRGeneEffects is indexed by ModelID (ACH-XXXXXX) and will have the same format as in previous releases. OmicsExpressionProteinCodingGenesTPMLogp1.csv is equivalent to the previous CCLE_expression.csv file. These Model-level matrices represent a consensus estimate of the data for the model.
CRISPR data is now produced by running Chronos jointly on all included Screens for a Model in different conditions. To integrate these new data types, Chronos has been updated. It now has a built-in estimator of excess noise (overdispersion relative to Poisson) in CRISPR readcount data and the ability to estimate and remove library batch effects. Additionally, we are working on adding the ability to process individual screens with Chronos using the parameters learned by training Chronos on the full DepMap dataset. This requires that a screen be in one of the DepMap libraries (currently Avana, KY and Humagne). We are considering adding more libraries to the core model.
Other notable new CRISPR files include NaiveScreenEffect, which is the result of taking the median of log fold-changes for all guides targeting a gene and the mean of all replicates for the screen. This may be useful as an unprocessed estimate of gene knockout effect.
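As a rough illustration of the NaiveScreenEffect calculation described above, the sketch below uses pandas. It assumes a guides-by-replicates matrix of log fold-changes and takes the guide median first and the replicate mean second; the order of the two steps, the function name and the input layout are assumptions, not a description of the actual pipeline.

```python
import pandas as pd


def naive_screen_effect(lfc: pd.DataFrame,
                        guide_to_gene: pd.Series,
                        replicate_to_screen: pd.Series) -> pd.DataFrame:
    """lfc: guides (rows) x replicates (columns) of log fold-changes.

    Returns a genes (rows) x screens (columns) matrix.
    """
    # Median across all guides targeting each gene, within each replicate.
    gene_by_replicate = lfc.groupby(guide_to_gene).median()
    # Mean across the replicates that belong to each screen.
    return gene_by_replicate.T.groupby(replicate_to_screen).mean().T


# Made-up example data: five guides, two replicates of one screen.
lfc = pd.DataFrame(
    {"rep_a": [-1.0, -2.0, -3.0, 0.0, 0.2],
     "rep_b": [-2.0, -2.0, -5.0, 0.1, 0.3]},
    index=["g1", "g2", "g3", "g4", "g5"],
)
guide_to_gene = pd.Series(
    ["GENE1", "GENE1", "GENE1", "GENE2", "GENE2"], index=lfc.index
)
replicate_to_screen = pd.Series(
    ["screen_1", "screen_1"], index=["rep_a", "rep_b"]
)
naive = naive_screen_effect(lfc, guide_to_gene, replicate_to_screen)
```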
For omics, the most robust profile of a condition best representing the Model’s basal state is chosen for Model level data, and the profile best representing the ModelCondition is chosen for CRISPR data processing. A mapping of these choices is provided in OmicsDefaultModelProfiles.csv and OmicsDefaultModelConditionProfiles.csv.
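One way such a mapping file could be used is to pick the default profile for each model out of a profile-level matrix and re-index it by model. The sketch below is hypothetical: the column names "ProfileID" and "ModelID", the function name and the example identifiers are assumptions, so check the README for the actual layout of OmicsDefaultModelProfiles.csv.

```python
import pandas as pd


def to_model_level(profile_matrix: pd.DataFrame,
                   default_profiles: pd.DataFrame) -> pd.DataFrame:
    """Keep only default profiles and re-index the rows by model.

    profile_matrix: rows indexed by profile identifier.
    default_profiles: one row per model, with assumed columns
    "ProfileID" and "ModelID".
    """
    mapping = default_profiles.set_index("ProfileID")["ModelID"].to_dict()
    chosen = profile_matrix.loc[profile_matrix.index.isin(list(mapping))]
    return chosen.rename(index=mapping).rename_axis("ModelID")


# Made-up example: three profiles, two of which are defaults.
profile_matrix = pd.DataFrame(
    {"GENE1": [1.0, 2.0, 3.0]}, index=["PR-aaa", "PR-bbb", "PR-ccc"]
)
default_profiles = pd.DataFrame(
    {"ProfileID": ["PR-aaa", "PR-ccc"], "ModelID": ["ACH-X1", "ACH-X2"]}
)
model_matrix = to_model_level(profile_matrix, default_profiles)
```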
Other important changes you will notice include the replacement of the Primary Disease, Lineage, Subtype and Sub-subtype fields with standardized Oncotree terms in 3 tiers: OncotreeLineage, OncotreePrimaryDisease and OncotreeSubtype. In addition, we have added the OncoTree code for the finest-grained categorization we have for each model in the column “OncotreeCode.” Should you need it, a supplemental file mapping the old lineages to the new lineages is available on the forum.
We recognize that changing data release formats causes difficulties for our users. However, these changes will provide a more robust framework for the new types of data DepMap will be generating, and will reduce the need to introduce major changes in future releases. To consolidate “breaking changes,” we have also renamed files and columns to follow more consistent conventions.
Achilles ancestry bias correction updates
DepMap will be releasing binary mutation matrices for locations in the guide libraries (KY, Avana and Humagne) used for Achilles’ ancestry bias correction. This correction sets to NA the readcounts of any construct containing a guide that aligns with a SNP in a given cell line.
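A minimal sketch of this kind of masking is shown below. It assumes both the readcounts and the binary mutation matrix are laid out as guides (rows) by cell lines (columns), with 1 meaning the guide overlaps a SNP in that line; the orientation, the function name and the example values are assumptions rather than the released file format.

```python
import numpy as np
import pandas as pd


def mask_snp_guides(readcounts: pd.DataFrame,
                    guide_mutations: pd.DataFrame) -> pd.DataFrame:
    """Replace readcounts with NaN wherever the binary matrix is 1."""
    # Align the mutation matrix to the readcount matrix; guides or
    # lines missing from the mutation matrix are treated as unaffected.
    aligned = guide_mutations.reindex(
        index=readcounts.index, columns=readcounts.columns, fill_value=0
    )
    return readcounts.astype(float).mask(aligned.astype(bool))


# Made-up example: guide g1 overlaps a SNP in line_A only.
readcounts = pd.DataFrame(
    {"line_A": [120, 300], "line_B": [250, 90]}, index=["g1", "g2"]
)
guide_mutations = pd.DataFrame({"line_A": [1]}, index=["g1"])
masked = mask_snp_guides(readcounts, guide_mutations)
```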
Changes to Achilles QC
To accommodate the new integrated Chronos run across libraries, the following quality control changes have been introduced this release:
- SequenceMeanReadsPerGuide: we now filter sequences with low readcounts by taking the mean reads per guide rather than the total reads, as the number of reads scales with the library size. The new threshold is 185 reads, which is equivalent to the previous threshold of 15 million total reads used for Avana sequences.
- SequenceNNMD/ScreenNNMD: the calculation of NNMD has been updated to be more robust and is now calculated as (median(essentials) - median(nonessentials)) / MAD(nonessentials). The new threshold is -1.25, which leads previously passing screen SC-000498.AV01 to fail.
- SequenceMaxCorr: we have updated the list of high variance genes used to compute sequence correlation to be the union of genes which are high variance within each individual library (see AchillesHighVarianceGeneControls). The new threshold used for sequence correlation is 0.41, which roughly preserves the percentile of failing screens. As a result:
- The following previously failing screens now pass due to these changes: SC-000090.AV02, SC-000252.AV01, SC-000396.AV01, SC-000479.AV01, SC-000609.AV01, SC-000920.AV01, SC-001119.AV01, SC-001270.AV01, SC-001367.AV01, SC-001392.AV01, SC-001497.AV01
- The following previously passing screens now fail due to these changes: SC-000096.AV01, SC-000167.AV01, SC-000172.AV01, SC-000249.AV01, SC-000583.AV01, SC-000814.AV01, SC-001332.AV01, SC-001484.AV01, SC-001485.AV01, SC-002004.AV01, SC-002055.AV01
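The mean-reads-per-guide and NNMD metrics described above can be sketched with NumPy as follows. The thresholds come from the text; the direction of each comparison (pass when mean reads are at least 185 and NNMD is at most -1.25), the use of an unscaled MAD, and the function names are assumptions.

```python
import numpy as np

MIN_MEAN_READS_PER_GUIDE = 185  # threshold stated above
MAX_NNMD = -1.25                # threshold stated above


def mean_reads_per_guide(readcounts) -> float:
    """Mean readcount over all guides of one sequence."""
    return float(np.mean(readcounts))


def nnmd(essential_lfc, nonessential_lfc) -> float:
    """(median(essentials) - median(nonessentials)) / MAD(nonessentials)."""
    essential = np.asarray(essential_lfc, dtype=float)
    nonessential = np.asarray(nonessential_lfc, dtype=float)
    # Unscaled median absolute deviation (an assumption; some tools
    # multiply the MAD by a consistency constant).
    mad = np.median(np.abs(nonessential - np.median(nonessential)))
    return float((np.median(essential) - np.median(nonessential)) / mad)


def passes_qc(readcounts, essential_lfc, nonessential_lfc) -> bool:
    """Assumed pass rule: enough reads and a sufficiently negative NNMD."""
    return (mean_reads_per_guide(readcounts) >= MIN_MEAN_READS_PER_GUIDE
            and nnmd(essential_lfc, nonessential_lfc) <= MAX_NNMD)


# Made-up example values.
example_nnmd = nnmd([-3.0, -2.0, -1.0], [-1.0, 0.0, 1.0])
```

A more negative NNMD indicates better separation between essential and nonessential controls, which is why the assumed pass rule treats -1.25 as an upper bound.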
The full set of metrics computed for sequence-level and screen-level QC can be found in the AchillesSequenceQCReport and the AchillesScreenQCReport, respectively. See the README for more details.
Misidentified cell lines
We have identified a cell line with apparent contamination, ACH-002741. All previously released data from this line have been removed.
During the Oncotree coding, we also noticed a few misclassified cell lines. These lines remain in the data but now have a new lineage annotation which can be found in the mapping file.
Omics Changes
DepMap has made several major changes to the Omics pipeline that are important to note.
Removal of legacy data
For a number of models, DepMap has relied on a mix of old technologies such as SNP-based copy number calling and hybrid capture/raindance-based mutation calling. We call this mix “legacy data.” DepMap is moving towards removing legacy data and instead profiling these lines with whole genome sequencing. In this initial phase, all legacy data have been removed from the Mutations file. However, all legacy data are still preserved for Copy Number.
Mutation calling pipeline changes
DepMap has updated the mutation calling pipeline to use Mutect2, which is simpler, more reproducible, more maintainable and provides better annotation compared to our previous pipeline. For more details, please review the mutation pipeline documentation here. Note that this significantly changes annotations available in the MAF.
Mutation matrices changes
In this release, we have removed “non-conserving” and “other conserving” matrices and changed the definition of “hotspot.” Details on these changes can be found in the mutation document here.
Whole Genome Sequencing quality control changes
DepMap has removed segments on the Y chromosome for lines that have more than 150 segments on the Y chromosome. After this filter, lines that have more than 1500 total segments are marked as QC failures.
- ACH-002981’s WGS profile is now considered a QC failure and removed from Copy Number and Mutation data.
- ACH-001955, ACH-001956, ACH-001957 and ACH-000116 have been dropped from the Copy Number and Mutation data due to low quality.
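The two segment-count rules described in this section could be expressed as the sketch below, which assumes a segment table with one row per segment. The column names "ModelID" and "Chromosome", the chromosome labels and the function name are hypothetical; only the 150 and 1500 thresholds come from the text.

```python
import pandas as pd

MAX_Y_SEGMENTS = 150       # threshold stated above
MAX_TOTAL_SEGMENTS = 1500  # threshold stated above


def filter_and_flag(segments: pd.DataFrame):
    """Drop noisy Y-chromosome segments, then flag over-segmented lines.

    Returns (kept_segments, sorted list of failing ModelIDs).
    """
    is_y = segments["Chromosome"].isin(["Y", "chrY"])
    # Lines with more than MAX_Y_SEGMENTS segments on the Y chromosome.
    y_counts = segments.loc[is_y].groupby("ModelID").size()
    noisy_y_lines = y_counts[y_counts > MAX_Y_SEGMENTS].index
    # Remove the Y segments of those lines only.
    kept = segments.loc[~(is_y & segments["ModelID"].isin(noisy_y_lines))]
    # After the Y filter, flag lines with too many total segments.
    totals = kept.groupby("ModelID").size()
    failed = sorted(totals[totals > MAX_TOTAL_SEGMENTS].index)
    return kept, failed


# Made-up example: LINE_A has a noisy Y chromosome, LINE_B is over-segmented.
rows = (
    [("LINE_A", "chrY")] * 151 + [("LINE_A", "chr1")] * 10
    + [("LINE_B", "chrY")] * 5 + [("LINE_B", "chr1")] * 1600
)
example_segments = pd.DataFrame(rows, columns=["ModelID", "Chromosome"])
kept_segments, failed_lines = filter_and_flag(example_segments)
```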
Copy Number data changes
DepMap data no longer extends segments into cytobands, as the extended segments were not reflective of the real copy number. The extension was previously performed by assuming that the cytoband had the same copy number as the last called segment.
Expression data changes
DepMap has rebuilt the STAR and RSEM indices with a reference genome that includes ERCC spike-in annotations and excludes ALT, HLA, and decoy contigs (gs://gtex-resources/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.dict, gs://gtex-resources/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta, gs://gtex-resources/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta.fai). For more details, please refer to GTEx’s RNAseq pipeline documentation.
Gene set enrichment
Now included as part of our release are gene set enrichment scores computed for each gene set in MSigDB v7. These were computed using single sample gene set enrichment analysis applied to the z-scores of the gene-level expression data. Details about the R script used to run the analysis can be found in the genepy/rna directory of the broadinstitute/genepy repository on GitHub (master branch).