[IMPORTANT UPDATE] Issues with DepMap 20Q4 data

We have discovered issues with files in the original 20Q4 data release that have now been addressed on the public DepMap portal. Updates have been made to the ‘DepMap Public 20Q4’ dataset and are now available in the ‘DepMap Public 20Q4 v2’ dataset (as of Dec. 22, 2020). There were two separate sets of issues, described in more detail below.

The first set of issues affected the 20Q4 copy number data, as well as the CRISPR (Avana) CERES data, which is dependent on copy number information. In the ‘DepMap Public 20Q4 v2’ release we have reverted the affected files to their previous versions to remove any potentially erroneous data. The 21Q1 release in February will include the corrected copy number and CRISPR data.

The second set of issues affected some of the gene expression files. These issues were independent of the problems with the copy number pipeline described above. Hence, we corrected these files in the ‘DepMap Public 20Q4 v2’ release, rather than reverting them to their previous versions.

Thank you to those in the community who alerted us to these issues. We apologize for these mistakes and the impacts they may have had on your research.

Issues with the original 20Q4 copy number files:

  1. The gene-level copy number calculation occasionally skipped genes and erroneously produced zero values. This affected 5% of genes.
  2. For around 800 lines, the gene-level copy number values were inadvertently log-transformed twice rather than once.
  3. The XY panel of normals (PoN) was used instead of the XX PoN in order to incorporate Y chromosome copy numbers. As a result, the X and Y chromosome values in the segment copy number data were double what they should have been.
  4. For 344 cell lines we changed our pipelines to use available WGS data (rather than WES or SNP-array data) to estimate copy number profiles. We have observed that there are certain genomic regions in these resulting copy number profiles that are highly variable, and likely reflect technical bias. We are looking into these differences more closely and will update this post with more information when we have it.
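For anyone who wants to check whether a copy of the gene-level file they already downloaded shows the zero-value problem from item 1, here is a minimal sketch. It assumes the matrix has already been loaded into a nested dict of the form `{cell_line: {gene: value}}`. The function name and the toy values are hypothetical and are not part of any DepMap tooling. A zero is not necessarily an error (a real deep deletion could also produce one), so treat genes flagged here as candidates for a closer look, not as confirmed bugs.

```python
def zero_fraction_by_gene(matrix):
    """Return {gene: fraction of cell lines whose value is exactly 0}.

    `matrix` is assumed to look like {cell_line: {gene: value}}.
    """
    zero_counts = {}
    totals = {}
    for genes in matrix.values():
        for gene, value in genes.items():
            totals[gene] = totals.get(gene, 0) + 1
            if value == 0:
                zero_counts[gene] = zero_counts.get(gene, 0) + 1
    return {gene: zero_counts.get(gene, 0) / totals[gene] for gene in totals}


# Toy values for illustration only, not real DepMap data.
toy = {
    "LINE_A": {"GENE1": 0.0, "GENE2": 1.0},
    "LINE_B": {"GENE1": 0.0, "GENE2": 0.9},
}
fractions = zero_fraction_by_gene(toy)
suspicious = sorted(g for g, frac in fractions.items() if frac == 1.0)
print(suspicious)
```

A gene that is zero in every cell line is the most obvious candidate to compare against the previous release.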

Issues with the original 20Q4 gene expression files:

For the files CCLE_expression_full and CCLE_RNAseq_reads the underlying datasets (representing TPM expression and expected counts respectively) were swapped.

For the file CCLE_RNAseq_transcripts, data for the cell line ACH-000561 was incorrectly normalized.

Note that, starting in 20Q4, the expression datasets use log(x+1) transformation for all TPM values (including now the CCLE_RNAseq_transcripts file), and no transformation for expected counts. See the README for more details.
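As a small illustration of the log(x+1) transformation mentioned above, the sketch below applies and inverts it for a single TPM value. The post does not state the base of the logarithm, so base 2 is an assumption here (please check the README for the actual value), and the function names are hypothetical.

```python
import math


def log_tpm(tpm, base=2.0):
    """Forward transform: log_base(tpm + 1). Base 2 is an assumption."""
    if tpm < 0:
        raise ValueError("TPM must be non-negative")
    return math.log(tpm + 1.0, base)


def unlog_tpm(value, base=2.0):
    """Inverse transform: recover TPM from a log_base(tpm + 1) value."""
    return base ** value - 1.0


# Round trip on a toy value: the recovered TPM should match the input.
original = 3.0
recovered = unlog_tpm(log_tpm(original))
print(abs(recovered - original) < 1e-9)
```

Expected counts are not transformed, so this only applies to the TPM files.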

We have also identified issues with the 20Q4 RNA-Seq data for the ACH-001321 cell line (TT_THYROID) and recommend against using these data in analyses. Other data for this cell line are not affected. We are working to resolve this for the 21Q1 release. The 20Q3 release contains the correct ACH-001321 RNA-Seq data.

---------------Update following 21Q1 data release [Feb. 5 2021]---------------
The following changes were made in the 21Q1 data release to address issues with the initial 20Q4 copy number data. Note that this only refers to the initial 20Q4 release (before files were reverted).

Copy number data inferred from WGS:

  • Resolved multimapping issues with the WGS realignments to hg38 (that were causing noisy copy number estimates) by including the ALT mapping reference contigs
  • Excluded blacklisted regions known to be difficult in sequencing and sequence mapping (e.g. centromeres) using the following list: gs://gatk-best-practices/somatic-hg38/CNV_and_centromere_blacklist.hg38liftover.list
  • Increased the sensitivity of the PoN generation tool (GATK CreateReadCountPanelOfNormals) to account for the lower read depth in haploid sex chromosomes by lowering the threshold at which a region of the genome is discarded (minimum_interval_median_percentile = 5)
  • Used increased smoothing parameter for segmentation with GATK ModelSegments (number-of-changepoints-penalty-factor = 5) when processing WGS data to reduce noisy copy number calls

All copy number data:

  • Fixed bug where some gene copy number values were set to 0.
  • Fixed bug where log-transform was applied twice to some lines.
  • Fixed issue where the X chromosome was reported with a copy ratio of 2 in female samples and 1 in male samples (a normal male/female X chromosome is now reported as 0.5/1, and a normal male Y chromosome as 0.5).
  • Resolved incorrect source annotation in segment file for a few cell lines.
  • Re-extended the segments at the ends of chromosomes to cover the whole genome (segment_cn dataset).

Thanks for the update!

Another small bug that you may want to fix while looking into the copy number data is in the segmentation file. The naming of the chromosomes is not consistent: for some cell lines the “Chr” prefix is part of the chromosome name, while for other cell lines the prefix is missing.

Cheers,

Thanks Julio! We’ve got this queued up to be fixed in the 21Q1 release.
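In the meantime, the inconsistent “Chr” prefix can be normalized on the user side before grouping or joining on chromosome. This is a minimal sketch, not the fix that will go into the pipeline. The function name and the `use_prefix` option are hypothetical.

```python
def normalize_chrom(name, use_prefix=False):
    """Strip a leading 'chr'/'Chr'/'CHR' and optionally add back 'chr'."""
    label = str(name).strip()
    if label[:3].lower() == "chr":
        label = label[3:]
    return "chr" + label if use_prefix else label


# Mixed naming as described above; values are illustrative only.
raw = ["Chr1", "1", "chrX", "X", 7]
print([normalize_chrom(c) for c in raw])
```

Picking one convention (with or without the prefix) and applying it to every row makes the segments from all cell lines comparable.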

Hi James,
My interpretation of this post is that the data is ok in the “Depmap Public 20q4 V2” downloads section. However, I noticed after downloading the copy number data that it doesn’t match the results on the website (i.e. checking using the “tools -> data explorer” section of depmap.org), whereas the 20q3 copy number data DOES match the website.

Additionally, I noticed that the 20q4 v2 CRISPR gene effect data is also just slightly different from what’s showing up on Depmap.org, whereas again the 20q3 CRISPR data is not.

I think going forward I’m just going to stick with the 20q3 data unless I hear otherwise.

Thanks
Ryan

I investigated and discovered a mistake introduced in the “DepMap Public 20Q4 v2” release on the portal. While the portal was correctly loaded, and we appear to have successfully updated the release on figshare at DepMap 20Q4 Public, the new links in the download section of the portal continued to link to files in the first version of the release on figshare, not the latest version.

This happened due to a mistake in how one of our release scripts was run. I’ve also added an assertion to prevent this mistake from happening in the future.

I’ve deployed an update which corrects the links to point to the correct versions of the files.

Thanks again for letting us know about this problem and your continued patience.

CCLE_gene_cn.csv 20Q4 looks good now. The attached graphs are just random samples from 20Q4 and 20Q3.

Thanks!
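For anyone who wants to repeat this kind of spot check between two releases, here is a minimal sketch that lists the entries that differ between two matrices on their shared cell lines and genes. It assumes both files have already been loaded into nested dicts of the form `{cell_line: {gene: value}}`. The function name, the tolerance, and the toy values are hypothetical.

```python
def compare_releases(old, new, tol=1e-6):
    """Return (cell_line, gene, abs_difference) for shared entries that differ."""
    diffs = []
    for line in sorted(set(old) & set(new)):
        shared_genes = set(old[line]) & set(new[line])
        for gene in sorted(shared_genes):
            delta = abs(old[line][gene] - new[line][gene])
            if delta > tol:
                diffs.append((line, gene, delta))
    return diffs


# Toy values for illustration only, not real DepMap data.
release_a = {"LINE_A": {"GENE1": 1.0, "GENE2": 1.0}, "LINE_B": {"GENE1": 0.5}}
release_b = {"LINE_A": {"GENE1": 1.0, "GENE2": 1.5}, "LINE_C": {"GENE1": 2.0}}
print(compare_releases(release_a, release_b))
```

Small differences between releases are expected when pipelines change, so the tolerance should be chosen with that in mind.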