We have discovered issues with files in the original 20Q4 data release that have now been addressed on the public DepMap portal. Updates have been made to the ‘DepMap Public 20Q4’ dataset and are now available in the ‘DepMap Public 20Q4 v2’ dataset (as of Dec. 22, 2020). There were two separate sets of issues, described in more detail below.
The first set of issues affected the 20Q4 copy number data, as well as the CRISPR (Avana) CERES data, which is dependent on copy number information. In the ‘DepMap Public 20Q4 v2’ release we have reverted the affected files to their previous versions to remove any potentially erroneous data. The 21Q1 release in February will include the corrected copy number and CRISPR data.
The second set of issues affected some of the gene expression files. These issues were independent of the problems with the copy number pipeline described above. Hence, we corrected these files in the ‘DepMap Public 20Q4 v2’ release, rather than reverting them to their previous versions.
Thank you to those in the community who alerted us to these issues. We apologize for these mistakes and the impacts they may have had on your research.
Issues with the original 20Q4 copy number files:
- The gene-level copy number calculation occasionally skipped genes and erroneously assigned them a value of zero. This affected 5% of genes.
- There were around 800 lines where the gene-level copy number values were inadvertently log-transformed twice, rather than once.
- The XY panel of normals (PoN) was used instead of the XX PoN in order to incorporate Y chromosome copy numbers. As a result, segment copy number values on the X/Y chromosomes were double what they should have been.
- For 344 cell lines we changed our pipelines to use available WGS data (rather than WES or SNP-array data) to estimate copy number profiles. We have observed that there are certain genomic regions in these resulting copy number profiles that are highly variable, and likely reflect technical bias. We are looking into these differences more closely and will update this post with more information when we have it.
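To illustrate the double log-transform issue listed above, here is a minimal sketch. It assumes the gene-level transform has the form log2(x + 1) applied to relative copy number; this post does not state the exact transform, so treat that form and the function names as assumptions for illustration only.

```python
import math


def log2_plus_one(x: float) -> float:
    """Assumed gene-level transform: log2(x + 1)."""
    return math.log2(x + 1.0)


def undo_log2_plus_one(y: float) -> float:
    """Inverse of the assumed transform: 2**y - 1."""
    return 2.0 ** y - 1.0


# Compare a single transform with an accidental second application.
for relative_cn in (0.0, 0.5, 1.0, 2.0, 4.0):
    once = log2_plus_one(relative_cn)
    twice = log2_plus_one(once)
    print(f"relative_cn={relative_cn:.1f} once={once:.3f} twice={twice:.3f}")
```

Under this assumed transform, a second application leaves transformed values of 0 and 1 unchanged and pulls every other value toward 1, so affected lines would show compressed gains and losses rather than obviously invalid numbers.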
Issues with the original 20Q4 gene expression files:
- For the files CCLE_RNAseq_reads the underlying datasets (representing TPM expression and expected counts respectively) were swapped.
- For the file CCLE_RNAseq_transcripts, data for the cell line ACH-000561 was incorrectly normalized.
Note that, starting in 20Q4, the expression datasets use log(x+1) transformation for all TPM values (including now the CCLE_RNAseq_transcripts file), and no transformation for expected counts. See the README for more details.
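As a rough sketch of the convention described above: TPM values are stored as log(x+1) and expected counts are stored untransformed. The base-2 logarithm and the function names below are assumptions for illustration; check the README for the exact definition.

```python
import math


def transform_tpm(tpm: float) -> float:
    """log(x + 1) transform for TPM values (base 2 is assumed here)."""
    return math.log2(tpm + 1.0)


def tpm_from_transformed(value: float) -> float:
    """Recover the raw TPM from a transformed value (assumes base 2)."""
    return 2.0 ** value - 1.0


def transform_expected_count(count: float) -> float:
    """Expected counts are stored as-is, with no transformation."""
    return count


for tpm in (0.0, 1.0, 3.0, 100.0):
    print(f"TPM={tpm:.1f} transformed={transform_tpm(tpm):.3f}")
```

The +1 offset keeps a TPM of zero at zero after the transform instead of producing an undefined logarithm.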
We have also identified issues with the 20Q4 RNA-Seq data for the ACH-001321 cell line (TT_THYROID) and recommend against using these data in analyses. Other data for this cell line are not affected. We are working to resolve this for the 21Q1 release. The 20Q3 release contains the correct ACH-001321 RNA-Seq data.
---------------Update following 21Q1 data release [Feb. 5 2021]---------------
The following changes were made in the 21Q1 data release to address issues with the initial 20Q4 copy number data. Note that this only refers to the initial 20Q4 release (before files were reverted).
Copy number data inferred from WGS:
- Resolved multimapping issues with the WGS realignments to hg38 (that were causing noisy copy number estimates) by including the ALT mapping reference contigs
- Excluded blacklisted regions known to be difficult in sequencing and sequence mapping (e.g. centromeres) using the following list: gs://gatk-best-practices/somatic-hg38/CNV_and_centromere_blacklist.hg38liftover.list
- Increased the sensitivity of the PoN generation tool (GATK CreateReadCountPanelOfNormals) to account for the lower read depth in haploid sex chromosomes by lowering the threshold at which a region of the genome is discarded (minimum_interval_median_percentile = 5)
- Used an increased smoothing parameter for segmentation with GATK ModelSegments (number-of-changepoints-penalty-factor = 5) when processing WGS data to reduce noisy copy number calls
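The two GATK parameter changes above could be passed on the command line roughly as follows. This is a sketch only: it builds the argument lists without running anything, all file names are hypothetical placeholders, and the actual DepMap pipeline configuration may set these values differently (for example through workflow inputs rather than direct CLI flags).

```python
import shlex
from typing import List


def create_pon_command(read_count_files: List[str], output_pon: str,
                       min_interval_median_percentile: float = 5.0) -> List[str]:
    """Build a GATK CreateReadCountPanelOfNormals call with a lowered interval filter."""
    cmd = ["gatk", "CreateReadCountPanelOfNormals"]
    for path in read_count_files:
        cmd += ["-I", path]
    cmd += ["--minimum-interval-median-percentile", str(min_interval_median_percentile),
            "-O", output_pon]
    return cmd


def model_segments_command(denoised_copy_ratios: str, output_dir: str, output_prefix: str,
                           penalty_factor: float = 5.0) -> List[str]:
    """Build a GATK ModelSegments call with a higher changepoint penalty (smoother segments)."""
    return ["gatk", "ModelSegments",
            "--denoised-copy-ratios", denoised_copy_ratios,
            "--number-of-changepoints-penalty-factor", str(penalty_factor),
            "-O", output_dir,
            "--output-prefix", output_prefix]


# Hypothetical file names, used only to show the resulting command lines.
pon_cmd = create_pon_command(["normal_1.counts.hdf5", "normal_2.counts.hdf5"], "wgs_pon.hdf5")
seg_cmd = model_segments_command("sample.denoisedCR.tsv", "segments_out", "sample")
print(shlex.join(pon_cmd))
print(shlex.join(seg_cmd))
```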
All copy number data:
- Fixed bug where some gene copy number values were set to 0.
- Fixed bug where log-transform was applied twice to some lines.
- Fixed issue where the X chromosome was reported with a copy ratio of 2 in female samples and 1 in male samples (a normal male/female X chromosome is now reported as 0.5/1, and a normal male Y chromosome as 0.5).
- Resolved incorrect source annotation in segment file for a few cell lines.
- Re-extended the segments at the ends of chromosomes to cover the whole genome (segment_cn dataset).
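The corrected sex chromosome convention above can be summarized in a short sketch: the expected copy ratio of a normal chromosome is its copy count divided by the diploid autosomal count of 2. The function name and the value of 0 for a female Y chromosome are assumptions added for illustration; the post itself only states the male/female X and male Y values.

```python
AUTOSOMAL_COPIES = 2

# Normal copy counts for sex chromosomes. The female Y entry is an assumption.
SEX_CHROMOSOME_COPIES = {
    ("female", "X"): 2,
    ("female", "Y"): 0,
    ("male", "X"): 1,
    ("male", "Y"): 1,
}


def expected_copy_ratio(sex: str, chromosome: str) -> float:
    """Expected copy ratio of a normal chromosome, relative to two autosomal copies."""
    if chromosome in ("X", "Y"):
        copies = SEX_CHROMOSOME_COPIES[(sex, chromosome)]
    else:
        copies = AUTOSOMAL_COPIES
    return copies / AUTOSOMAL_COPIES


for sex in ("female", "male"):
    for chromosome in ("1", "X", "Y"):
        print(sex, chromosome, expected_copy_ratio(sex, chromosome))
```

With this convention, the values reported before the fix (2 for female X, 1 for male X) are exactly double the expected ratios.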