We have discovered issues with files in the original 20Q4 data release that have now been addressed on the public DepMap portal. Updates have been made to the ‘DepMap Public 20Q4’ dataset and are now available in the ‘DepMap Public 20Q4 v2’ dataset (as of Dec. 22, 2020). There were two separate sets of issues, described in more detail below.
The first set of issues affected the 20Q4 copy number data, as well as the CRISPR (Avana) CERES data, which is dependent on copy number information. In the ‘DepMap Public 20Q4 v2’ release we have reverted the affected files to their previous versions to remove any potentially erroneous data. The 21Q1 release in February will include the corrected copy number and CRISPR data.
The second set of issues affected some of the gene expression files. These issues were independent of the problems with the copy number pipeline described above. Hence, we corrected these files in the ‘DepMap Public 20Q4 v2’ release, rather than reverting them to their previous versions.
Thank you to those in the community who alerted us to these issues. We apologize for these mistakes and the impacts they may have had on your research.
Issues with the original 20Q4 copy number files:
- The gene-level copy number calculation occasionally skipped genes and erroneously assigned them zero values. This affected 5% of genes.
- In around 800 lines, the gene-level copy number values were inadvertently log-transformed twice rather than once.
- The XY panel of normals (PoN) was used instead of the XX PoN in order to incorporate Y chromosome copy numbers. This caused the segment copy number values for the X and Y chromosomes to be double what they should have been.
- For 344 cell lines we changed our pipelines to use available WGS data (rather than WES or SNP-array data) to estimate copy number profiles. We have observed that certain genomic regions in the resulting copy number profiles are highly variable, which likely reflects technical bias. We are looking into these differences more closely and will update this post with more information when we have it.
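To illustrate the double log transformation mentioned in the list above, here is a minimal sketch. It assumes the gene-level transform is log2(x + 1); this post does not state the exact transform, so treat that as an assumption. The function names and the example value are hypothetical.

```python
import math

def log2_plus_one(x):
    # Assumed transform, for illustration only: log2(x + 1).
    return math.log2(x + 1.0)

def undo_log2_plus_one(y):
    # Inverse of the assumed transform.
    return 2.0 ** y - 1.0

relative_cn = 2.0                   # hypothetical untransformed value
once = log2_plus_one(relative_cn)   # intended: transformed a single time
twice = log2_plus_one(once)         # erroneous: transformed a second time

# Inverting the assumed transform once recovers the intended value.
recovered = undo_log2_plus_one(twice)
print(once, twice, recovered)
```

Under this assumed transform, the values 0 and 1 map to themselves, so a second transformation would leave those values unchanged while altering all others.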
Issues with the original 20Q4 gene expression files:
- For the files CCLE_RNAseq_reads the underlying datasets (representing TPM expression and expected counts respectively) were swapped.
- For the file CCLE_RNAseq_transcripts, data for the cell line ACH-000561 was incorrectly normalized.
Note that, starting in 20Q4, the expression datasets use log(x+1) transformation for all TPM values (including now the CCLE_RNAseq_transcripts file), and no transformation for expected counts. See the README for more details.
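As a rough sketch of what the log(x+1) convention means when working with these files: TPM values are stored transformed, while expected counts are stored as-is. The logarithm base is not stated in this post; base 2 is assumed here, so please check the README before relying on it. The function names are hypothetical.

```python
import math

def transform_tpm(tpm, base=2.0):
    # log(x + 1) in the given base; base 2 is an assumption, see the README.
    return math.log(tpm + 1.0, base)

def untransform_tpm(value, base=2.0):
    # Inverse of transform_tpm: recovers the TPM value.
    return base ** value - 1.0

def transform_expected_count(count):
    # Expected counts are not transformed.
    return count

stored = transform_tpm(3.0)
original = untransform_tpm(stored)
print(stored, original)
```

The +1 pseudo-count keeps a TPM of zero at zero after transformation, which avoids taking the logarithm of zero.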
We have also identified issues with the 20Q4 RNA-Seq data for the ACH-001321 cell line (TT_THYROID) and recommend not using them in analyses. Other data for this cell line are not affected. We are working to resolve this for the 21Q1 release. The 20Q3 release contains the correct ACH-001321 RNA-Seq data.
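One way to follow this recommendation is to drop the affected cell line from expression data before analysis. The sketch below assumes the expression data has been loaded into a dictionary keyed by DepMap ID; the gene names and values are placeholders, not real data, and the helper name is hypothetical.

```python
# Cell lines whose 20Q4 RNA-Seq data should not be used.
EXCLUDED_RNASEQ_IDS = {"ACH-001321"}

def drop_flagged_lines(expression_by_line, excluded=EXCLUDED_RNASEQ_IDS):
    # Return a copy of the mapping without the excluded cell lines.
    return {
        depmap_id: values
        for depmap_id, values in expression_by_line.items()
        if depmap_id not in excluded
    }

# Placeholder values for illustration only.
expression = {
    "ACH-000561": {"GENE_A": 1.2, "GENE_B": 0.0},
    "ACH-001321": {"GENE_A": 3.4, "GENE_B": 0.7},
}

filtered = drop_flagged_lines(expression)
print(sorted(filtered))
```

Only the RNA-Seq data needs this filter; other data types for ACH-001321 are not affected.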