As part of an ongoing effort to update and improve our data analysis pipelines and release processes, we performed a systematic SNP fingerprinting analysis of all the sequencing data used in DepMap, using the Crosscheck tool (Javed et al., 2020). This allowed us to test whether our sequencing data for a given cell line were internally consistent, and whether cell lines identified as genetically matched agreed with annotations for derivative/isogenic models. Wherever possible, we also compared our omics SNP fingerprints with SNP profiles measured prior to CRISPR screening (using a Fluidigm SNP panel).
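As an illustration only, the kind of internal-consistency check described above can be sketched with a simple genotype-concordance comparison. This is not the Crosscheck algorithm, and the sample names, SNP identifiers, genotypes, and the 0.9 threshold below are all hypothetical placeholders chosen for the example.

```python
from itertools import combinations


def concordance(fp_a, fp_b):
    """Fraction of SNPs called in both fingerprints whose genotypes agree."""
    shared = [snp for snp in fp_a if snp in fp_b]
    if not shared:
        return None  # no overlapping SNPs, so no comparison is possible
    return sum(fp_a[snp] == fp_b[snp] for snp in shared) / len(shared)


def flag_mismatches(fingerprints, sample_to_line, threshold=0.9):
    """List sample pairs labeled as the same cell line that agree below threshold.

    The threshold is an arbitrary illustrative value, not a DepMap setting.
    """
    flagged = []
    for a, b in combinations(sorted(fingerprints), 2):
        if sample_to_line[a] != sample_to_line[b]:
            continue  # only compare data labeled as the same line
        score = concordance(fingerprints[a], fingerprints[b])
        if score is not None and score < threshold:
            flagged.append((a, b, score))
    return flagged


# Toy data (hypothetical): the WGS sample labeled LINE_A disagrees with the
# other LINE_A data at three of four SNPs.
fingerprints = {
    "LINE_A_wes": {"rs1": "AA", "rs2": "AG", "rs3": "GG", "rs4": "CT"},
    "LINE_A_rna": {"rs1": "AA", "rs2": "AG", "rs3": "GG", "rs4": "CT"},
    "LINE_A_wgs": {"rs1": "GG", "rs2": "AA", "rs3": "AG", "rs4": "CT"},
    "LINE_B_wes": {"rs1": "GG", "rs2": "AA", "rs3": "AG", "rs4": "CT"},
}
sample_to_line = {
    "LINE_A_wes": "LINE_A",
    "LINE_A_rna": "LINE_A",
    "LINE_A_wgs": "LINE_A",
    "LINE_B_wes": "LINE_B",
}

for a, b, score in flag_mismatches(fingerprints, sample_to_line):
    print(f"{a} vs {b}: concordance {score:.2f}")
```

In this toy case both flagged pairs involve the same sample, which would point to that one dataset as the likely mislabeled one rather than the line as a whole.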
Based on this analysis, along with cross-referencing to external cell line databases such as Cellosaurus, we have identified a number of issues that we are correcting in the 21Q3 data release. The specific issues, and actions taken to correct them, are described in detail below. We apologize for any impacts these errors have had on your research, and plan to incorporate systematic genetic fingerprint testing to prevent such issues going forward.
The following sequencing data were identified as genetically mismatched to other data labeled as from the same cell line, and have been removed.
HS294T (ACH-000014) WGS: failed SNP fingerprinting QC because it did not clearly match RNA-Seq data from the same line or other omics data for the isogenic cell line A101D (ACH-000008). This affects mutation data (legacy mutation calls from RNA-Seq are still available) and copy number data (which will be based on SNP array data until new DNA-Seq data are generated).
KMS18 (ACH-000658) RNA-Seq and WES: SNP fingerprints from these data were consistent with each other, but did not match those from raindance and hybrid capture data for this cell line. We determined that the RNA-Seq and WES data were incorrect by comparing them to SNP fingerprints measured prior to CRISPR screening, and by observing a lack of Y-chromosome reads even though the sample is annotated as male. These RNA-Seq and WES data have been removed. This affects mRNA expression data, mutation data (legacy mutation calls from hybrid capture and raindance are still available), and copy number data (SNP array data will be used until new DNA-Seq data are generated).
CALU1 (ACH-000511) WES: one of our WES files for CALU1 matched to CH157MN (ACH-000025) rather than other CALU1 data. This affects mutation data only. We have additional WES and WGS for this line which are correct.
G402 (ACH-000375) WES: one of our WES files for G402 matched to HCC1143 (ACH-000374) rather than other G402 data. This affects mutation data only. We have additional WES and WGS for this line which are correct.
COV362 (ACH-000278) WES: one of our WES files for COV362 matched COV434 (ACH-000123) rather than COV362. This affects mutation data only. We have additional WES and WGS for this line which are correct.
CCLFPEDS0001T (ACH-001163) RNA-Seq: SNP fingerprint matched to CCLFPEDS0003T (ACH-001164) rather than WES data from CCLFPEDS0001T or RNA data from an isogenic model. This affects mRNA expression data only.
TTC642 (ACH-001212) RNA-Seq: SNP fingerprint did not match WES data from TTC642. RNA-Seq data also clearly showed reads aligning to the Y chromosome even though the sample is annotated as female. This affects mRNA expression data and removes legacy mutation calls based on RNA-Seq data.
WSUNHL (ACH-001709) WES: SNP fingerprint did not match RNA-Seq data from WSUNHL or the SNP profile measured prior to CRISPR screening. This affects mutation and copy number data, though we have correct WGS for this line. Copy number data were generated from the available WGS data; however, mutation calls have been dropped this quarter and will be generated from the WGS data in the upcoming quarter.
HAP1 (ACH-002475) WES: SNP fingerprint did not match RNA-Seq data from HAP1, or RNA-Seq/WES data from a derivative line (ACH-002476). This results in loss of mutation data and copy number data for this line until we can regenerate the data.
DU145 (ACH-000979) hybrid capture: failed SNP fingerprinting QC (poor match to other RNA-Seq and DNA-Seq data for this line). This affects the legacy hybrid-capture mutation data only.
ISHIKAWAHERAKLIO02ER (ACH-000961) Sanger WES: SNP fingerprint matched HEC50B_ENDOMETRIUM (ACH-000831) rather than other data from ISHIKAWAHERAKLIO02ER. This only affects the mutation data. Note that this cell line has correct WGS data as well. Copy number data were generated from the available WGS data; however, mutation calls have been dropped this quarter and will be generated from the WGS data in the upcoming quarter.
CMK (ACH-000641) Sanger WES: failed SNP fingerprinting QC, possibly due to genetic drift. This affects the legacy mutation data only.
PC3 (ACH-000090) Sanger WES: SNP fingerprint did not match that from Broad data for PC3. This only affects the mutation data. Note that this cell line has correct WGS data as well. Copy number data were generated from the available WGS data; however, mutation calls have been dropped this quarter and will be generated from the WGS data in the upcoming quarter.
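Two of the entries above (KMS18 and TTC642) used the presence or absence of Y-chromosome reads as supporting evidence. A minimal sketch of such a check follows, assuming per-chromosome aligned read counts are already available. The function names, the "chrY" key, and the 0.001 cutoff are hypothetical choices for illustration, not values used in our pipeline.

```python
def y_read_fraction(chrom_counts):
    """Fraction of aligned reads that map to the Y chromosome."""
    total = sum(chrom_counts.values())
    if total == 0:
        raise ValueError("no aligned reads")
    return chrom_counts.get("chrY", 0) / total


def sex_consistent(chrom_counts, annotated_sex, min_y_fraction=1e-3):
    """True if Y-read evidence agrees with the annotated sex.

    min_y_fraction is an arbitrary illustrative cutoff; a real cutoff would
    depend on the assay and alignment settings.
    """
    has_y = y_read_fraction(chrom_counts) >= min_y_fraction
    return has_y == (annotated_sex == "male")


# Toy read counts (hypothetical): no Y reads in a sample annotated as male.
counts_no_y = {"chr1": 1_000_000, "chrX": 50_000, "chrY": 0}
print(sex_consistent(counts_no_y, "male"))
```

Because male-derived cell lines can lose the Y chromosome, a missing Y signal is weaker evidence than an unexpected one, so a check like this would only support, not replace, the fingerprint comparison.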
In each of the following cases we determined that our omics data for a given cell line, while ‘internally consistent’, were all mislabeled.
JR (ACH-001096) and SMSCTR (ACH-001196) omics data was swapped
When omics SNP fingerprints were compared to Fluidigm SNP fingerprints used to identify cell lines prior to CRISPR screening, JR omics matched SMSCTR and SMSCTR omics matched JR. The omics data swap was confirmed using the sex of the cell lines. The JR omics data was female instead of the expected male and the SMSCTR omics data was male instead of the expected female. We resolved this issue by relabeling the omics data.
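The reciprocal-match pattern described here (each line's omics data best matching the other line's reference fingerprint) can be sketched as follows. This is a simplified illustration using plain genotype agreement; the line names and genotypes are hypothetical toy values, and the helper functions are not part of any DepMap tooling.

```python
def match_fraction(fp_a, fp_b):
    """Fraction of shared SNPs with identical genotypes (0.0 if none shared)."""
    shared = [snp for snp in fp_a if snp in fp_b]
    if not shared:
        return 0.0
    return sum(fp_a[snp] == fp_b[snp] for snp in shared) / len(shared)


def best_match(query_fp, reference_fps):
    """Name of the reference fingerprint that agrees best with the query."""
    return max(reference_fps, key=lambda name: match_fraction(query_fp, reference_fps[name]))


def find_swaps(omics_fps, reference_fps):
    """Pairs of lines whose omics data each best match the other's reference."""
    best = {line: best_match(fp, reference_fps) for line, fp in omics_fps.items()}
    return [(a, b) for a, b in best.items() if a < b and best.get(b) == a]


# Toy data (hypothetical): the omics labels for LINE_X and LINE_Y are exchanged
# relative to the reference panel.
reference_fps = {
    "LINE_X": {"rs1": "AA", "rs2": "CT", "rs3": "GG"},
    "LINE_Y": {"rs1": "AG", "rs2": "CC", "rs3": "TT"},
}
omics_fps = {
    "LINE_X": {"rs1": "AG", "rs2": "CC", "rs3": "TT"},
    "LINE_Y": {"rs1": "AA", "rs2": "CT", "rs3": "GG"},
}
print(find_swaps(omics_fps, reference_fps))
```

A reciprocal match alone would not settle which labels are correct, which is why an independent check such as the sex of the cell lines was used to confirm the swap.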
HCC1588 (ACH-001078) omics data was from LS513 (ACH-000007) not HCC1588
HCC1588 omics SNP fingerprint matched LS513. This misidentification was confirmed by the RNA-Seq expression and sex. HCC1588 is documented as an NSCLC line, but its expression is similar to that of colorectal cell lines including LS513. Additionally, HCC1588 is documented as female but the omics data is male. This misidentification was previously noticed by the PRISM team and HCC1588 was removed from the PRISM data (Corsello et al. 2020). We are now removing all data for this cell line.
PC3JPC3 (ACH-002184) omics data was from PC3 (ACH-000090) not PC3JPC3
PC3JPC3 omics SNP fingerprint matched PC3. Since PC3JPC3 is documented as a lung adenocarcinoma line that is not isogenic with PC3 (Cellosaurus CVCL_S982), we are removing the omics data for PC3JPC3.
OCILY10 (ACH-001146) omics data was from KARPAS422 (ACH-000315) not OCILY10
OCILY10 omics SNP fingerprint matched KARPAS422. This misidentification was confirmed based on characteristic mutations described in the literature. First, the OCILY10 omics data contain the mutations that Zhang et al. 2013 observed in KARPAS422 and lack the mutations they observed in OCILY10. Second, the OCILY10 omics data carry a specific TP53 mutation (c.955A>T) that Deng et al. 2018 documented in KARPAS422. We are removing the OCILY10 omics data.
PMFKO14 (ACH-002022) omics data was from KM12 (ACH-000969) not PMFKO14
PMFKO14 omics SNP fingerprint matched KM12. This misidentification was confirmed based on ancestry analysis and characteristic mutations described in the literature. PMFKO14 is documented as having Japanese ancestry (Cellosaurus CVCL_8747) but the omics data has European ancestry. The PMFKO14 omics data also contain the mutations that Berg et al., 2017 described for KM12. We also note that PMFKO14 has previously been identified as isogenic with KM12 (Liang-Chu et al., 2015), suggesting PMFKO14 may be more broadly misidentified. We are removing the PMFKO14 omics data.
R262 (ACH-001173) and R256 (ACH-001172) omics and CRISPR data were from a U251MG (ACH-000232) derivative, not R262 or R256
R262 and R256 omics SNP fingerprints were an exact match and were very similar to the fingerprint for U251MG. This suggested that the R262 and R256 omics data were from a U251MG derivative line, which was confirmed based on CDKN2A and OR4C11 homozygous deletions that are characteristic of U251MG (Torsvik et al. 2014). Since this line has a distinct copy number profile from U251MG we are creating a new U251MG derivative U251MGDM (ACH-001172) using the omics data from both R262 and R256. The Fluidigm SNP fingerprints for the R262 and R256 CRISPR data match the omics fingerprint for U251MGDM so we will also associate R262 and R256 CRISPR data with this line.
Figure showing somewhat similar, but clearly distinct, copy number profiles for U251MG and the newly created derivative model U251MGDM.
NCIH2077 (ACH-000010) omics data was from NCIH1581 (ACH-000015)
The omics SNP fingerprints for NCIH2077 and NCIH1581 are identical. Our data are consistent with external data for NCIH1581 (Cellosaurus CVCL_1479), and we do not find evidence suggesting these lines should be isogenic, suggesting our NCIH2077 omics data is mislabeled. We are removing all of our NCIH2077 omics data.
In these cases we found that SNP profiling performed prior to CRISPR screening did not match SNP profiles in the omics data, and confirmed that the CRISPR data were mislabeled.
CRISPR data from OCIC5X (ACH-001369) was actually from OCIP5X (ACH-001370)
The SNP profile of OCIC5X omics data did not match the Fluidigm SNP profile measured prior to CRISPR screening. We confirmed by STR profiling that the CRISPR data for OCIC5X was actually from OCIP5X. We are correcting this mislabeled CRISPR data.
Raw CRISPR data from NALM1 (ACH-000462) was actually from NALM6 (ACH-000938)
We found that the SNP profile in NALM1 omics data did not match the NALM1 Fluidigm SNP profile measured prior to CRISPR screening. We then confirmed by STR profiling that the CRISPR data for NALM1 was actually from NALM6. NALM1 CRISPR data also failed QC, so only the raw NALM1 data were released (in the *_failures files). We are removing the raw data for this cell line.
We determined that the following cell line pairs were duplicates. We have merged their data and deprecated the extraneous cell line identifiers.
RH18DM (ACH-001790) is a duplicate of RD (ACH-000169)
The RH18DM cell line was created by DepMap because the Fluidigm SNP fingerprint didn’t match RH18 as expected. SNP fingerprinting of the RH18DM omics data revealed that this cell line is a duplicate of RD so we are relabeling the omics and CRISPR data of RH18DM accordingly.
KOSC2CL343 (ACH-001543) is a duplicate of KOSC2 (ACH-002260)
The omics SNP fingerprints for KOSC2CL343 and KOSC2 are identical. Since KOSC2CL343 is documented as another name for KOSC2 (Cellosaurus CVCL_1337), we are relabeling the KOSC2CL343 omics and CRISPR data to be KOSC2 (and we’re using the ID ACH-001543 to represent this line).
RMS13 (ACH-001741) is a duplicate of RH30 (ACH-000833)
The omics SNP fingerprints for RMS13 and RH30 are identical. Since RMS13 is documented as another name for RH30 (Cellosaurus CVCL_0041) we are relabeling the RMS13 omics data (which is just RNA-Seq) to RH30.