DepMap cell lines are not all independent

The assumption that observations are independent underlies every statistical test that I am familiar with, yet many of the cancer cell lines whose data are published on the DepMap portal are not independent of one another. One example is a cell line derived from a parental line by selecting for drug resistance.

It is important for DepMap to clearly highlight this detail for those who wish to use this data.

It would also help if DepMap published the unfiltered genetic variants for each cell line, including both somatic and germline single nucleotide variants, which would allow users to accurately identify related cell lines. Alternatively, if patient privacy is an issue, a table summarizing the overlap of germline and somatic variants between each pair of cell lines, for example using the Jaccard index, would also be very useful.

I’ve shared a table of pairwise Jaccard indices, computed from the inferred somatic mutations in each cell line, here so that people can use it to identify related cell lines. It is based on this MAF file (CCLE_mutations.csv; I lost the version number :sweat_smile:). Note, however, that using only somatic variants makes it difficult or impossible to accurately identify related cell lines when they have a low SNV count.
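For anyone who wants to compute something similar themselves, here is a minimal sketch of the pairwise Jaccard index (shared variants divided by the union of variants). It is not necessarily how the shared table was produced. The function name `jaccard_by_cell_line`, the choice of (chromosome, position, ref, alt) as the variant key, and the toy identifiers are all assumptions for illustration, so adapt them to the columns in your copy of the MAF file.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_by_cell_line(records):
    """Return {(line_a, line_b): Jaccard index} for every pair of cell lines.

    `records` is an iterable of (cell_line_id, variant_key) pairs, where
    variant_key is any hashable identifier for a variant, for example
    (chromosome, position, ref_allele, alt_allele).
    """
    # Collect the set of variants observed in each cell line.
    variants = defaultdict(set)
    for cell_line, variant in records:
        variants[cell_line].add(variant)

    scores = {}
    for a, b in combinations(sorted(variants), 2):
        shared = variants[a] & variants[b]
        union = variants[a] | variants[b]
        # The union is never empty: a line only appears if it has a variant.
        scores[(a, b)] = len(shared) / len(union)
    return scores


# Toy data with made-up identifiers (not real DepMap lines or variants).
records = [
    ("parent", ("chr1", 100, "A", "T")),
    ("parent", ("chr2", 200, "G", "C")),
    ("derived", ("chr1", 100, "A", "T")),
    ("derived", ("chr2", 200, "G", "C")),
    ("derived", ("chr3", 300, "C", "A")),
    ("unrelated", ("chr4", 400, "T", "G")),
]
scores = jaccard_by_cell_line(records)
```

In this toy example the "derived" and "parent" lines share two of three distinct variants, while "unrelated" shares none with either. With real data, a line with very few called variants gives a noisy index, which is the low-SNV caveat mentioned above.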

Thanks for these very insightful comments.

We agree that better highlighting non-independence relationships will facilitate improvements in certain statistical analyses of the data. Indeed, this was part of our motivation for adding ‘patient_IDs’ for each cell line in the 22Q2 sample_info file, which delineate isogenic relationships between models. We also added a ‘parent_depmap_id’ column to indicate parental/derivative model relationships. We plan to highlight these relationships more clearly on the portal going forward.
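As a rough illustration of how these two fields might be used, the sketch below groups models that share a patient ID and picks one representative per patient, preferring a model with no recorded parent. The helper names and the column spellings `DepMap_ID`, `patient_id`, and `parent_depmap_id` are assumptions passed in as parameters, so check them against the header of your sample_info file. Keeping one line per patient is only one possible way to handle non-independence, not a recommendation from the DepMap team.

```python
from collections import defaultdict


def related_groups(rows, id_col="DepMap_ID", patient_col="patient_id"):
    """Map each patient ID to its model IDs, keeping only patients with 2+ models."""
    groups = defaultdict(list)
    for row in rows:
        patient = row.get(patient_col)
        if patient:  # skip rows with a missing or empty patient ID
            groups[patient].append(row[id_col])
    return {p: ids for p, ids in groups.items() if len(ids) > 1}


def one_per_patient(rows, id_col="DepMap_ID", patient_col="patient_id",
                    parent_col="parent_depmap_id"):
    """Pick one model per patient, preferring models without a recorded parent."""
    chosen = {}
    for row in rows:
        # A row without a patient ID is treated as its own group.
        patient = row.get(patient_col) or row[id_col]
        current = chosen.get(patient)
        # Replace a derivative with a parental (no-parent) model when one appears.
        if current is None or (current.get(parent_col) and not row.get(parent_col)):
            chosen[patient] = row
    return [row[id_col] for row in chosen.values()]


# Toy rows with made-up identifiers; real rows could come from csv.DictReader.
rows = [
    {"DepMap_ID": "ACH-B", "patient_id": "PT-1", "parent_depmap_id": "ACH-A"},
    {"DepMap_ID": "ACH-A", "patient_id": "PT-1", "parent_depmap_id": ""},
    {"DepMap_ID": "ACH-C", "patient_id": "PT-2", "parent_depmap_id": ""},
]
groups = related_groups(rows)
representatives = one_per_patient(rows)
```

Here `groups` flags the two models from the same hypothetical patient, and `representatives` keeps the parental model for that patient plus the single model from the other patient.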

In terms of unfiltered genetic variants: you can access these, along with the raw data, for CCLE cell lines in our Terra workspace (see “Where can I find the raw genomics sequencing data?”). For newer cell lines, we are working to share the data in an access-controlled manner. The table of germline similarity is an interesting idea as well, and we will discuss it.

Thanks!

Thank you very much, James.

I hadn’t noticed the patient_IDs and parent_depmap_id fields before. They are indeed very useful!

I think that more clearly highlighting the relationship between cell lines is important and I’m thankful to the DepMap team for taking further steps towards that.

Thank you all for building and maintaining this great resource.