DepMap cell lines are not all independent

The assumption that observations are independent underlies every statistical test that I am familiar with, yet many of the cancer cell lines whose data are published on the DepMap portal are not independent of one another. One example is a line derived from a parent cell line by selecting for drug resistance.

It is important for DepMap to clearly highlight this detail for those who wish to use this data.

It would also help for DepMap to publish the unfiltered genetic variants for each cell line, including both somatic and germline single nucleotide variants, which would allow users to accurately identify related cell lines. Alternatively, if patient privacy is an issue, publishing a table summarizing the overlap of germline and somatic variants between cell lines, for example using the Jaccard index, could also be very useful for users.

I’ve shared a table containing the Jaccard index derived from the inferred somatic mutations in each cell line here for people to use to identify related cell lines, based on this MAF file (CCLE_mutations.csv – I lost the version number :sweat_smile:). Note, however, that using only somatic variants makes it difficult or impossible to accurately identify related cell lines if they have a low SNV count.
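For anyone who wants to compute something similar themselves, here is a minimal sketch of the pairwise Jaccard calculation. It is not the exact script behind the shared table: the cell line names and variant keys are made up, and it assumes each line's variants have already been reduced to a set of strings.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(variants_by_line):
    """Return {(line_x, line_y): jaccard} for every unordered pair of lines."""
    return {
        (x, y): jaccard(variants_by_line[x], variants_by_line[y])
        for x, y in combinations(sorted(variants_by_line), 2)
    }


# Hypothetical toy data: variant keys are "chrom:pos:ref>alt" strings.
variants = {
    "LINE_A": {"chr1:100:A>T", "chr2:200:G>C", "chr3:300:C>T"},
    "LINE_A_RESISTANT": {"chr1:100:A>T", "chr2:200:G>C", "chr3:300:C>T", "chr7:700:T>G"},
    "LINE_B": {"chr5:500:G>A"},
}

scores = pairwise_jaccard(variants)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

A parent line and its derivative share most of their variants and score close to 1, while unrelated lines score close to 0. With only a handful of variants per line the score becomes noisy, which is the low-SNV-count caveat above.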

Thanks for these very insightful comments.

We agree that better highlighting non-independence relationships will facilitate improvements in certain statistical analyses of the data. Indeed, this was part of our motivation for adding ‘patient_IDs’ for each cell line in the 22Q2 sample_info file, which delineate isogenic relationships between models. We also added a column ‘parent_depmap_id’ to indicate parental/derivative model relationships. We plan to highlight these relationships more clearly on the portal going forward.
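A minimal sketch of how these two columns could be used to reduce a panel to one line per patient before running a test that assumes independence. The column names are passed as parameters because the exact header spelling in sample_info should be checked against the release you download, and the IDs below are made up for illustration.

```python
from collections import defaultdict


def group_by_patient(rows, id_key="DepMap_ID", patient_key="patient_id"):
    """Map each patient identifier to the cell line IDs derived from that patient."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[patient_key]].append(row[id_key])
    return dict(groups)


def pick_representatives(rows, id_key="DepMap_ID", patient_key="patient_id",
                         parent_key="parent_depmap_id"):
    """Keep one line per patient, preferring a line with no recorded parent."""
    chosen = {}
    for row in rows:
        patient = row[patient_key]
        current = chosen.get(patient)
        # Replace the current pick only if it is a derivative and this row is not.
        if current is None or (current[parent_key] and not row[parent_key]):
            chosen[patient] = row
    return sorted(r[id_key] for r in chosen.values())


# Hypothetical rows standing in for sample_info; an empty string means "no parent".
rows = [
    {"DepMap_ID": "ACH-X00001", "patient_id": "PT-1", "parent_depmap_id": ""},
    {"DepMap_ID": "ACH-X00002", "patient_id": "PT-1", "parent_depmap_id": "ACH-X00001"},
    {"DepMap_ID": "ACH-X00003", "patient_id": "PT-2", "parent_depmap_id": ""},
]

groups = group_by_patient(rows)
representatives = pick_representatives(rows)
print(groups)
print(representatives)
```

Dropping lines is the bluntest option. Depending on the analysis, it may be preferable to keep all lines and treat the patient identifier as a grouping factor instead.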

In terms of unfiltered genetic variants: you can access these, along with the raw data, for CCLE cell lines in our Terra workspace (see Where can I find the raw genomics sequencing data?). For newer cell lines we are working to share the data in an access-controlled manner. The table of germline similarity is an interesting idea as well, which we will discuss.

Thanks!

Thank you very much James.

I hadn’t noticed the patient_IDs and parent_depmap_id fields before. They are indeed very useful!

I think that more clearly highlighting the relationship between cell lines is important and I’m thankful to the DepMap team for taking further steps towards that.

Thank you all for building and maintaining this great resource.

It seems like you’re highlighting important considerations regarding the assumptions of independence in statistical tests and the potential lack of independence in cancer cell line data available on the DepMap portal. Additionally, you’ve suggested ways in which DepMap could enhance the transparency of their data by providing information on genetic variants and relatedness between cell lines.

Your concern about the lack of independence in cancer cell lines, especially those derived from parent cell lines after selecting for drug resistance, is valid. Dependencies between cell lines can introduce biases in statistical analyses and may impact the generalizability of results. It’s crucial for data repositories like DepMap to provide clear documentation on the characteristics of the data, including any non-independence issues.

Your suggestion to publish unfiltered genetic variants for each cell line, encompassing both somatic and germline single nucleotide variants, or providing a summary of the overlap using metrics like the Jaccard index, is a thoughtful approach. This would indeed aid researchers in accurately assessing relatedness between cell lines and understanding the genetic landscape.

However, your acknowledgment of the potential challenges in accurately identifying related cell lines with low somatic single nucleotide variant (SNV) counts is also important. This limitation should be communicated clearly to users to ensure proper interpretation of the data.

If privacy concerns related to patient data are present, providing aggregated summaries rather than raw genetic information could be a reasonable compromise. It’s essential for data repositories to strike a balance between openness and protecting sensitive information.

Your proactive approach in sharing a table with Jaccard indices derived from inferred somatic mutations is commendable. Collaboration and sharing such information within the research community contribute to the advancement of knowledge and the improvement of data analysis practices.

Consider reaching out to DepMap directly with your suggestions. Providing constructive feedback and engaging in a dialogue with data providers can lead to positive changes and improvements in data transparency and usability.