Relationship between Broad and Sanger models

I’m confused about

  1. The relationship between DepMap, CCLE, and Cell Model Passports
  2. Why Broad and Sanger are using different IDs for the cell lines (models)

I downloaded a list of models from each of Sanger and Broad, joined the two lists, and performed some counts. There are models that seem to be in the Broad system and NOT the Sanger system, as well as vice versa.

As a result, I am not free to pick one or the other to use as the standard.

Broad vs. Sanger Cell Line IDs

Ariel Balter

30 May, 2022

DepMap Project

Broad and Sanger are both part of the DepMap project.

Sanger

Sanger has a web page for its DepMap Models.

Under this section is a link to the Cell Model Passports section, which “provides a single location where information on Sanger DepMap cell models is available in a user-friendly environment.”

Cell Model Passports has a download page, which provides a “stable link that always points to the latest version.”

Broad

Broad hosts data for the DepMap project at a dedicated portal:

https://depmap.org/portal/download/

Broad also has a seemingly related project called the [Cancer Cell Line Encyclopedia (CCLE)](https://sites.broadinstitute.org/ccle). The CCLE Datasets page has a link for an annotated list of cell lines; however, that link is dead. The link for Processed Data leads to the DepMap download portal.

That portal lists a file called sample_info.csv, which could very well be the annotated cell line information.

Download Sanger Model List

library(tidyverse)  # read_csv(), select(), full_join(), count(), %>%
library(knitr)      # kable()

model_list =
  read_csv("https://cog.sanger.ac.uk/cmp/download/model_list_latest.csv.gz") %>%
  select(
    sanger_model_id = model_id,
    depmap_id = BROAD_ID,
    sanger_sample_id = sample_id,
    sanger_patient_id = patient_id,
    model_type,
    cell_line_name = model_name,
    ccle_id = CCLE_ID,
    tissue,
    cancer_type,
    cancer_subtype = cancer_type_detail,
    sample_site
  )
Rows: 1984 Columns: 51
-- Column specification -----------------------------------------------
Delimiter: ","
chr (36): model_id, model_name, synonyms, model_type, growth_proper...
dbl  (7): pmed, mutational_burden, ploidy, age_at_sampling, samplin...
lgl  (8): mutation_data, methylation_data, expression_data, cnv_dat...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Download Broad DepMap “Sample Info”

sample_info =
  read_csv("https://ndownloader.figshare.com/files/35020903") %>%
  select(
    depmap_id = DepMap_ID,
    sanger_model_id = Sanger_Model_ID,
    ccle_id = CCLE_Name,
    cell_line_name,
    stripped_cell_line_name,
    tissue = sample_collection_site,
    cancer_type = primary_disease,
    cancer_subtype = Subtype,
    lineage,
    lineage_subtype
    )
Rows: 1840 Columns: 29
-- Column specification -----------------------------------------------
Delimiter: ","
chr (27): DepMap_ID, cell_line_name, stripped_cell_line_name, CCLE_...
dbl  (2): COSMICID, WTSI_Master_Cell_ID

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Joined

joined  = full_join(
  sample_info,
  model_list,
  by = c("sanger_model_id", "depmap_id"),
  suffix = c("_broad", "_sanger")
)

# Put the two join keys first, then the remaining columns alphabetically.
sorted_colnames =
  colnames(joined) %>%
  sort() %>%
  setdiff(., c("sanger_model_id", "depmap_id")) %>%
  c(c("sanger_model_id", "depmap_id"), .)

joined = joined %>% select(all_of(sorted_colnames))
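Because the join above keys on both IDs at once, a model whose DepMap/Sanger pairing is missing or different in one of the two files appears to end up in separate rows rather than a single matched row. A minimal sketch of one way to surface such cases, joining on depmap_id alone and comparing the Sanger IDs from each side. The toy tables and every ID pairing in them are invented for illustration and are not taken from the real files:

```r
library(dplyr)

# Toy stand-ins for sample_info (Broad) and model_list (Sanger).
# All ID pairings below are made up.
broad_toy = tibble(
  depmap_id       = c("ACH-000001", "ACH-000002", "ACH-000003"),
  sanger_model_id = c("SIDM00001", NA, "SIDM00003")
)
sanger_toy = tibble(
  sanger_model_id = c("SIDM00001", "SIDM00002", "SIDM00009"),
  depmap_id       = c("ACH-000001", "ACH-000002", "ACH-000003")
)

# Join on depmap_id only, so each side's Sanger ID can be compared.
id_check =
  inner_join(
    broad_toy, sanger_toy,
    by = "depmap_id",
    suffix = c("_broad", "_sanger")
  ) %>%
  mutate(
    status = case_when(
      is.na(sanger_model_id_broad) ~ "missing in Broad file",
      sanger_model_id_broad == sanger_model_id_sanger ~ "agree",
      TRUE ~ "disagree"
    )
  )

count(id_check, status)
```

Rows flagged "missing in Broad file" or "disagree" are the ones that a two-key full join would split into separate rows.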

Some Counts

joined %>%
  mutate(
    has_depmap_id = !is.na(depmap_id),
    has_sanger_id = !is.na(sanger_model_id)
  ) %>%
  count(has_depmap_id, has_sanger_id) %>%
  kable()
  has_depmap_id   has_sanger_id        n
  --------------- --------------- ------
  FALSE           TRUE               269
  TRUE            FALSE              687
  TRUE            TRUE              1730
joined %>%
  mutate(
    has_depmap_id = !is.na(depmap_id),
    has_sanger_id = !is.na(sanger_model_id)
  ) %>%
  group_by(model_type) %>%
  count(has_depmap_id, has_sanger_id) %>%
  kable()
  model_type   has_depmap_id   has_sanger_id        n
  ------------ --------------- --------------- ------
  Cell Line    FALSE           TRUE               195
  Cell Line    TRUE            TRUE              1715
  Organoid     FALSE           TRUE                74
  NA           TRUE            FALSE              687
  NA           TRUE            TRUE                15
joined %>%
  mutate(
    has_depmap_id = !is.na(depmap_id),
    has_sanger_id = !is.na(sanger_model_id),
    has_ccle_id_broad = !is.na(ccle_id_broad),
    has_ccle_id_sanger = !is.na(ccle_id_sanger),
    has_cell_line_name_broad = !is.na(cell_line_name_broad),
    has_cell_line_name_sanger = !is.na(cell_line_name_sanger)
  ) %>%
  count(has_depmap_id, has_sanger_id, has_ccle_id_broad, has_ccle_id_sanger, has_cell_line_name_broad, has_cell_line_name_sanger) %>%
  arrange(!has_depmap_id, !has_sanger_id, !has_ccle_id_broad, !has_ccle_id_sanger, !has_cell_line_name_broad, !has_cell_line_name_sanger) %>%
  kable()
 has_depmap_id   has_sanger_id   has_ccle_id_broad   has_ccle_id_sanger   has_cell_line_name_broad   has_cell_line_name_sanger        n
 --------------- --------------- ------------------- -------------------- -------------------------- --------------------------- ------
 TRUE            TRUE            TRUE                TRUE                 TRUE                       TRUE                          1108
 TRUE            TRUE            TRUE                TRUE                 FALSE                      TRUE                            28
 TRUE            TRUE            TRUE                FALSE                TRUE                       TRUE                             2
 TRUE            TRUE            TRUE                FALSE                TRUE                       FALSE                           14
 TRUE            TRUE            TRUE                FALSE                FALSE                      FALSE                            1
 TRUE            TRUE            FALSE               TRUE                 FALSE                      TRUE                           575
 TRUE            TRUE            FALSE               FALSE                FALSE                      TRUE                             2
 TRUE            FALSE           TRUE                FALSE                TRUE                       FALSE                          620
 TRUE            FALSE           TRUE                FALSE                FALSE                      FALSE                           63
 TRUE            FALSE           FALSE               FALSE                TRUE                       FALSE                            4
 FALSE           TRUE            FALSE               TRUE                 FALSE                      TRUE                             1
 FALSE           TRUE            FALSE               FALSE                FALSE                      TRUE                           268

(Apologies for the long delay in responding to this. Somehow our system for notifying us of forum posts didn’t recognize this post until just recently.)

What you’re describing sounds like there are inconsistencies in which models have which identifiers, and that’s not entirely surprising given the separate nature of these different projects. Both Broad and Sanger are physically receiving and screening lines from various sources. As a result, there are two distinct registration systems in play, each of which may have lines unique to its organization.

We do periodically try to reconcile our collections with each other to identify lines which are actually the same between both organizations and to resolve disagreements. However, that reconciliation is done only occasionally and is based on the published data available at the time. As a result, the two are unlikely to ever be fully concordant.

One thing I can offer is that I would stick to using depmap_id and sanger_id, but avoid using ccle_id. At this time, the CCLE project lives on as part of DepMap data generation efforts, but the “ccle_id” nomenclature is not being maintained and so I would only use it for historical datasets which were created prior to the existence of depmap_ids.

In other words:
I wouldn’t say that Sanger or Broad can be used as “the standard” because these are two independent efforts with activities happening in parallel. Neither collection will be a superset of the other.

However, between CCLE IDs and DepMap IDs, you should use the DepMap IDs as they have replaced the CCLE naming scheme.
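One way to act on this, sketched on toy data: drop the CCLE-derived columns and keep only the DepMap and Sanger IDs as keys. The table below is a hypothetical stand-in for the joined table from the original post, and every value in it is invented:

```r
library(dplyr)

# Hypothetical stand-in for the `joined` table; all values are invented.
joined_toy = tibble(
  depmap_id       = c("ACH-000001", "ACH-000002", NA),
  sanger_model_id = c("SIDM00001", NA, "SIDM00003"),
  ccle_id_broad   = c("LINE1_TISSUE", "LINE2_TISSUE", NA),
  ccle_id_sanger  = c("LINE1_TISSUE", NA, NA)
)

# Keep only the maintained identifiers and ignore the legacy CCLE names.
id_crosswalk =
  joined_toy %>%
  select(-starts_with("ccle_id")) %>%
  filter(!is.na(depmap_id), !is.na(sanger_model_id)) %>%
  distinct(depmap_id, sanger_model_id)

id_crosswalk
```

Only rows carrying both identifiers survive the filter, so the result is a crosswalk limited to models that both organizations have registered.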

Thanks,
Phil

Hi @pmontgom. Thank you for the explanation. Is there a “most-up-to-date” concordance file between the Broad and Sanger IDs? Also, is there a reliable crosswalk between COSMIC IDs and the others?

Hey, @pmontgom, where would you place the Cell Model Passports list (Model Annotation) in this discussion?

That list is updated multiple times per year. And the way I understand the Cell Model Passports project, they are trying to combine the best information from many other sources.

The ID system they are using appears to be the same as the Sanger format: SIDM#####, which I think stands for “Sanger ID Model…”. The patient IDs are SIDP#####, which I think stands for “Sanger ID Patient”.
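These formats can be checked with a simple pattern match. A minimal sketch, assuming the five `#` characters stand for exactly five digits (that is an assumption, as are the helper names and the sample IDs, which are for illustration only):

```r
# Assumed patterns: "SIDM" or "SIDP" followed by exactly five digits.
is_sanger_model_id   = function(x) grepl("^SIDM[0-9]{5}$", x)
is_sanger_patient_id = function(x) grepl("^SIDP[0-9]{5}$", x)

# Illustrative IDs: a model ID, a patient ID, a DepMap-style ID, and a missing value.
ids = c("SIDM00001", "SIDP00001", "ACH-000001", NA)

is_sanger_model_id(ids)
is_sanger_patient_id(ids)
```

`grepl()` returns FALSE for missing values, so NA entries are reported as non-matching rather than propagating as NA.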

Would this be a good choice for a gold standard?

Yes, I know they are trying to consolidate information about models. I do think they’d be a reasonable choice to use as a standard.

Another, even more expansive choice (covers more than just cancer cell lines) would be https://web.expasy.org/cellosaurus/

My impression of the two projects is that Cellosaurus focuses on the clinical metadata for as many lines as possible that are cited in research, whereas Cell Model Passports is more focused on cancer models and also tries to pull together which of those lines have omics, KO, etc. data as well.

Thanks again. If the NIH/NCI would support one master source, gosh would that be helpful.