Looking at the 22Q4 mutational data, I am realizing that some genes known to be mutated in CCLE cell lines are missing any mutation annotation (missing from the “overview” and the “characterization” tabs). An example is the ADAMTS2 gene, for which no mutation information is listed. In the 22Q2 dataset (CCLE_mutations.csv), several SNPs were listed (223 to be precise), and I believe that they used to be visible in the 22Q2 portal characterization tab. SNPs for this gene are also listed in the 22Q4 “OmicsSomaticMutations.csv” file, so it is not as if the new mutation calling pipeline has failed to “rediscover” these mutations.
The cell line “ACH-000943” is one of many listed in the OmicsSomaticMutations file as having a missense mutation in the ADAMTS2 gene, but this mutation is not listed in the mutation data for this cell line on the portal “characterization” tab.
I am also finding that the 22Q4 “Damaging mutation” table contains data for “only” 16383 genes, and the ADAMTS2 gene is missing from the list. I thought that maybe only “Damaging mutations” are now shown on the portal, and that perhaps the way of assessing whether a SNP is damaging has changed with the latest data release. However, for other genes, silent mutations are listed (e.g. ADAMTS7 has mutation data listed in the overview tab and its silent mutations show up in the characterization tab, including one in the same “ACH-000943” cell line).
I would like to understand what is going on.
Thanks for your help.
The short version:
Mutations for this gene (and others) appear in the OmicsSomaticMutations file but are absent from the portal because these genes are missing a valid Entrez Gene ID. This investigation led us to discover a problem in our code that looks up Entrez Gene IDs in BioMart.
We plan on fixing that as part of the next data release.
I looked into how many Hugo Symbols have an “Unknown” Entrez Gene ID, and there were ~6k, which seems like a lot. However, when I look at what types of genes they are, most are non-protein-coding genes, and the portal only displays protein-coding genes. (BioMart reports most of these with gene types such as lncRNA, processed_pseudogene, unprocessed_pseudogene, etc.) Of that 6k, only ~500 are protein-coding genes with an Unknown Entrez Gene ID, or in other words <2% of the total number of genes in the mutations file. But again, 500 is more than I’d expect to be missing. (And ADAMTS2 is in that list of protein-coding genes that somehow failed to get an Entrez Gene ID.)
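For anyone who wants to run a similar tally on their own copy of the file, here is a minimal sketch of the idea. It is not the code we actually ran: the column names `HugoSymbol` and `EntrezGeneID`, the `gene_types` lookup, and every gene other than ADAMTS2 are assumptions made up for illustration, so check them against your download before relying on it.

```python
from collections import Counter

def count_unmapped_by_type(rows, gene_types, unknown="Unknown"):
    """Count distinct gene symbols with no Entrez ID, grouped by gene type.

    rows       -- iterable of dicts, one per mutation record
                  (assumed keys: "HugoSymbol", "EntrezGeneID")
    gene_types -- hypothetical dict mapping symbol -> gene type
    """
    # A gene can have many mutation rows, so collect symbols into a set first.
    unmapped = {
        r["HugoSymbol"]
        for r in rows
        if r["EntrezGeneID"] in (unknown, "", None)
    }
    return Counter(gene_types.get(sym, "unannotated") for sym in unmapped)

# Toy records; IDs and the GENE_*/LNC_* names are placeholders, not real data.
rows = [
    {"HugoSymbol": "ADAMTS2", "EntrezGeneID": "Unknown"},
    {"HugoSymbol": "ADAMTS2", "EntrezGeneID": "Unknown"},
    {"HugoSymbol": "GENE_A", "EntrezGeneID": "1001"},
    {"HugoSymbol": "LNC_B", "EntrezGeneID": "Unknown"},
]
gene_types = {
    "ADAMTS2": "protein_coding",
    "GENE_A": "protein_coding",
    "LNC_B": "lncRNA",
}

counts = count_unmapped_by_type(rows, gene_types)
print(dict(counts))
```

With real data, `rows` could come from `csv.DictReader` over the mutations file, and `gene_types` from whatever gene annotation table you have on hand.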
I was expecting some genes to be missing an Entrez Gene ID, because after identifying mutations, we annotate them with gene information using Funcotator, which provides us with Ensembl Gene IDs. Our other tools and the DepMap portal use Entrez Gene IDs to uniquely identify genes, so we use BioMart to map the Ensembl Gene IDs to Entrez Gene IDs. That mapping is not strictly one-to-one, as there are differences between NCBI’s database and Ensembl’s.
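To make the mapping issue concrete, here is a rough sketch of why some IDs drop out even when the code works as intended. It is not our pipeline code and does not reproduce the bug; the `lookup` table stands in for a BioMart export, and all IDs below are made-up placeholders.

```python
def map_ensembl_to_entrez(ensembl_ids, lookup):
    """Split Ensembl IDs into mapped, unmapped, and ambiguous groups.

    lookup -- hypothetical dict: Ensembl gene ID -> set of Entrez gene IDs
    """
    mapped, unmapped, ambiguous = {}, [], {}
    for ens_id in ensembl_ids:
        hits = sorted(lookup.get(ens_id, set()))
        if len(hits) == 1:
            mapped[ens_id] = hits[0]      # clean one-to-one case
        elif not hits:
            unmapped.append(ens_id)       # no Entrez record at all
        else:
            ambiguous[ens_id] = hits      # one Ensembl ID, several Entrez IDs
    return mapped, unmapped, ambiguous

# Placeholder IDs for illustration only.
lookup = {
    "ENSG_A": {"1001"},
    "ENSG_B": {"2001", "2002"},
}
mapped, unmapped, ambiguous = map_ensembl_to_entrez(
    ["ENSG_A", "ENSG_B", "ENSG_C"], lookup
)
print(mapped, unmapped, ambiguous)
```

Keeping the unmapped and ambiguous cases in separate buckets, rather than silently labeling both “Unknown”, makes it easier to tell an expected gap between the two databases from a lookup failure.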
That said, there does appear to be a problem in the mapping code, which caused some genes not to be assigned the Entrez Gene IDs they should have been. We’ll fix that for the next data release.
This problem has likely been out there for a while now. Thanks for reporting it so we can get it fixed!
Thanks Phil, for looking into it and providing the detailed explanation. I will look forward to the next data release.