Number of genes mutated in cell lines

Hi!
I hope you can help me. For my analysis I am counting the number of mutated genes across all cell lines. To do this I count the distinct gene names in the “Hugo_Symbol” column of the “CCLE_mutations.csv” file (Cellular Models Mutation Public 21Q1). This gives me 19540 genes that have mutations described in the file.
However, the description of the “CCLE_mutations.csv” file gives the number of genes as 18788. Is 18788 the number of genes mutated in the cell lines? If so, could you help me understand which parameter I should use to count the mutated genes correctly?

Thank you in advance,
Sincerely,
Darya

I think the discrepancy you are seeing comes from how we track genes. In the portal, we use the Entrez gene ID as the most reliable identifier for genes. Counting the unique Entrez IDs gives me the count the portal reports.

However, if I count the unique Hugo symbols, I get approximately the number of genes you reported.

> length(unique(a$Entrez_Gene_Id))
[1] 18788
> length(unique(a$Hugo_Symbol))
[1] 19541

The two counts should match, so this is inconsistent. If I had to guess, we may be concatenating data from new cell lines onto the previous release, so the same Entrez ID can appear in multiple rows, some with old symbols and some with new symbols.

(Because Hugo symbols do change, we avoid using them when loading data in the portal, which is probably one of the reasons we hadn’t noticed this internally.)
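
One way to check that guess would be to list the Entrez IDs that carry more than one symbol. Below is a minimal sketch in R. It assumes the same two column names as the file and uses a made-up stand-in table; the IDs and symbols are placeholders, not real genes.

```r
# Toy stand-in for the mutations table; every value here is a placeholder.
a <- data.frame(
  Hugo_Symbol    = c("OLD_NAME", "NEW_NAME", "GENE_B", "GENE_B"),
  Entrez_Gene_Id = c(101L, 101L, 102L, 102L),
  stringsAsFactors = FALSE
)

# Keep one row per distinct (Entrez ID, symbol) pair.
pairs <- unique(a[, c("Entrez_Gene_Id", "Hugo_Symbol")])

# Count how many distinct symbols each Entrez ID carries.
symbols_per_id <- table(pairs$Entrez_Gene_Id)

# Entrez IDs that appear with more than one symbol.
multi_ids <- names(symbols_per_id)[as.vector(symbols_per_id) > 1]

# The offending (ID, symbol) pairs, for inspection.
multi_symbol_rows <- pairs[as.character(pairs$Entrez_Gene_Id) %in% multi_ids, ]
print(multi_symbol_rows)
```

On the real file, `a` would be the data frame read from the CSV. If the guess is right, the rows printed should account for most of the gap between the two counts.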

I’ll circle back with the folks who generate these files and see if we can change our process to get consistent hugo symbols in future releases.

Thanks for pointing out this discrepancy.

Thanks,
Phil

Dear Mr. Phil,

Greetings. I have the same issue with this dataset.
I downloaded the 21Q4 dataset and counted the unique genes by Hugo_Symbol in the mutations file, and found 19536 genes, which does not match the reported count (18784).
How can I deal with this problem?

Also, I found that some genes have an Entrez_Gene_Id of 0.
I confirmed that those genes have synonyms in NCBI Gene.

What is the best way to treat those cases?


I am trying to update the gene identifiers so that they are unique.
Is it okay to use BioMart to convert the IDs, based on Human genes (GRCh38.p13)?
Are GRCh38.p13 and NCBI_Build (37) different?


I also found that the gene names in the expression data and the mutation data differ.
For example, CEP162, whose previous name was KIAA1009, can be found in the expression data.
But KIAA1009 cannot be found in the expression data.
Instead, KIAA1009 exists in the mutation data.
Could you consider synchronizing the gene names across the CCLE 21Q4 datasets?

I look forward to your reply.
Sincerely,
Songyeon