Number of genes mutated in cell lines

I hope you can help me. For my analysis I am counting the number of mutated genes across all cell lines. To do this I count the distinct gene names in the “Hugo_Symbol” column of the “CCLE_mutations.csv” file (Cellular Models Mutation Public 21Q1). This gives me 19540 genes that have mutations described in the file.
However, the description of the “CCLE_mutations.csv” file gives the number of genes as 18788. Is 18788 the number of genes mutated in cell lines? If so, could you help me understand which parameter I should use to count the mutated genes correctly?

Thank you in advance,

I think the discrepancy you are seeing comes from how we’re tracking genes. In the portal, we use the Entrez gene ID as the most reliable identifier for genes. Counting the unique Entrez IDs gives me the count the portal reports.

However, if I count the unique Hugo symbols, I get approximately the number of genes you reported.

> length(unique(a$Entrez_Gene_Id))
[1] 18788
> length(unique(a$Hugo_Symbol))
[1] 19541

The two counts should match, so this is inconsistent. If I had to guess, we may be concatenating data from new cell lines onto the previous release, so the same Entrez ID may appear in multiple rows, some with old symbols and some with new symbols.
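One way to check that guess is to list the Entrez IDs that appear with more than one Hugo symbol. The following is only a rough sketch on a made-up toy table: the gene symbols and IDs are invented for illustration, and `muts`, `id_symbol`, and `ambiguous` are hypothetical names. With the real file you would presumably load the table with something like `read.csv("CCLE_mutations.csv")` instead. Note also that this would only confirm one possible cause; other issues (for example missing Entrez IDs) could contribute to the gap as well.

```r
# Toy stand-in for the mutation table (invented symbols and IDs).
# With the real data, something like read.csv("CCLE_mutations.csv")
# would be used here instead.
muts <- data.frame(
  Hugo_Symbol    = c("GENE1", "GENE1_OLD", "GENE2", "GENE2", "GENE3"),
  Entrez_Gene_Id = c(101, 101, 202, 202, 303),
  stringsAsFactors = FALSE
)

# Keep one row per distinct (Entrez ID, symbol) pair.
id_symbol <- unique(muts[, c("Entrez_Gene_Id", "Hugo_Symbol")])

# For each row, count how many distinct symbols share its Entrez ID.
n_sym <- ave(seq_along(id_symbol$Hugo_Symbol),
             id_symbol$Entrez_Gene_Id,
             FUN = length)

# Rows whose Entrez ID is associated with more than one symbol.
ambiguous <- id_symbol[n_sym > 1, ]
print(ambiguous)

# The two counts compared in the thread, computed on the toy table.
n_entrez <- length(unique(muts$Entrez_Gene_Id))
n_hugo   <- length(unique(muts$Hugo_Symbol))
```

If the guess is right, `ambiguous` on the real file should contain pairs where an older and a newer symbol share one Entrez ID, and counting by `Entrez_Gene_Id` remains the safer way to count mutated genes.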

(Because hugo symbols do change, we avoid using them when loading data in the portal, which is probably one of the reasons we hadn’t noticed this internally.)

I’ll circle back with the folks who generate these files and see if we can change our process to get consistent hugo symbols in future releases.

Thanks for pointing out this discrepancy.