Duplicate Gene IDs in CCLE_expression.csv

I found this in data downloaded today and last September.

Column headers include both “RNASEH2A (10535)” and “THSD8 (10535)”. The former looks correct, while the latter most likely should be “THSD8 (111644133)”. There appear to be 15 duplicated Gene IDs in the data from both downloads.
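For anyone who wants to check their own copy, here is a minimal sketch of how the duplicates can be listed. It assumes every gene column header has the form "SYMBOL (GENE_ID)" as in the two examples above; the function names are my own and not part of any CCLE tooling.

```python
import csv
import re
from collections import defaultdict

# Assumed header shape: "SYMBOL (GENE_ID)", e.g. "RNASEH2A (10535)".
HEADER_RE = re.compile(r"^(?P<symbol>.+?)\s*\((?P<gene_id>\d+)\)$")


def find_duplicate_gene_ids(headers):
    """Return {gene_id: [symbols...]} for every Gene ID used by more than one column."""
    symbols_by_id = defaultdict(list)
    for header in headers:
        match = HEADER_RE.match(header.strip())
        if match is None:
            # Skip columns that do not look like gene columns (e.g. the index column).
            continue
        symbols_by_id[match.group("gene_id")].append(match.group("symbol"))
    return {gid: syms for gid, syms in symbols_by_id.items() if len(syms) > 1}


def read_headers(path):
    """Read only the first row of a CSV file, without loading the whole matrix."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))
```

Calling `find_duplicate_gene_ids(read_headers("CCLE_expression.csv"))` should then return one entry per duplicated Gene ID, with the symbols that share it.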

It is worrisome that there is no proper ontology or check for such duplications. The data in the two columns above are not identical, so the fact that THSD8 is also a synonym for RNASEH2A seems likely to be a distraction.

Thanks for reporting this. We will fix this in the upcoming data releases.

We looked further into this, and the issue appears to originate from BioMart. A few Entrez IDs do not map uniquely to HUGO symbols. We’ll follow up on this to figure out the cause.

Thanks, Javad.

NCBI certainly has unique Gene IDs for those two genes. Hugo has the correct THSD8 NCBI Gene ID as well. So Hugo or NCBI can provide self-consistent data (gene_info.gz from NCBI has both loci as these sources report). Apparently, BioMart is not to be trusted to have a consistent version of Symbol <-> Gene ID mappings.
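As a rough sketch of how the headers could be cross-checked against NCBI, the following builds a Symbol -> Gene ID lookup from gene_info-style lines and reports headers that disagree with it. It assumes the tab-separated gene_info layout with tax_id, GeneID and Symbol as the first three columns, and human entries under tax_id 9606; the function names are hypothetical.

```python
def load_symbol_to_gene_id(lines, tax_id="9606"):
    """Build {symbol: gene_id} from gene_info-style tab-separated lines.

    Assumes columns 1-3 are tax_id, GeneID, Symbol and that header
    lines start with '#'.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != tax_id:
            continue
        mapping[fields[2]] = fields[1]
    return mapping


def mismatched_headers(pairs, reference):
    """Return (symbol, header_gene_id, reference_gene_id) where the two IDs differ.

    `pairs` is an iterable of (symbol, gene_id) tuples taken from the column
    headers; symbols missing from the reference are ignored here.
    """
    return [
        (symbol, gene_id, reference[symbol])
        for symbol, gene_id in pairs
        if symbol in reference and reference[symbol] != gene_id
    ]
```

The compressed file can be streamed with `gzip.open("gene_info.gz", "rt")` and passed directly as `lines`, since the function only iterates over it once.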