Duplicate Entrez ID in the 22Q1 expression dataset

File: CCLE_expression.csv
Release: 22Q1

As part of my standard analysis, I use a regex on the column names to keep only the Entrez IDs, since I find them more stable than Hugo symbols. There appear to be two different columns with the same entrez_id: ARHGEF18 (23370) and AC008878.3 (23370). I believe the latter is just an RNA transcript, so might it have been labeled with this entrez_id by mistake?
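
For reference, here is a minimal sketch of the kind of check that surfaces this. It assumes every gene column is labeled "SYMBOL (ENTREZ_ID)", as in the two columns quoted above. The function name and the sample list are illustrative only; TP53 (7157) is added just as a non-duplicated example.

```python
import re
from collections import defaultdict

# Assumed label format, based on the two columns quoted above: "SYMBOL (ENTREZ_ID)".
LABEL_RE = re.compile(r"^(?P<symbol>.+?) \((?P<entrez>\d+)\)$")

def find_duplicate_entrez(labels):
    """Return {entrez_id: [labels...]} for IDs that appear in more than one column."""
    by_id = defaultdict(list)
    for label in labels:
        match = LABEL_RE.match(label.strip())
        if match:  # skip columns (e.g. a cell-line index) that do not fit the pattern
            by_id[match.group("entrez")].append(label)
    return {eid: cols for eid, cols in by_id.items() if len(cols) > 1}

# Illustrative subset of column labels, not the full file header.
columns = ["ARHGEF18 (23370)", "AC008878.3 (23370)", "TP53 (7157)"]
duplicates = find_duplicate_entrez(columns)
print(duplicates)  # {'23370': ['ARHGEF18 (23370)', 'AC008878.3 (23370)']}
```

Keeping the full labels in the result, rather than only the IDs, makes it easy to see which symbols collide.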

Hi e333,

We generate expression using STAR + RSEM with GENCODE gene and transcript names and locations. This relies on Ensembl’s definition of genes, transcripts, and other transcribed DNA regions. We then convert these IDs to the “hugo_name (entrez_id)” format. However, Entrez and Ensembl define a gene differently, and although we release Entrez IDs, we are using Ensembl’s definition of what a gene is. We chose Ensembl’s annotations, and this pipeline in general, to stay as close as possible to the GTEx pipeline.

In this case (as in multiple others), I believe this ensembl_id/entrez_id renaming explains what you are seeing: two distinct Ensembl genes can end up labeled with the same Entrez ID.
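
A toy sketch of how a many-to-one renaming can produce this (not the actual pipeline code). "ENSG_A" and "ENSG_B" are placeholders rather than real Ensembl IDs, and the mapping table and function name are hypothetical. Only the two symbols and the Entrez ID come from the question.

```python
from collections import Counter

# Hypothetical mapping for illustration only: "ENSG_A" and "ENSG_B" are
# placeholders, not real Ensembl IDs. Symbols and Entrez ID are from the question.
ENSEMBL_TO_NAMES = {
    "ENSG_A": ("ARHGEF18", "23370"),
    "ENSG_B": ("AC008878.3", "23370"),
}

def relabel(ensembl_ids, mapping):
    """Turn Ensembl gene IDs into 'hugo_name (entrez_id)' column labels."""
    return [f"{mapping[gene][0]} ({mapping[gene][1]})" for gene in ensembl_ids]

labels = relabel(["ENSG_A", "ENSG_B"], ENSEMBL_TO_NAMES)
# Count how many Ensembl genes share each Entrez ID after the renaming.
entrez_counts = Counter(entrez for _, entrez in ENSEMBL_TO_NAMES.values())

print(labels)         # ['ARHGEF18 (23370)', 'AC008878.3 (23370)']
print(entrez_counts)  # Counter({'23370': 2})
```

The full labels stay unique, so keying on the whole column label rather than the Entrez ID alone avoids the collision.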

This conversion is often confusing, and we are thinking of releasing the Ensembl IDs directly in future releases instead.