Duplicate Entrez ID in the 22Q1 expression dataset

e333 · April 6, 2022, 1:46am

File: CCLE_expression.csv
Release: 22Q1

As part of my standard analysis, I apply a regex to the column names to keep only the Entrez IDs, as I find them more stable than HUGO symbols. Two different columns appear to share the same entrez_id, namely ARHGEF18 (23370) and AC008878.3 (23370). I believe the latter is just an RNA transcript, so it may have been mistakenly labeled with an entrez_id?
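For anyone wanting to reproduce the check, here is a minimal sketch of the kind of parsing described above. It assumes the column labels follow the "SYMBOL (ENTREZ_ID)" pattern seen in the two examples; the function name and the short header list are hypothetical, not taken from any DepMap tooling.

```python
import re
from collections import defaultdict

# Assumed label format, based on the examples in this post: "SYMBOL (ENTREZ_ID)".
LABEL_RE = re.compile(r"^(?P<symbol>.+?)\s*\((?P<entrez>\d+)\)$")


def find_duplicate_entrez(columns):
    """Map each Entrez ID that appears more than once to its column labels."""
    by_id = defaultdict(list)
    for label in columns:
        match = LABEL_RE.match(label.strip())
        if match:  # labels that do not fit the pattern are skipped
            by_id[match.group("entrez")].append(label)
    return {eid: labels for eid, labels in by_id.items() if len(labels) > 1}


# Small illustrative header excerpt (hypothetical selection of columns).
columns = ["ARHGEF18 (23370)", "AC008878.3 (23370)", "TP53 (7157)"]
duplicates = find_duplicate_entrez(columns)
print(duplicates)
```

On the real file, the same function could be run on the header row of CCLE_expression.csv to list every colliding ID, not just 23370.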

jkobject · April 13, 2022, 3:51pm

Hi e333,

We generate expression using STAR + RSEM with GENCODE gene and transcript names and locations. This relies on Ensembl’s definition of genes, transcripts, and other transcribed DNA regions. We then convert these IDs to “hugo_name (entrez_id)”. But Entrez’s and Ensembl’s definitions of a gene differ, and although we release Entrez IDs, we are using Ensembl’s definition of what a gene is. We chose Ensembl’s annotations, and this pipeline in general, to stay as close as possible to the GTEx pipeline.

In this case (as in multiple others), I believe this ensembl_id/entrez_id renaming explains why you are seeing this.
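Until that changes, one possible workaround on the user side is sketched below. It is only an illustration under assumptions, not part of our pipeline: it assumes labels end in "(ENTREZ_ID)", and the helper name is hypothetical. A column is renamed to its Entrez ID only when that ID is unique; colliding columns keep their full label so nothing is silently overwritten.

```python
import re
from collections import Counter

# Assumption: each gene column label ends with "(ENTREZ_ID)".
ENTREZ_RE = re.compile(r"\((\d+)\)\s*$")


def entrez_or_full_label(columns):
    """Return new column names: the Entrez ID if unique, else the original label."""
    ids = []
    for label in columns:
        match = ENTREZ_RE.search(label)
        ids.append(match.group(1) if match else None)
    counts = Counter(i for i in ids if i is not None)
    return [
        entrez if entrez is not None and counts[entrez] == 1 else label
        for label, entrez in zip(columns, ids)
    ]


# Hypothetical header excerpt for illustration.
renamed = entrez_or_full_label(
    ["ARHGEF18 (23370)", "AC008878.3 (23370)", "TP53 (7157)"]
)
print(renamed)
```

Whether to keep, drop, or merge the colliding columns afterwards depends on the analysis; this sketch only makes the collision visible.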

This conversion is often confusing, and we are considering releasing the Ensembl IDs directly in future releases.

Best,
