Missing genes in TOPMed gencode30 gtf file

Hello,

I wanted to apply TMM normalization and RPKM standardization to your recent CCLE expression release (CCLE_expression_v2.csv). To get the gene lengths, I downloaded the GTF file from your GitHub repository (gencode.v30.GRCh38.ERCC.genes.collapsed_only.gtf.gz). However, this file does not contain gene coordinates for 235 genes that are part of the expression file (e.g. ENSG00000011052, ENSG00000026036, ENSG00000064489, ENSG00000093100, ENSG00000108825, ENSG00000114786, …). Did I misunderstand the pipeline or use the wrong GTF file?
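For anyone who wants to reproduce this kind of check, here is a minimal sketch of how the missing IDs could be listed. It assumes the GTF attribute column carries `gene_id "ENSG….N"` entries and that the expression file's gene IDs have already been extracted into a list; the function names `gtf_gene_ids` and `missing_genes` are made up for illustration and are not part of any pipeline.

```python
import re

# Assumption: attributes look like  gene_id "ENSG00000000003.14"; ...
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gtf_gene_ids(gtf_lines):
    """Collect Ensembl gene IDs (version suffix stripped) from GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            # Drop everything after the first dot (the version suffix)
            ids.add(match.group(1).split(".")[0])
    return ids


def missing_genes(expression_gene_ids, gtf_lines):
    """Return expression gene IDs that have no record in the GTF, sorted."""
    annotated = gtf_gene_ids(gtf_lines)
    unversioned = {g.split(".")[0] for g in expression_gene_ids}
    return sorted(unversioned - annotated)
```

A compressed file can be passed directly as an iterable of lines, e.g. `with gzip.open(path, "rt") as fh: missing_genes(ids, fh)`. Stripping the version suffix matters here, because the same gene can carry different version numbers across annotation releases.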

Thanks a lot

Hi. We have made some changes to the GTEx pipeline, and some of our annotations may be out of date. Please use Gencode v29 instead. You can find the file here:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gff3.gz
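In case it helps with the RPKM step, here is a rough sketch of deriving a per-gene length from a GFF3 file like the one linked above. It takes the gene length to be the union of all exon intervals per gene, which is an assumption on my part and may not match the collapsed gene model used in the pipeline exactly. It also assumes each exon record has a `gene_id=` attribute; the function names are hypothetical.

```python
from collections import defaultdict


def exon_union_lengths(gff3_lines):
    """Map gene_id -> number of bases covered by at least one exon."""
    exons = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue  # only exon records contribute to the length
        # GFF3 attributes are key=value pairs separated by ';'
        attrs = dict(kv.split("=", 1) for kv in f[8].split(";") if "=" in kv)
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Keep the ID exactly as written in the file (including any
        # version suffix); strip it later if your expression IDs are
        # unversioned.
        exons[gene_id].append((int(f[3]), int(f[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or adjacent exon: extend the current block
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1  # coordinates are 1-based, inclusive
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)
```

TMM scaling factors are normally computed with a dedicated package (edgeR, for example) before this step; the sketch only covers the length and RPKM arithmetic.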

Hey, this has now been updated. As of 20Q4, we are still using Gencode v29 for RSEM and STAR. However, additional gene annotations (HGNC symbols, Entrez IDs, biotype information…) now come from the latest version of the Ensembl BioMart database: https://uswest.ensembl.org/info/data/index.html

@jkobject Does this mean that for the latest CCLE_expression.csv file, reads were still mapped to genes (STAR) using Gencode v29?

Yes, they were. We will make it clear in our changelog/README file if we make a change to the gene mapping. You can see which inputs we used in the latest run of our pipeline by looking at this file in our GitHub:

This is the configuration file used in our terra.bio pipeline.

Best,

@jkobject Great, thanks so much!
