Missing genes in TOPMed gencode30 gtf file

jkreis · July 7, 2020, 10:20am

Hello,

I wanted to apply TMM normalization and RPKM standardization to your recent CCLE expression release (CCLE_expression_v2.csv). To get the gene lengths, I downloaded the GTF file from your GitHub repository (gencode.v30.GRCh38.ERCC.genes.collapsed_only.gtf.gz). However, this file does not contain gene coordinates for 235 genes that are part of the expression file (e.g. ENSG00000011052, ENSG00000026036, ENSG00000064489, ENSG00000093100, ENSG00000108825, ENSG00000114786, …). Did I misunderstand the pipeline or use the wrong GTF file?
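For reference, a comparison like the one described above could be sketched as follows. This is a minimal sketch, not the pipeline's own code: the function names `gtf_gene_ids` and `missing_from_gtf` are hypothetical, and it assumes that each expression column header contains an unversioned Ensembl gene ID somewhere in its text and that the GTF stores IDs in a `gene_id "..."` attribute, possibly with a version suffix.

```python
import re

# Assumption: Ensembl human gene IDs look like ENSG followed by 11 digits.
ENSG = re.compile(r"ENSG\d{11}")
GENE_ID_ATTR = re.compile(r'gene_id "([^"]+)"')


def gtf_gene_ids(gtf_lines):
    """Collect gene IDs (version suffix stripped) from an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = GENE_ID_ATTR.search(fields[8])
        if match:
            # "ENSG00000223972.5" -> "ENSG00000223972"
            ids.add(match.group(1).split(".")[0])
    return ids


def missing_from_gtf(expression_columns, gtf_ids):
    """Return Ensembl IDs found in the column headers but absent from the GTF."""
    missing = []
    for column in expression_columns:
        match = ENSG.search(column)
        if match and match.group(0) not in gtf_ids:
            missing.append(match.group(0))
    return missing
```

To run it on real files, the GTF could be opened with `gzip.open(path, "rt")` and passed directly to `gtf_gene_ids`, and the column headers read from the first line of the CSV.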

Thanks a lot

jnoorbak · July 13, 2020, 3:02pm

Hi. We have made some changes to the GTEx pipeline, and some of our annotations may also be out of date. Please use Gencode v29 instead. You can find the file here:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gff3.gz
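As a rough illustration of how gene lengths for RPKM might be derived from such a GFF3 file, here is a minimal sketch. It assumes that exon records carry a `gene_id=` attribute (with a version suffix) and it defines gene length as the union of all exon intervals per gene, which is one common convention and not necessarily the one used by the pipeline. The function names are hypothetical, and genes duplicated in the pseudoautosomal regions are not treated specially.

```python
from collections import defaultdict


def gene_lengths_from_gff3(gff3_lines):
    """Return {gene_id: length in bp}, where length is the union of exon intervals."""
    exons = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#"):
            continue  # directive or comment
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Column 9 is a ';'-separated list of key=value pairs.
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Strip the version suffix; coordinates are 1-based and inclusive.
        exons[gene_id.split(".")[0]].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:  # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene per million reads in the library."""
    return count * 1e9 / (gene_length_bp * library_size)
```

When combining this with TMM, the `library_size` argument would typically be the effective library size (raw library size multiplied by the TMM normalization factor), which is how edgeR handles it.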

jkobject · November 10, 2020, 2:47pm

Hey, this has now been updated. Starting with 20Q4, we are still using Gencode v29 in our RSEM and STAR algorithms. However, additional gene annotations (HGNC symbols, Entrez IDs, biotype information…) now come from the latest version of the Ensembl BioMart database: https://uswest.ensembl.org/info/data/index.html

yonnierosensk · April 25, 2021, 1:36pm

@jkobject Does this mean that, for the latest CCLE_expression.csv file, reads were still mapped to genes (STAR) using Gencode v29?

jkobject · April 26, 2021, 4:05pm

Yes, they are. We will make it clear in our changelog/README file if we change the gene mapping. You can see which inputs we used in the latest run of our pipeline by looking at this file in our GitHub repository:

https://github.com/broadinstitute/depmap_omics/blob/master/RNA_pipeline/all_configs.json

{"GENERAL": {"accessLevel": "OWNER", "bucketOptions": {"requesterPays": false}, "canCompute": true, "canShare": true, "catalog": false, "owners": ["jkalfon@broadinstitute.org", "aborah@broadinstitute.org", "jnoorbak@broadinstitute.org", "ccle-pipeline@firecloud.org", "gmiller@broadinstitute.org"], "workspace": {"attributes": {"ref_fasta": "gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta", "dbSnpVcfIndex": "gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dbsnp138.vcf.idx", "ref_dict": "gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dict", "hg38_star_fusion_ctat_files_v33": {"itemsType": "AttributeValue", "items": ["gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/blast_pairs.idx", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa.fai", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_annot.prot_info.dbm", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/fusion_annot_lib.idx", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/AnnotFilterRule.pm", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/pfam_domains.dbm", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_annot.gtf.mini.sortu", 
"gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_annot.gtf.gene_spans", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/trans.blast.align_coords.align_coords.dbm", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_annot.cds", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/trans.blast.align_coords.align_coords.dat", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_annot.pep", "gs://ccle_default_params/references/GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/ref_annot.gtf"]}, "description": "## DepMap RNAseq for HG38 \n\nThis workspace contains the workflows and pipelines used by DepMap to generate RNAseq-based features for cell lines. This workspace is for hg38 based alignment, which will start being released in 19Q2.\n\n### Expression\nWe use the GTEx pipeline (https://github.com/broadinstitute/gtex-pipeline/blob/v9/TOPMed_RNAseq_pipeline.md).\nTo generate the expression dataset, run the following tasks on all samples that you need, in this order:\n`samtofastq_v1-0_BETA_cfg `\n(broadinstitute_gtex/samtofastq_v1-0_BETA Snapshot ID: 5)\n`star_v1-0_BETA_cfg`\n(broadinstitute_gtex/star_v1-0_BETA Snapshot ID: 7)\n`rsem_v1-0_BETA_cfg`\n(broadinstitute_gtex/rsem_v1-0_BETA Snapshot ID: 4)\nrsem_aggregate_results_v1-0_BETA_cfg (broadinstitute_gtex/rsem_aggregate_results_v1-0_BETA Snapshot ID: 3)\n\nThe outputs to be downloaded will be saved under the sample set that you ran. The outputs we use for the release are:\n- `rsem_genes_expected_count`\n- `rsem_genes_tpm`\n- `rsem_transcripts_tpm`\n\n**Make sure that you delete the intermediate files. These files are quite large so cost a lot to store. 
To delete, you can either write a task that deletes them or use gsutil rm***\n\n### Fusions\nWe use STAR-Fusion https://github.com/STAR-Fusion/STAR-Fusion/wiki. The fusions are generated by running the following tasks\nhg38_STAR_fusion (gkugener/STAR_fusion Snapshot ID: 14)\nAggregate_Fusion_Calls (gkugener/Aggregate_files_set Snapshot ID: 2)\n\nThe outputs to be downloaded will be saved under the sample set you ran. The outputs we use for the release are: \nfusions_star\n\nThis task uses the same samtofastq_v1-0_BETA_cfg task as in the expression pipeline, although in the current implementation, this task will be run twice. It might be worth combing the expression/fusion calling into a single workflow. This task also contains a flag that lets you specify if you want to delete the intermediates (fastqs). \n\nThere are several other tasks in this workspace. In brief:\n- Tasks prefixed with __EXPENSIVE__ or __CHEAP__ are identical to their non-prefixed version, except that they specify different memory, disk space, etc. parameters. These versions can be used when samples fail the no

This file has been truncated.

This is the configuration file used in our terra.bio pipeline.
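Since the configuration file is a large nested JSON document, one way to locate the annotation-related inputs is to search all string values for a keyword such as "gencode". The following is a minimal sketch under that assumption; `find_strings` is a hypothetical helper, and the sample input only mimics the shape of the real file.

```python
import json


def find_strings(node, needle, path=""):
    """Recursively collect (path, value) pairs for strings containing `needle`."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_strings(value, needle, f"{path}/{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_strings(value, needle, f"{path}/{index}"))
    elif isinstance(node, str) and needle.lower() in node.lower():
        hits.append((path, node))
    return hits


# Illustrative input only; a real run would use json.load() on a local
# copy of all_configs.json instead of this inline string.
sample = json.loads(
    '{"GENERAL": {"workspace": {"attributes": {'
    '"example_annotation": "gs://example-bucket/gencode.example.gtf", '
    '"example_other": "gs://example-bucket/reference.fasta"}}}}'
)

for found_path, found_value in find_strings(sample, "gencode"):
    print(found_path, "->", found_value)
```

The search is case-insensitive and reports the location of each match inside the JSON tree, which makes it easier to tell which workflow input a given reference file belongs to.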

Best,

yonnierosensk · April 26, 2021, 8:08pm

@jkobject Great, thanks so much!
