Mutation definition

bioinformatics · June 21, 2021, 11:38am

Hi,
Could you please tell me how to define mutations in the mutation dataset? What is the normal sample for a cell line? Is a mutation defined as the locus in a cell line different from that in the normal sample?

jnoorbak · June 28, 2021, 1:56pm

Hi, unfortunately the historical cell lines lack a matched normal. So we use an arbitrary and unrelated normal sample (a ‘pseudonormal’) for our somatic mutation calling.

bioinformatics · July 2, 2021, 12:19pm

Thank you for your reply.

1. Is the SNV calling conducted with the same “pseudonormal” sample across all cell lines?
2. Why don’t you just use the reference genome (like hg19) as a background sample?
3. Why don’t you supply the mutations located in non-coding regions?

Thank you!

bioinformatics · July 3, 2021, 12:41pm

Another question,
4. Could you please supply the link to the ‘pseudonormal’ sample so that I can call SNVs on a new cell line?

jnoorbak · July 6, 2021, 7:22pm

1. No, our ICE WES, Agilent WES, and WGS data each use a different pseudonormal, sequenced with the corresponding technology.
2. The somatic mutation calling pipelines require a normal BAM file. It is recommended that this BAM file come from the same sequencing technology, to avoid detecting artifacts as mutations. It would be interesting to use the reference genome and compare, but our pipelines are currently not set up to read in a FASTA file as the normal.
3. We’re working on releasing the noncoding mutations for our WGS data. There are some limitations in terms of dbGaP permissions for some of our lines, but we are planning to provide this information for a subset of lines in the near future.
4. For our Agilent WES samples we use:
gs://firecloud-tcga-open-access/tutorial/bams/C835.HCC1143_BL.4.bam
I think this is publicly available.
For our ICE WES samples we use a germline blood sample from the CCLF project:
gs://fc-38a1a377-72c6-4e90-917f-e4bb709b8f2c/CCLF_RCRF1009-Normal-SM-F3R8L/seq_data_v2/CCLF_RCRF1009GL.bam
I’m not sure if this is publicly available, but give it a shot and let me know.
For WGS we use this GTEx sample:
GTEX-111FC-0001-SM-6WBTJ
I think you’d need dbGaP access for this one and would need to get it through the GTEx project.
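
For anyone scripting this, the technology-to-pseudonormal pairing described in this reply could be kept in a small lookup like the one below. This is only a sketch: the dictionary name, the function name, and the technology labels used as keys are illustrative choices and not part of any CCLE pipeline, and the WGS entry is a GTEx sample ID rather than a file path.

```python
# Pseudonormal listed in this thread for each sequencing technology.
# The key strings are illustrative labels, not official identifiers.
PSEUDONORMALS = {
    "Agilent WES": "gs://firecloud-tcga-open-access/tutorial/bams/C835.HCC1143_BL.4.bam",
    "ICE WES": (
        "gs://fc-38a1a377-72c6-4e90-917f-e4bb709b8f2c/"
        "CCLF_RCRF1009-Normal-SM-F3R8L/seq_data_v2/CCLF_RCRF1009GL.bam"
    ),
    # GTEx sample ID (not a path); the thread suggests dbGaP access is needed.
    "WGS": "GTEX-111FC-0001-SM-6WBTJ",
}


def pseudonormal_for(technology: str) -> str:
    """Return the pseudonormal matching the tumor sample's sequencing technology."""
    try:
        return PSEUDONORMALS[technology]
    except KeyError:
        known = ", ".join(sorted(PSEUDONORMALS))
        raise ValueError(
            f"No pseudonormal listed for {technology!r}; known: {known}"
        ) from None


if __name__ == "__main__":
    print(pseudonormal_for("Agilent WES"))
```

Failing loudly on an unknown technology is deliberate here: silently falling back to a normal from another technology is exactly the mismatch that risks reporting artifacts as mutations.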

bioinformatics · July 9, 2021, 6:32am

Thank you for your help. If I understand correctly, a given cell line may be sequenced with several technologies (e.g. WGS, WES, RNA-seq). That is why ‘CCLE_mutations.csv’ has columns such as ‘WGS_AC’ and ‘SangerRecalibWES_AC’. However, if different technologies have different pseudonormals, the reference allele should also differ (at least at some loci). Why does ‘CCLE_mutations.csv’ contain only one ‘Reference_Allele’ column rather than, say, ‘Reference_Allele_WGS’ and ‘Reference_Allele_SangerRecalibWES’?

jnoorbak · July 9, 2021, 12:45pm

No, the reference allele comes from the reference genome, not from the pseudonormal. This is the standard output of most mutation callers, including Mutect and Strelka, which are used in our pipelines.
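
To illustrate the point, here is a minimal sketch of how a tumor/normal VCF record is laid out. The record itself is entirely made up (coordinates, alleles, and read depths are hypothetical), as are the function name and the sample labels. It only shows that a VCF line has a single REF field, taken from the reference genome, while the normal and the cell line each contribute their own per-sample column.

```python
# The eight fixed VCF columns plus FORMAT; per-sample columns follow.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_record(line: str, sample_names: list[str]) -> dict:
    """Split one tab-separated VCF data line into fixed fields and sample fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    # Everything after FORMAT is one column per sample, in header order.
    record["samples"] = dict(zip(sample_names, fields[9:]))
    return record


# Hypothetical record: the values are invented for illustration only.
line = "\t".join(
    ["chr1", "12345", ".", "C", "T", ".", "PASS", ".", "GT:AD", "0/0:30,0", "0/1:18,12"]
)
record = parse_vcf_record(line, ["pseudonormal", "cell_line"])

print(record["REF"], record["ALT"])  # one REF/ALT pair per site
print(record["samples"])             # genotype and depths per sample
```

Swapping in a different pseudonormal would change the values in the normal's sample column, and possibly which sites are called, but not the REF field.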

bioinformatics · July 10, 2021, 5:01am

Thank you for your help!
