Mutation definition

Could you please tell me how mutations are defined in the mutation dataset? What is the normal sample for a cell line? Is a mutation defined as a locus where the cell line differs from the normal sample?

Hi, unfortunately the historical cell lines lack a matched normal, so we use an arbitrary, unrelated normal sample (a ‘pseudonormal’) for our somatic mutation calling.

Thank you for your reply.

  1. Is the SNV calling conducted with the same “pseudonormal” sample across all cell lines?
  2. Why don’t you just use the reference genome (like hg19) as a background sample?
  3. Why don’t you supply the mutations located in non-coding regions?

Thank you!

Another question,
4. Could you please supply the link to the ‘pseudonormal’ sample so that I can call SNVs on a new cell line?

  1. No, our ICE WES, Agilent WES, and WGS data each have a different pseudonormal, sequenced with the corresponding technology.
  2. The somatic mutation calling pipelines require a normal BAM file. It is recommended that this BAM file come from the same sequencing technology as the tumor sample, to avoid calling technology-specific artifacts as mutations. It would be interesting to use the reference genome and compare, but our pipelines are currently not set up to read a FASTA file as the normal.
  3. We’re working on releasing the noncoding mutations for our WGS. There are some limitations in terms of dbGaP permissions for some of our lines. But we are planning to provide this information for a subset of lines in the near future.
  4. For our Agilent WES samples we use:
    I think this is publicly available.
    For our ICE WES samples we use a germline blood sample from the CCLF project:
    I’m not sure if this is publicly available, but give it a shot and let me know.
    For WGS we use this GTEx sample:
    I think you’d need dbGaP access for this one and need to get it through the GTEx project.
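The rule in answers 1 and 2 (one pseudonormal per sequencing technology, always matched to the tumor sample's technology) could be sketched as a simple lookup. This is only an illustration: the file paths and the `pick_pseudonormal` helper are hypothetical placeholders, not the actual files or code used in the pipeline.

```python
# Hypothetical placeholder paths; the thread does not give the real locations.
PSEUDONORMALS = {
    "ICE_WES": "pseudonormals/ice_wes_normal.bam",
    "AGILENT_WES": "pseudonormals/agilent_wes_normal.bam",
    "WGS": "pseudonormals/wgs_normal.bam",
}


def pick_pseudonormal(technology):
    """Return the normal BAM that matches the tumor's sequencing technology."""
    # Normalize labels such as "Agilent WES" or "wgs" to the dictionary keys.
    key = technology.strip().upper().replace(" ", "_")
    try:
        return PSEUDONORMALS[key]
    except KeyError:
        # Refuse to fall back to a normal from a different technology.
        raise ValueError(f"no pseudonormal configured for {technology!r}") from None


normal_bam = pick_pseudonormal("Agilent WES")
print(normal_bam)
```

Raising an error for an unknown technology, instead of falling back to some other normal, reflects the point above: a mismatched normal risks artifacts being called as mutations.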

Thank you for your help. If I understand correctly, a cell line may be sequenced with several technologies (e.g. WGS, WES, RNA-seq). That is why ‘CCLE_mutations.csv’ has columns such as ‘WGS_AC’, ‘SangerRecalibWES_AC’, and so on. However, if different technologies use different pseudonormals, the reference allele should also differ between them, at least at some loci. Why does ‘CCLE_mutations.csv’ contain only one ‘Reference_Allele’ column rather than separate ‘Reference_Allele_WGS’ and ‘Reference_Allele_SangerRecalibWES’ columns?

No, the reference allele comes from the reference genome, not from the pseudonormal. This is the standard output of most mutation callers, including Mutect and Strelka, which we use in our pipelines.
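To make this concrete, here is a minimal sketch of how one row can carry a single ‘Reference_Allele’ alongside per-technology allele counts. Only the column names ‘Reference_Allele’, ‘WGS_AC’, and ‘SangerRecalibWES_AC’ come from this thread; the other columns, the toy values, and the "alt:ref" count format are assumptions made for illustration, so check them against the real file.

```python
import csv
import io

# Toy data. Only Reference_Allele, WGS_AC and SangerRecalibWES_AC are column
# names mentioned in the thread; all other columns and values are made up.
TOY_CSV = """Hugo_Symbol,Chromosome,Start_position,Reference_Allele,Alternate_Allele,WGS_AC,SangerRecalibWES_AC
GENE_A,1,1000,C,T,12:30,8:25
GENE_B,2,2000,G,A,,15:15
"""


def parse_ac(value):
    """Parse an assumed 'alt:ref' read-count string; None if the cell is empty."""
    if not value:
        return None
    alt, ref = value.split(":")
    return int(alt), int(ref)


def summarize(csv_text):
    """Collect the shared reference allele and per-technology counts per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "ref": row["Reference_Allele"],  # one value, from the reference genome
            "wgs": parse_ac(row["WGS_AC"]),
            "wes": parse_ac(row["SangerRecalibWES_AC"]),
        })
    return rows


summary = summarize(TOY_CSV)
for entry in summary:
    print(entry)
```

The reference allele is a property of the genomic position, so it is stored once; only the supporting read counts vary by technology, and a count is simply absent when a line was not sequenced with that technology.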

Thank you for your help!