Interpretation of mutation data files


We are using pre-processed mutation files:


However, the data in these files do not overlap perfectly with the all-inclusive MAF file.

For example, the full file (CCLE_mutations.csv, loaded here as x) contains the following record:

x[which(x$DepMap_ID == "ACH-000842" & x$Hugo_Symbol == "ERBB2"), ]

       Hugo_Symbol Entrez_Gene_Id NCBI_Build Chromosome Start_position
376648       ERBB2           2064         37         17       37884214
       End_position Strand Variant_Classification Variant_Type Reference_Allele
376648     37884214      +      Nonsense_Mutation          SNP                G
       Tumor_Seq_Allele1 dbSNP_RS dbSNP_Val_Status       Genome_Change
376648                 T                           g.chr17:37884214G>T
       Annotation_Transcript  DepMap_ID cDNA_Change         Codon_Change
376648     ENST00000269571.5 ACH-000842   c.3685G>T c.(3685-3687)Gag>Tag
       Protein_Change isDeleterious isTCGAhotspot TCGAhsCnt isCOSMIChotspot
376648       p.E1229*          True         False         0           False
       COSMIChsCnt ExAC_AF Variant_annotation CGA_WES_AC HC_AC   RD_AC
376648           0      NA           damaging             9:29 167:554
       RNAseq_AC SangerWES_AC WGS_AC
376648    33:118

This mutation is annotated as “damaging”, but it is not present in the boolean file CCLE_mutations_bool_damaging.csv.

There appear to be many such cases. We are probably misinterpreting what the file names imply about the files' contents. Any help would be greatly appreciated.

Thank you,

Hi. I suspect that many of these mutations are low-allele-frequency cases, like the example you shared. In the boolean matrices, we drop any mutation with an allele frequency below 0.25.
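
As a rough check of that explanation, here is a minimal sketch in R. It assumes the *_AC fields store read counts as alt:ref (worth confirming against the release notes) and computes the allele frequency as alt / (alt + ref). The helper name allele_freq is made up for illustration, and the exact rule used to build the boolean matrices may differ from this.

```r
# Allele-count strings copied from the ERBB2 record above.
# Assumption: each string is "alt:ref" (alternate reads : reference reads).
ac <- c("9:29", "167:554", "33:118")

# Hypothetical helper: turn "alt:ref" strings into allele frequencies.
allele_freq <- function(ac) {
  parts <- strsplit(ac, ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)  # empty or malformed field
    counts <- as.numeric(p)
    counts[1] / sum(counts)               # alt / (alt + ref)
  }, numeric(1))
}

af <- allele_freq(ac)
round(af, 3)   # 0.237 0.232 0.219
af < 0.25      # TRUE TRUE TRUE
```

Under that assumption, every allele count reported for this record falls below 0.25, which would be consistent with the mutation being dropped from the boolean matrix.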

Thank you very much for your answer.