Hello! I ran an analysis of mutation data based on the 22Q2 release, which I’m now updating to use 23Q2 data. I noticed a big difference in damaging mutation calls between the “CCLE_mutations_bool_damaging.csv” (22Q2) and “OmicsSomaticMutationsMatrixDamaging.csv” (23Q2) files for some genes, most notably CYP2D6 and TP53. I filtered both mutation matrices to their common cell lines and, because 22Q2 contains 0/1 while 23Q2 contains 0/1/2, compared only the “0” (no damaging mutation) entries. I found only 29% agreement for CYP2D6 and only 42% for TP53, with smaller differences for other genes (MUC12, KIR2DL1, TTN, FCGBP, CDKN2A, LAMA1, and ARID1A all show < 90% agreement).
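For reference, a minimal sketch of this kind of comparison in Python with pandas. It assumes both matrices are loaded with cell lines as the row index and genes as columns, and it uses one possible definition of agreement: the fraction of common cell lines where both releases give "0" or both give a non-zero call. The function name and the toy data are illustrative only and are not taken from the actual files.

```python
import pandas as pd


def zero_call_agreement(old: pd.DataFrame, new: pd.DataFrame) -> pd.Series:
    """Per-gene fraction of shared cell lines with matching zero/non-zero calls.

    Assumes rows are cell lines and columns are genes in both matrices.
    """
    # Restrict both matrices to the cell lines and genes they share.
    common_lines = old.index.intersection(new.index)
    common_genes = old.columns.intersection(new.columns)

    # Collapse 0/1 and 0/1/2 encodings to "has a damaging call" (True/False).
    old_mut = old.loc[common_lines, common_genes] != 0
    new_mut = new.loc[common_lines, common_genes] != 0

    # Mean of the element-wise match gives per-gene agreement; worst genes first.
    return (old_mut == new_mut).mean().sort_values()


# Toy stand-ins for the two releases (hypothetical values, not real calls).
old = pd.DataFrame(
    {"GENE_A": [0, 1, 0, 1], "GENE_B": [0, 0, 1, 1]},
    index=["LINE_1", "LINE_2", "LINE_3", "LINE_4"],
)
new = pd.DataFrame(
    {"GENE_A": [0, 2, 1, 0], "GENE_B": [0, 0, 1, 2]},
    index=["LINE_2", "LINE_3", "LINE_4", "LINE_5"],
)

agreement = zero_call_agreement(old, new)
print(agreement)
```

With the real files, `old` and `new` would come from `pd.read_csv(..., index_col=0)`, assuming the first column holds the cell line identifier in both releases.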
Where does this difference come from, why is it so large for CYP2D6 and TP53, and which version should I believe?
Thanks in advance for your help.