Large difference in damaging mutation matrix between 22Q2 and 23Q2 for CYP2D6 and TP53

Hello! I ran an analysis of mutation data based on the 22Q2 release, which I’m now updating to use 23Q2 data. I noticed a big difference in damaging mutation calls between the “CCLE_mutations_bool_damaging.csv” (22Q2) and “OmicsSomaticMutationsMatrixDamaging.csv” (23Q2) files for some genes, most notably CYP2D6 and TP53. I filtered both mutation matrices to the cell lines they have in common and compared only the “0” (no damaging mutation) entries, because 22Q2 contains 0/1 but 23Q2 contains 0/1/2. I found only 29% agreement for CYP2D6 and only 42% for TP53, with smaller differences for other genes (MUC12, KIR2DL1, TTN, FCGBP, CDKN2A, LAMA1 and ARID1A all show less than 90% agreement).
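In case it helps anyone reproduce this, here is a minimal sketch of the kind of comparison described above. It is plain Python on toy data, not the real matrices. The function name, the cell line IDs and the exact definition of "agreement" (both releases agree on whether the entry is 0) are assumptions made for illustration.

```python
def zero_agreement(old, new, gene):
    """Fraction of shared cell lines where both releases agree on
    whether the gene has no damaging mutation (entry == 0).

    old, new: dict mapping cell line ID -> {gene: value}.
    old holds 0/1 values (22Q2 style), new holds 0/1/2 (23Q2 style).
    """
    common = sorted(set(old) & set(new))
    if not common:
        raise ValueError("no cell lines in common")
    # Binarise both sides: any non-zero value counts as "damaged".
    matches = sum((old[c][gene] > 0) == (new[c][gene] > 0) for c in common)
    return matches / len(common)


# Toy data with made-up IDs, not real DepMap values.
old = {"ACH-1": {"TP53": 0}, "ACH-2": {"TP53": 1},
       "ACH-3": {"TP53": 0}, "ACH-4": {"TP53": 0}}
new = {"ACH-1": {"TP53": 2}, "ACH-2": {"TP53": 1},
       "ACH-3": {"TP53": 0}, "ACH-5": {"TP53": 1}}

# ACH-4 and ACH-5 are dropped because they appear in only one release.
agreement = zero_agreement(old, new, "TP53")
print(f"{agreement:.2f}")
```

With real data the two dictionaries would be built from the two CSV files after mapping their column names to a shared gene identifier.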

Where does this difference come from, why is it so large for CYP2D6 and TP53, and which version should I believe?
Thanks in advance for your help.

Since I didn’t get a reply here (:frowning_face:), I looked more into this issue myself - specifically regarding TP53. To summarise the differences: In 22Q2, 21% of 1770 cell lines were considered to have damaging TP53 mutations, compared to 49% of 1738 cell lines in 23Q2 (41% homozygous, 8% heterozygous). Most (76%) of the cell lines annotated as “TP53 damaged” in 22Q2 were also annotated as such in 23Q2, but a large fraction (41%) of the cell lines without the annotation in 22Q2 gained it in 23Q2.
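The "retained" and "gained" percentages above can be computed along these lines. This is a sketch on toy data. The function name and inputs are hypothetical, and it assumes both releases have already been reduced to one status value per cell line for a single gene.

```python
def transition_fractions(old_status, new_status):
    """Return (retained, gained) over the cell lines present in both releases.

    retained: share of lines damaged in the old release that are still damaged.
    gained:   share of lines not damaged in the old release that are damaged now.
    A fraction is None if its denominator would be zero.
    """
    common = set(old_status) & set(new_status)
    was_damaged = [c for c in common if old_status[c] > 0]
    was_clean = [c for c in common if old_status[c] == 0]

    def share_damaged_now(lines):
        if not lines:
            return None
        return sum(new_status[c] > 0 for c in lines) / len(lines)

    return share_damaged_now(was_damaged), share_damaged_now(was_clean)


# Toy statuses for one gene (made-up IDs): old is 0/1, new is 0/1/2.
old_status = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0, "F": 1}
new_status = {"A": 2, "B": 0, "C": 1, "D": 0, "E": 0, "F": 1}

retained, gained = transition_fractions(old_status, new_status)
print(retained, gained)
```

As a consistency check on the TP53 figures themselves: 0.21 × 0.76 + 0.79 × 0.41 ≈ 0.48, which is close to the 49% reported for 23Q2 (the two releases do not cover exactly the same cell lines, so an exact match is not expected).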

The short explanation is that the DepMap mutation (calling and annotation) pipeline changed between the versions. There is a document here that describes the changes:

The main difference is what counts as a “damaging mutation”. In 22Q2, any variant classified as “nonsense mutation”, “frame-shift insertion”/“… deletion” or “splice site” was considered “damaging” – and any cell line with one or more such variants in a gene was annotated accordingly.
In 23Q2 (and 22Q4, when the change happened) this still applies (except for splice sites), but there is an additional criterion of “likely loss-of-function (LoF)” that can cause a variant to be considered “damaging”. This explains why the fraction of “TP53 damaged” cell lines increased overall. For TP53, many missense mutations that were not considered “damaging” in 22Q2 are now considered “likely LoF” and thus “damaging”.
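To make the difference concrete, here is how I understand the two rules, written as a small sketch. The classification labels and the `likely_lof` field are placeholders I made up for illustration. They are not the actual column names or values in the DepMap files, and this is not the pipeline's own code.

```python
# Hypothetical labels; the real values in the DepMap files may differ.
TRUNCATING = {"nonsense", "frame_shift_ins", "frame_shift_del"}
SPLICE = {"splice_site"}


def damaging_22q2(variant):
    """22Q2 reading: damaging if truncating or splice site."""
    return variant["classification"] in TRUNCATING | SPLICE


def damaging_23q2(variant):
    """23Q2 reading: truncating (splice sites no longer count on their own),
    or flagged as likely loss-of-function."""
    return (variant["classification"] in TRUNCATING
            or variant.get("likely_lof", False))


def cell_line_damaged(variants, rule):
    """A gene counts as damaged in a cell line if any variant passes the rule."""
    return any(rule(v) for v in variants)


# Toy variants: a missense flagged likely LoF, and an unflagged splice site.
missense = {"classification": "missense", "likely_lof": True}
splice = {"classification": "splice_site", "likely_lof": False}

print(damaging_22q2(missense), damaging_23q2(missense))  # False True
print(damaging_22q2(splice), damaging_23q2(splice))      # True False
```

The missense case is the one that matters for TP53: under this reading it flips from not damaging to damaging between the releases.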

Unfortunately I haven’t found much information about the “likely LoF” annotation. The document linked above just says: “likely to be LoF driver in a tumor supressor gene or dann_score above .96” (description of the “LikelyLoF” column on page 11).

I’m still very interested in why the changes in the pipeline made such a big difference specifically for CYP2D6 and TP53. It would also be great to get more information about the “likely to be LoF driver in a tumor supressor gene” part of the “LikelyLoF” annotation (which might answer the previous question).

I am also interested in this change. For 23Q4, the mutation calling workflow is described here: 23Q4 mutation pipeline documentation (storage.googleapis.com)

Damaging mutations are determined by LikelyLOF, which is defined as follows:
The variant’s mutation effect is “Likely Loss-of-function” or “Loss-of-function” in OncoKB and its VEP impact is “HIGH”
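Written as code, my reading of that 23Q4 definition is the following sketch. The function and argument names are made up, and it assumes the two conditions are combined with a plain AND, exactly as the sentence above is worded.

```python
LOF_EFFECTS = {"Likely Loss-of-function", "Loss-of-function"}


def likely_lof_23q4(oncokb_effect, vep_impact):
    """True if OncoKB calls the variant (likely) loss-of-function
    AND VEP rates its impact as HIGH."""
    return oncokb_effect in LOF_EFFECTS and vep_impact == "HIGH"


print(likely_lof_23q4("Loss-of-function", "HIGH"))             # True
print(likely_lof_23q4("Likely Loss-of-function", "MODERATE"))  # False
print(likely_lof_23q4("Unknown", "HIGH"))                      # False
```

If this reading is right, a variant that OncoKB calls loss-of-function would still fail the test whenever VEP rates it below HIGH. VEP normally rates missense variants as MODERATE, so this would differ from the 23Q2 behaviour described above.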

Any comments from the Broad team will be super helpful!