I noticed a discrepancy between damaging_mutations.csv (downloaded from “Downloads/Custom Downloads” page) and OmicsSomaticMutationsMatrixDamaging.csv (downloaded from “Downloads/All Data” page). For example, damaging_mutations.csv included 17514 genes and OmicsSomaticMutationsMatrixDamaging.csv included 18748 genes.
Are these two files generated by the same pipeline? If not, where does the discrepancy come from? And which file is recommended for analyzing damaging mutations across cell lines?
The damaging_mutations.csv file from Custom Downloads is an export of a table in the DepMap portal’s database, and that table was loaded from OmicsSomaticMutationsMatrixDamaging.csv.
However, a known issue is that the DepMap portal uses a definition of protein-coding genes that differs from the one the mutation calling pipeline uses. Any genes (NCBI gene IDs) that are not in the DepMap portal’s database do not have their data loaded, so they will not be present in the export. Hence, it’s not surprising that the number of genes exported is smaller than the number in the original file (OmicsSomaticMutationsMatrixDamaging.csv).
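If you want to see exactly which genes were dropped, you can compare the column headers of the two files. Below is a minimal sketch, not an official DepMap tool. It assumes that both files have one column per gene, that the first column holds the cell line identifier, and that gene columns are labeled like "SYMBOL (NCBI gene ID)", for example "A1BG (1)". The function names are made up for illustration, so check the assumptions against your own downloads.

```python
import csv
import re

# Assumed column label format: "SYMBOL (NCBI_GENE_ID)", e.g. "A1BG (1)".
GENE_LABEL = re.compile(r"^(?P<symbol>.+?)\s*\((?P<entrez>\d+)\)$")


def gene_key(label):
    """Return the NCBI gene ID if the label contains one, else the raw label."""
    match = GENE_LABEL.match(label.strip())
    return match.group("entrez") if match else label.strip()


def read_gene_columns(path):
    """Read only the header row of a CSV and return the set of gene keys."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    # Assumption: the first column is the cell line identifier, not a gene.
    return {gene_key(label) for label in header[1:]}


def compare_gene_sets(all_data_path, custom_path):
    """Compare gene columns between the All Data file and the Custom Downloads export."""
    all_data = read_gene_columns(all_data_path)
    custom = read_gene_columns(custom_path)
    return {
        "only_in_all_data": all_data - custom,
        "only_in_custom": custom - all_data,
        "shared": all_data & custom,
    }
```

Calling `compare_gene_sets("OmicsSomaticMutationsMatrixDamaging.csv", "damaging_mutations.csv")` and looking at `only_in_all_data` should list the gene IDs that the portal did not load, if the explanation above accounts for the whole difference. Matching on the NCBI gene ID rather than the symbol avoids false mismatches when a symbol has been renamed.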
We’re trying to incrementally move all of our pipelines to a single standard list of gene identifiers, but it has been a gradual process. I think I’ve heard that in the next release OmicsSomaticMutationsMatrixDamaging.csv will use the same set of gene IDs that the portal uses, so the discrepancy should go away in the future.