Hi. I was trying to calculate the frequency of mutated genes from the CCLE_mutations.csv file of the 22Q1 release, and then rank genes by frequency.
But I was really confused by the data. Each row of the dataset seems to describe a specific mutation site, and I don't know the right way to combine those rows into a per-gene number.
I wonder if someone could explain a little bit about it.
Or are there other databases that could provide such information?
The data is aggregated over all available sequencing types for a given sample. Some samples have more sequencing types than others. (Also, note that we are only releasing somatic coding mutations.)
So a simple method would be to just take any available mutations in our dataset, regardless of the sequencing type.
Whatever you do, your analysis will be biased by the fact that different samples have different sets of sequencing types, each covering a specific set of genes more or less well.
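To make the "take any available mutation" idea concrete, here is a minimal sketch that counts, for each gene, the fraction of cell lines with at least one row in the file. It assumes the gene and sample columns are named `Hugo_Symbol` and `DepMap_ID`; please check the header of your copy of the file, since I am writing these from memory. The function name is made up for illustration. Note that the denominator is only the set of cell lines that appear in the rows you pass in, so lines with no reported mutation at all are not counted.

```python
from collections import defaultdict

def gene_mutation_frequency(rows, gene_col="Hugo_Symbol", sample_col="DepMap_ID"):
    """Rank genes by the fraction of cell lines with at least one mutation row.

    `rows` is any iterable of dicts, e.g. csv.DictReader over CCLE_mutations.csv.
    The column names are assumptions; adjust them to match the file header.
    """
    samples_by_gene = defaultdict(set)
    all_samples = set()
    for row in rows:
        sample = row[sample_col]
        all_samples.add(sample)
        # A set collapses several mutation sites in the same gene and the
        # same cell line into a single "this line is mutated" observation.
        samples_by_gene[row[gene_col]].add(sample)
    n_samples = len(all_samples)
    freq = {gene: len(s) / n_samples for gene, s in samples_by_gene.items()}
    # Highest frequency first; ties broken alphabetically by gene name.
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))

# Possible usage (not run here):
#   import csv
#   with open("CCLE_mutations.csv", newline="") as fh:
#       ranking = gene_mutation_frequency(csv.DictReader(fh))
```

Counting cell lines rather than rows keeps a gene with many mutation sites in one heavily mutated line from dominating the ranking.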
Thank you so much for your reply. That really helps.
I also have a question about ALT:REF. If I understand correctly, it gives the number of reads supporting the mutant (ALT) allele versus the number supporting the normal/reference (REF) allele, right?
So when I calculate the mutation frequency of a gene, should I sum the ALT and REF counts over all entries of that gene, or should I just count the number of mutation entries for the gene? Which way is more reasonable and unbiased?
And when REF=0, does that really mean no REF allele was found at that position in sequencing?
Yes, this is right. And REF=0 really means that no reads supporting the reference allele were found at that mutation site.
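For reference, a small sketch of how an ALT:REF entry could be turned into an allele fraction, ALT / (ALT + REF). The helper names are made up for illustration, and the sketch assumes the value is a plain string such as "30:10"; empty or malformed values come back as None rather than a guess.

```python
def parse_alt_ref(value):
    """Split an 'ALT:REF' string into integer read counts, or return None."""
    if not isinstance(value, str) or ":" not in value:
        return None
    alt_text, ref_text = value.split(":", 1)
    try:
        return int(alt_text), int(ref_text)
    except ValueError:
        return None

def allele_fraction(value):
    """Fraction of reads supporting the ALT allele, or None if unavailable."""
    counts = parse_alt_ref(value)
    if counts is None:
        return None
    alt, ref = counts
    total = alt + ref
    # REF = 0 with ALT > 0 gives 1.0: every read at the site carries the mutation.
    return alt / total if total > 0 else None
```

With this, "30:10" gives 0.75 and "12:0" gives 1.0.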
I think the two metrics represent different things, and which one to use depends on your underlying question and the point you want to make. But from what you are saying, I would be inclined to compute mutation frequency by counting all mutations that have a high enough allele frequency.
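One possible reading of that suggestion, as a rough sketch: keep only rows whose allele fraction ALT / (ALT + REF) reaches some cutoff, then count the cell lines per gene. Everything specific here is an assumption on my side: the 0.2 cutoff is arbitrary, `CGA_WES_AC` is just one of the ALT:REF columns I believe the file has (a row may carry counts from several sequencing types), and the function name is hypothetical.

```python
from collections import defaultdict

def filtered_gene_frequency(rows, min_af=0.2, ac_col="CGA_WES_AC",
                            gene_col="Hugo_Symbol", sample_col="DepMap_ID"):
    """Rank genes by the fraction of cell lines with a mutation at or above min_af.

    Column names and the default cutoff are assumptions, not DepMap guidance.
    """
    samples_by_gene = defaultdict(set)
    all_samples = set()
    for row in rows:
        all_samples.add(row[sample_col])
        # Expect 'ALT:REF'; rows with a missing or malformed value are skipped.
        alt_text, _, ref_text = (row.get(ac_col) or "").partition(":")
        if not (alt_text.isdigit() and ref_text.isdigit()):
            continue
        alt, ref = int(alt_text), int(ref_text)
        if alt + ref == 0:
            continue
        if alt / (alt + ref) >= min_af:
            samples_by_gene[row[gene_col]].add(row[sample_col])
    n_samples = len(all_samples)
    ranked = ((gene, len(s) / n_samples) for gene, s in samples_by_gene.items())
    return sorted(ranked, key=lambda kv: (-kv[1], kv[0]))
```

Skipping rows without counts in the chosen column means mutations seen only by another sequencing type are ignored, which is one more source of the coverage bias mentioned earlier in the thread.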