Updated gene names in 21Q4 mutations data

Dear DepMap team,

Greetings. I hope this problem can be resolved soon.
A similar problem was raised in the topic "Number of genes mutated in cell lines", and I already asked about it there, but I am opening a new topic to get an answer.

I found a mismatch in the number of genes between the dataset (19,536) and the main download page (18,784).
When I manually compared the data, I found genes with Entrez_ID = 0.
These genes have since been updated to different HGNC IDs.
So I tried to convert them to the latest IDs as below:
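
One way such a conversion could look in R is sketched here. The `mutations` and `symbol_map` objects are toy stand-ins, the column names `Hugo_Symbol` and `Entrez_Gene_Id` are assumed to match the mutation file, and the only mapping taken from this post is DARC to ACKR1; a real lookup table would have to come from HGNC.

```r
# Toy mutation table; column names are assumptions about the real file.
mutations <- data.frame(
  Hugo_Symbol    = c("DARC", "TP53", "DARC"),
  Entrez_Gene_Id = c(0L, 7157L, 0L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from outdated symbols to current ones.
symbol_map <- c(DARC = "ACKR1")

# Replace a symbol only when the lookup contains it; keep all others unchanged.
hit <- mutations$Hugo_Symbol %in% names(symbol_map)
mutations$Updated_Symbol <- mutations$Hugo_Symbol
mutations$Updated_Symbol[hit] <- unname(symbol_map[mutations$Hugo_Symbol[hit]])

print(mutations)
```

Keeping the original symbol in its own column makes it easy to trace each renamed gene back to the source data.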

But I realized that the gene locations had also been updated.
As you can see, the start position of DARC in the dataset is 159176106, but the current location of DARC is 159204875…159206500.
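
The disagreement can be checked with a few lines of R. This is only a sketch using the three numbers quoted in this post; the variable names are made up for illustration. Whether the two coordinates refer to the same genome assembly is not stated here, so the check only shows that the numbers disagree.

```r
# Values quoted in this post.
mutation_start <- 159176106  # start position recorded for DARC in the dataset
gene_start     <- 159204875  # current start of DARC/ACKR1
gene_end       <- 159206500  # current end of DARC/ACKR1

# TRUE only if the recorded position lies inside the current gene interval.
inside <- mutation_start >= gene_start & mutation_start <= gene_end
print(inside)  # FALSE

# Distance from the recorded position to the current gene start.
offset <- gene_start - mutation_start
print(offset)  # 28769
```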

As a result, when the original data are compared against the updated location, ACKR1 (= DARC) appears not to be mutated in any cancer cell line.

In summary: is it okay to simply update the gene symbol and ID without taking the changed location into account?

Thank you for reading this post.

I resolved this problem.
Anyone facing this problem (i.e. a mismatch in the number of genes between entrez_ID and Hugo_ID) can handle it by excluding the genes whose entrez_ID = 0.

Here is the code that I ran:
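
A minimal sketch of this filtering step in R follows. It is not the exact code referred to above: the data frame is a toy example, `OLD_SYMBOL_A` is a made-up placeholder, and the column names `Hugo_Symbol` and `Entrez_Gene_Id` are assumed to match the mutation file.

```r
# Toy stand-in for the mutation table; column names are assumptions.
mutations <- data.frame(
  Hugo_Symbol    = c("DARC", "TP53", "KRAS", "OLD_SYMBOL_A", "TP53"),
  Entrez_Gene_Id = c(0L, 7157L, 3845L, 0L, 7157L),
  stringsAsFactors = FALSE
)

# Counts before filtering: every outdated symbol adds a gene name,
# but all of them share the single ID 0.
n_symbols_all <- length(unique(mutations$Hugo_Symbol))
n_ids_all     <- length(unique(mutations$Entrez_Gene_Id))

# Drop rows whose Entrez ID is 0, then count again.
kept <- mutations[mutations$Entrez_Gene_Id != 0, ]
n_symbols_kept <- length(unique(kept$Hugo_Symbol))
n_ids_kept     <- length(unique(kept$Entrez_Gene_Id))

cat("symbols:", n_symbols_all, "->", n_symbols_kept, "\n")
cat("ids:    ", n_ids_all, "->", n_ids_kept, "\n")
```

In the toy data the symbol and ID counts disagree before filtering and agree afterwards, which is the behaviour described for the real dataset.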


Also, I suggest correcting the number of genes reported for the mutation dataset.
As you can see, there are 18,783 genes in the dataset once Entrez gene ID = 0 is excluded.
Because the count obtained with unique() in R includes ID 0 as if it were a gene, that value should be excluded.
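
The off-by-one effect can be illustrated with a toy vector of IDs; the values are illustrative and not taken from the dataset. Since every row lacking a valid ID carries the same value 0, `unique()` counts it exactly once, which is consistent with the one-gene gap between 18,784 and 18,783 mentioned in this post.

```r
# Toy vector of Entrez IDs; 0 marks genes without a valid ID.
entrez_ids <- c(0L, 7157L, 3845L, 0L, 7157L, 672L)

n_with_zero    <- length(unique(entrez_ids))                   # counts 0 as a gene
n_without_zero <- length(unique(entrez_ids[entrez_ids != 0]))  # ignores 0

print(n_with_zero - n_without_zero)  # 1
```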