I use the Ensembl IDs when analyzing the expression datasets (specifically OmicsExpressionGenesExpectedCountProfile.csv) so that I can merge them with my own RNA-seq datasets, which were aligned and quantified with STAR + RSEM. In the latest DepMap 22Q4 release I have noticed that a number of the Ensembl IDs are retired (when I search for them on the Ensembl website, they are listed as retired and no longer present in the new builds). Examples are SOD2 (ENSG00000112096), whose new ID is ENSG00000291237, and HOMEZ (ENSG00000215271), whose new ID is ENSG00000290292.
Is there a reason I am seeing this? I assumed since the new omics data were realigned with the latest STAR+RSEM versions that the alignment builds would be updated as well.
I did a little more digging: most of the unmatched IDs are novel proteins and snoRNA/lncRNA/etc. Only a handful are retired Ensembl IDs like the ones I mentioned before, so it doesn’t seem to be a huge issue.
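For anyone who wants to run the same check, here is a minimal sketch of how the two ID lists could be compared. It assumes both lists are plain Ensembl gene IDs and that one side may carry a GENCODE-style version suffix (for example `.18`), which is stripped before comparing. The function names and the example version numbers are made up for illustration and are not part of any DepMap tooling.

```python
import re

# Matches a trailing version suffix such as ".18" on an Ensembl gene ID.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ensembl_id):
    """Return the Ensembl ID without a trailing '.<number>' version suffix."""
    return _VERSION_SUFFIX.sub("", ensembl_id.strip())


def compare_gene_ids(depmap_ids, own_ids):
    """Split two collections of Ensembl IDs into shared and one-sided sets.

    Both inputs are normalized with strip_version() first, so
    'ENSG00000141510.18' and 'ENSG00000141510' count as the same gene.
    """
    depmap = {strip_version(g) for g in depmap_ids}
    own = {strip_version(g) for g in own_ids}
    return {
        "shared": sorted(depmap & own),
        "depmap_only": sorted(depmap - own),  # e.g. retired IDs
        "own_only": sorted(own - depmap),     # e.g. IDs new in a later build
    }
```

Calling `compare_gene_ids(depmap_columns, my_gene_ids)` and inspecting `depmap_only` gives the list of IDs to look up by hand. Retired IDs such as ENSG00000112096 would show up there, with their replacements in `own_only`.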
We are currently using indices generated with GENCODE v38 for STAR and RSEM, and the November 2020 version of BioMart to map Ensembl IDs to gene names. Neither is the most up-to-date version at the moment, which should be why you see unmatched novel proteins and retired IDs in our datasets. We hope to update them in the future to minimize the number of mismatches.
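Until the indices are updated, one possible workaround on the user side is to rename the retired IDs before merging. The sketch below is only an illustration: the lookup table contains just the two examples from this thread, the function name is hypothetical, and a complete table would have to be built separately from Ensembl's ID history.

```python
# Retired ID -> current ID, taken from the two examples in this thread.
# A real table would need to be assembled from Ensembl's ID history.
RETIRED_TO_CURRENT = {
    "ENSG00000112096": "ENSG00000291237",  # SOD2
    "ENSG00000215271": "ENSG00000290292",  # HOMEZ
}


def remap_retired_ids(counts_by_gene, id_map=RETIRED_TO_CURRENT):
    """Return a copy of counts_by_gene with retired Ensembl IDs renamed.

    IDs that are not in id_map are kept unchanged. If renaming would make
    two entries share one ID, a ValueError is raised instead of silently
    overwriting a count.
    """
    remapped = {}
    for gene_id, count in counts_by_gene.items():
        new_id = id_map.get(gene_id, gene_id)
        if new_id in remapped:
            raise ValueError(f"duplicate gene ID after remapping: {new_id}")
        remapped[new_id] = count
    return remapped
```

Raising on collisions is a deliberate choice here: if a dataset already contains both the retired and the current ID, deciding whether to sum or drop the counts should be done explicitly rather than by accident.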