I would like to find the CCLE RNA expression file that contains either effective gene sizes or FPKM/RPKM values (computed from RSEM expected counts), so that we can do our own normalization of CCLE gene expression. I don’t like the way the protein-coding TPM files have been generated, which appears to be by taking the larger TPM files for 53,000+ analytes and simply extracting the values as is for the subset of protein-coding genes. The RSEM expected counts should first be filtered to protein-coding genes only, and TPM should then be recalculated on that subset. This gives a different result, in which the protein-coding TPMs of each sample add up to the same value of 1 million. To me it looks like this may not have been done properly, so I would like to perform my own normalization using only protein-coding genes. I can see a gene count file and an RPKM file under CCLE 2019, but the gene counts are not RSEM expected counts, and it is unclear whether RPKM was calculated with effective or constant gene sizes, and whether it was derived from RSEM output or from the gene counts file.
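To make the recalculation described above concrete, here is a minimal sketch for a single sample. It assumes per-gene RSEM expected counts and effective lengths are available; the function name `tpm`, the gene names, and all numbers are made up for illustration and are not taken from any DepMap file.

```python
def tpm(expected_counts, effective_lengths, genes=None):
    """Compute TPM for one sample, restricted to `genes` if given.

    Hypothetical helper for illustration; inputs are dicts keyed by gene name.
    """
    if genes is None:
        genes = list(expected_counts)
    # Reads per base; genes with non-positive effective length contribute zero.
    rates = {}
    for g in genes:
        length = effective_lengths[g]
        rates[g] = expected_counts[g] / length if length > 0 else 0.0
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in rates}
    # Scale so the selected genes sum to one million.
    return {g: r / total * 1e6 for g, r in rates.items()}


# Toy, made-up numbers for a single sample.
counts = {"GENE_A": 500.0, "GENE_B": 1500.0, "LNC_X": 2000.0}
lengths = {"GENE_A": 1000.0, "GENE_B": 1500.0, "LNC_X": 500.0}
protein_coding = ["GENE_A", "GENE_B"]

all_tpm = tpm(counts, lengths)                 # normalized over every analyte
pc_tpm = tpm(counts, lengths, protein_coding)  # normalized over the subset only
```

In this toy case `pc_tpm` sums to one million, while the protein-coding values picked out of `all_tpm` sum to less than that, which is the difference I am describing.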
Thank you for your feedback!
We haven’t released the effective gene sizes (or FPKM) from the RSEM output to the DepMap portal yet. We plan to provide these outputs as two cell line by gene tables in the next release.
For our latest release (23Q2), we provide expected read counts for both genes and transcripts (OmicsExpressionGenesExpectedCountProfile.csv, OmicsExpressionTranscriptsExpectedCountProfile.csv) as well as TPM values (OmicsExpressionTranscriptsTPMLogp1Profile.csv), which are normalized using the effective gene sizes.
Thank you, Alvin, for your reply and clarification. I wonder if you could also address how the TPM values are normalized. It seems to me that all RNA species (~50,000), rather than just the protein-coding genes, were used in that normalization, because that is the only way the sums over the file come out to 100%. In the absence of gene sizes, it is nearly impossible to normalize the protein-coding genes from the released expected count file. I have seen lots of posts on Biostars and here from people who want to do their own normalization, and given this caveat with the TPM normalization, I think releasing the needed data should be a very high priority. This is an awesome database that thousands of scientists use frequently. What are your thoughts on expediting this release, or at least making the gene sizes available somewhere else in the meantime? We just need the effective gene sizes. Thank you.
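For reference, the sum check I describe can be sketched roughly as below. It assumes the released values are log2(TPM + 1), which is my reading of the "Logp1" suffix in the file names rather than something confirmed in this thread. The helper name `tpm_total_from_logp1` and the toy numbers are made up for illustration.

```python
import math


def tpm_total_from_logp1(log_values, base=2.0, pseudocount=1.0):
    """Back-transform log(TPM + pseudocount) values and sum them for one sample.

    Assumes a log2(TPM + 1) transform by default (an assumption, see above).
    """
    return sum(base ** v - pseudocount for v in log_values)


# Toy, made-up TPMs for one sample: two protein-coding genes and one other RNA.
toy_tpm = {"GENE_A": 250000.0, "GENE_B": 350000.0, "LNC_X": 400000.0}
toy_logp1 = {g: math.log2(t + 1.0) for g, t in toy_tpm.items()}
protein_coding = ["GENE_A", "GENE_B"]

# Sum over every analyte versus sum over the protein-coding subset only.
total_all = tpm_total_from_logp1(toy_logp1.values())
total_pc = tpm_total_from_logp1(toy_logp1[g] for g in protein_coding)
```

Here `total_all` comes back at roughly one million while `total_pc` is only about 600,000, which mirrors the pattern I see: the totals reach the full million only when every analyte is included.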
We release data twice a year. The next release will be around mid-September, and we plan to include those data in it. Thank you for your feedback!