Gene Expression Scatterplots

Hello,

I have a question regarding the scatterplots generated by the Data Explorer for comparing the expression of two genes using the Expression 21Q2 dataset across cell lines. The units in the scatterplot are in log2(TPM+1). However, it is not recommended to use TPM values to compare gene expression between samples.

Therefore, is it valid to use TPM values for these scatterplots?

Thanks for the help,

AaronW

Can you provide more context or point me to a source for why you say it is not recommended to use TPM values to compare gene expression between samples?

There could be a concern that I’m not aware of. We originally had expression in RPKM, which I know cannot be compared across samples, and I suspect that was part of the motivation for our switch to TPM. I believe TPM should not have the same issue, but if there is an issue with TPM, I’d be interested in learning more.
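To make the RPKM versus TPM distinction concrete, here is a minimal sketch using made-up read counts and gene lengths (the numbers and function names are illustrative assumptions, not the DepMap pipeline). It shows that TPM values sum to one million in every sample while RPKM totals differ from sample to sample, and how the log2(TPM+1) units in the scatterplots are derived.

```python
import math

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads for one sample."""
    total_millions = sum(counts) / 1e6
    return [c / total_millions / (l / 1e3) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    # Normalize by gene length first, then rescale so the sample sums to 1e6.
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

def log2_tpm_plus_one(tpm_values):
    """Transform used for the scatterplot axes: log2(TPM + 1)."""
    return [math.log2(v + 1) for v in tpm_values]

# Hypothetical toy data: three genes, two samples.
lengths_bp = [1000, 2000, 4000]
sample_a = [100, 300, 600]
sample_b = [200, 100, 50]

tpm_a, tpm_b = tpm(sample_a, lengths_bp), tpm(sample_b, lengths_bp)
rpkm_a, rpkm_b = rpkm(sample_a, lengths_bp), rpkm(sample_b, lengths_bp)

print("TPM totals: ", sum(tpm_a), sum(tpm_b))    # same total in both samples
print("RPKM totals:", sum(rpkm_a), sum(rpkm_b))  # totals differ between samples
print("log2(TPM+1):", log2_tpm_plus_one(tpm_a))
```

Sharing a fixed total makes TPM values proportions within each sample; it does not by itself remove differences caused by sequencing protocols.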

Thanks,
Phil

Hi Phil,

Here is a paper outlining the reasons why TPM should not be used (or should be used only with caution) when comparing across samples or tissue types.

Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols (nih.gov)

Unfortunately I am not an expert in this area. However, I saw this paper and several posts on bioinformatics websites saying TPM should not be used across samples. Hoping you can provide more information on how to appropriately use these scatter plots!

Thanks again for all the help,

AaronW

Reading through it, it seems that they are largely highlighting the issues one would encounter when combining public data generated by different projects. Different data sources will likely have biases stemming from differences in protocols, and I think that definitely makes sense and is a concern.

I believe the DepMap RNA data should be less prone to biases from different protocols and processing because we are using the same protocol for all the mRNA data that we generate, and processing all RNA data with the same pipeline.

That being said, the mRNA data has been generated over the span of many years, and protocols have definitely changed over that time, so there is some risk of bias from those changes.

I can circle back to folks here to ask whether they’re aware of biases introduced and whether we’re doing any form of batch correction for different protocols.

But independent of that, I think there’s still evidence that the TPM values across samples are meaningful because we often see them correlate with dependency profiles.

Given that we see this alignment across two orthogonal datasets, I have some confidence that the RNA profiles across samples are capturing real biological signal.
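As a rough sketch of the kind of check I mean, the snippet below correlates one gene's expression with a dependency score across cell lines. The cell line names and all values are invented for illustration, and `pearson` is a small helper written here, not a DepMap function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: log2(TPM+1) expression and a dependency score per cell line.
expression = {"LINE_A": 0.5, "LINE_B": 2.1, "LINE_C": 4.0, "LINE_D": 6.3}
dependency = {"LINE_A": 0.05, "LINE_B": -0.20, "LINE_C": -0.60, "LINE_D": -1.10}

# Only compare cell lines that are present in both datasets.
shared = sorted(set(expression) & set(dependency))
r = pearson([expression[c] for c in shared], [dependency[c] for c in shared])
print(f"Pearson r across {len(shared)} cell lines: {r:.3f}")
```

In this toy data, higher expression goes with a more negative dependency score, so r comes out strongly negative. With real data the interesting question is whether such relationships show up consistently for genes where one would expect them.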

Thanks,
Phil
