Comparison between raw counts and log TPM values

Hello there,

I first downloaded the log TPM values and plotted them for a specific gene, and found that it is expressed. Then, because I need to use a different normalization method, I downloaded the raw counts. When I checked the raw counts for this exact same gene in the same cell line, the value was zero.

How can this be explained?

The files I downloaded are OmicsExpressionTPMLogp1HumanAllGenes.csv and OmicsExpressionRawReadCountHumanAllGenesStranded.csv.
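
For reference, this is roughly the kind of lookup I am doing (a minimal sketch; the gene column name and ModelID below are placeholders, not the actual ones I checked, and it assumes the usual DepMap layout of ModelID rows and "SYMBOL (EntrezID)" columns):

```python
import pandas as pd

# Placeholder identifiers; substitute the actual gene column and cell line of interest.
GENE = "TP53 (7157)"      # assumed "SYMBOL (EntrezID)" column naming
MODEL_ID = "ACH-000001"   # assumed DepMap ModelID used as the row index

# Load both DepMap expression matrices (rows = cell lines, columns = genes).
log_tpm = pd.read_csv("OmicsExpressionTPMLogp1HumanAllGenes.csv", index_col=0)
raw_counts = pd.read_csv("OmicsExpressionRawReadCountHumanAllGenesStranded.csv", index_col=0)

# Compare the two values for the same gene and cell line.
print("log TPM  :", log_tpm.loc[MODEL_ID, GENE])
print("raw count:", raw_counts.loc[MODEL_ID, GENE])
```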

Thank you in advance.

In my opinion, this is because the TPM expression data was calculated using Salmon 1.10.0 (please refer to the 25Q2 release notes), which employs a probabilistic model. However, since this explanation is based on the latest version of the DepMap data, it would help to clarify which release of the dataset you are using.

Thank you for your reply.

I am using the current release data (25Q2).
Do you mean the TPM values are calculated from raw data generated with the Salmon pipeline?

I checked the release notes for the current and previous releases, but I could not find a clear explanation of how the TPM values were generated. The current release notes only state how the read counts were generated, namely with STAR. Do you think the TPM values are generated from Salmon quantification?

Maybe, yes. It would be better to check the original paper.

Hi,

As of 25Q2, our TPM values are generated using Salmon. The discrepancy you are seeing is likely because Salmon allocates ambiguous (multi-mapping) reads using an EM algorithm, while the STAR raw counts ignore them, since they only include uniquely mapping reads.
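
To give a sense of the idea, here is a toy sketch (not Salmon's actual implementation, just a minimal EM over made-up read compatibilities, assuming equal effective lengths): a gene whose reads all multi-map to a paralog has zero uniquely mapping reads, yet an EM quantifier still distributes those ambiguous reads and assigns it a nonzero abundance.

```python
from collections import defaultdict

# Toy read-to-transcript compatibilities: 30 reads map ambiguously to
# GENE_X and its paralog; 50 reads map uniquely to an unrelated transcript.
reads = [["GENE_X", "PARALOG_Y"]] * 30 + [["OTHER"]] * 50
transcripts = ["GENE_X", "PARALOG_Y", "OTHER"]

# Unique-only counting (roughly what a uniquely-mapping-read gene count does):
unique = {t: sum(1 for r in reads if r == [t]) for t in transcripts}
print("unique counts:", unique)           # GENE_X gets 0

# EM: start from a uniform abundance estimate and iterate.
theta = {t: 1.0 / len(transcripts) for t in transcripts}
for _ in range(50):
    counts = defaultdict(float)
    for r in reads:
        z = sum(theta[t] for t in r)
        for t in r:                        # E-step: split each read by current theta
            counts[t] += theta[t] / z
    total = sum(counts.values())
    theta = {t: counts[t] / total for t in transcripts}   # M-step

em_counts = {t: round(theta[t] * len(reads), 1) for t in transcripts}
print("EM-allocated counts:", em_counts)   # GENE_X gets ~15
```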

We are happy to look into this further if you’d be willing to share some of the genes that show this kind of behavior.

Thanks!
Simone

Thank you for your reply!

No need; I just wanted to clarify the discrepancy between the two datasets.