Which one of the expression data is proper for machine learning?

SongSong · June 25, 2024, 6:41am

Hi, thank you for your effort in launching the newest dataset version, 2024Q2.
But I’m a little confused about the two kinds of gene expression data: original and batch-normalized.
If I want to use the gene expression data to train a machine learning model, which one should I use?
I noticed that the original data contains some 0 values, while the batch-normalized data contains no 0 values but does contain negative values.
Can anyone recommend which dataset to use?

simz · June 25, 2024, 3:38pm

Hi,

I’d recommend using the batch-corrected data, as we are planning to phase out the non-batch-corrected data in the future. For details, please see the PDF attached at the end of the 24Q2 release announcement.

Thanks,
Simone

SongSong · September 3, 2024, 5:55am

Dear Simone,

Hi, thank you for your kind reply.
Although I already selected your answer as the solution to this question, I have one more question about the dataset, this time about copy number rather than expression.
To my knowledge, copy number (CN) values in previous versions were provided log2-transformed.
But as of 24Q2, the relative copy number matrix is no longer log2-transformed.
I want to use the CN data together with the gene expression data, which are log2-transformed TPM values, as machine learning input. Should I use PortalOmicsCNGeneLog2.csv to keep the value ranges consistent, or OmicsCNGene.csv to keep the original data?

Thank you for reading this question.
I look forward to your reply.

Sincerely,
Songyeon

simz · September 3, 2024, 3:19pm

Hi Songyeon,

PortalOmicsCNGeneLog2 is the log2-transformed version of OmicsCNGene. They are essentially the same data, so as far as I know, you may use either depending on what you need for your particular analysis.
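
As a rough illustration of why the two files carry the same information, here is a minimal sketch of a log2 transform and its inverse. It assumes the transform has the form log2(x + pseudocount) with a pseudocount of 1; the exact formula is not stated in this thread, so please check the release notes before relying on it. The function names and example values are made up for illustration and are not taken from the DepMap files.

```python
import math

# ASSUMPTION: the transform is log2(x + pseudocount) with pseudocount = 1.
# The thread does not state the exact formula; verify it in the release notes.
PSEUDOCOUNT = 1.0


def log2_transform(values, pseudocount=PSEUDOCOUNT):
    """Map relative copy number values to log2(value + pseudocount)."""
    return [math.log2(v + pseudocount) for v in values]


def inverse_log2_transform(values, pseudocount=PSEUDOCOUNT):
    """Recover the untransformed values from log2(value + pseudocount)."""
    return [2.0 ** v - pseudocount for v in values]


# Hypothetical relative copy number values for one gene across four models.
relative_cn = [0.0, 0.5, 1.0, 2.0]

log2_cn = log2_transform(relative_cn)
recovered = inverse_log2_transform(log2_cn)

print(log2_cn)
print(recovered)
```

Because the transform is invertible, either file can be converted to the other under this assumption. The choice mostly affects the value range and distribution your model sees, so it is worth checking which scale works better for your particular model.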

Best,
Simone

SongSong · September 4, 2024, 2:10am

Thank you for your kind reply.
I’ll think about it a little more.
