Which expression dataset is appropriate for machine learning?

Hi, thank you for your effort in releasing the newest dataset version, 2024Q2.
I'm a little confused about the two kinds of gene expression data: original and batch-normalized.
If I want to use the gene expression data to train a machine learning model, which one should I use?
I noticed that the original data contains some 0 values, whereas the batch-normalized data has no zeros but does contain negative values.
Can anyone recommend which dataset to use?

Hi,

I’d recommend using the batch-corrected data, as we are planning to phase out the non-batch-corrected data in the future. For details, please see the PDF attached at the end of the 24Q2 release announcement.

Thanks,
Simone

Dear Simone,

Thank you for your kind reply.
Although I have already marked your answer as the solution to this question, I have one more question about the dataset, this time about copy number rather than expression.
To my knowledge, copy number (CN) values in previous releases were provided log2-transformed.
As of 24Q2, however, the relative copy number matrix is no longer log2-transformed.
If I want to combine CN data with the gene expression data (log2-transformed TPM values) as machine learning input, should I use PortalOmicsCNGeneLog2.csv so that the value ranges are consistent, or OmicsCNGene.csv to keep the original data?

Thank you for reading this question.
I look forward to your reply.

Sincerely,
Songyeon

Hi Songyeon,

PortalOmicsCNGeneLog2 is the log2-transformed version of OmicsCNGene. They are essentially the same data, so as far as I know, you may use either depending on what you need for your particular analysis.
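
To illustrate why the two files carry the same information, here is a minimal sketch of a log2 transform and its inverse. It assumes the transform is log2(x + 1), i.e. with a pseudocount of 1; that pseudocount is an assumption on my part, so please confirm it against the release README. The function names and the sample values are hypothetical and only for illustration.

```python
import math

def log2_plus_one(values):
    """Apply log2(x + 1) to each relative copy-number value.

    The pseudocount of 1 is an assumption; check the release README.
    """
    return [math.log2(v + 1.0) for v in values]

def inverse_log2_plus_one(values):
    """Undo log2(x + 1), recovering the untransformed values."""
    return [2.0 ** v - 1.0 for v in values]

# Hypothetical relative copy-number values for one gene in four cell lines
relative_cn = [0.0, 0.5, 1.0, 2.0]

log2_cn = log2_plus_one(relative_cn)         # analogous to the Log2 file
recovered = inverse_log2_plus_one(log2_cn)   # back to the untransformed scale
```

Because the transform is monotonic and invertible, the ordering of values is preserved and either matrix can be converted into the other. The choice mainly affects the scale the model sees, which matters more for scale-sensitive methods than for tree-based ones.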

Best,
Simone

Thank you for your kind reply.
I’ll think about it a little more. :slight_smile: