CCLE and GDSC expression data correlation low

Hi everyone,

I am working with CCLE and GDSC data. Each of these datasets seems to have ~1000 cell lines, and when I checked, ~600 cell lines had consistent names between CCLE and GDSC, so I assumed these are commonly used cell lines that overlap between the two datasets.

However, when I measured the correlation of expression levels for the overlapping cell lines using the Pearson correlation coefficient, most cell lines had a correlation very close to zero. Of the ~600 coefficients, one per cell line, the largest was 0.08, and most were less than 0.01.

I thought I must be missing something and repeated the calculation with only a subset of genes, such as the top 2000 most variable genes and the COSMIC consensus gene set, but the result was similar: the expression levels of these cell lines still seem to be largely uncorrelated. CCLE uses RPKM and GDSC uses FPKM, but I would still expect them to be much more correlated.
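
For reference, here is a minimal sketch of the calculation described above. It assumes both matrices are pandas DataFrames with genes as rows and cell lines as columns, with unique gene identifiers drawn from the same naming scheme in both. The function name `per_cell_line_pearson` and the `top_n_variable` parameter are hypothetical, not taken from either portal. The sketch aligns genes and cell lines by label before correlating, because correlating by row position when the two files order their genes differently is one way to get coefficients near zero.

```python
import numpy as np
import pandas as pd


def per_cell_line_pearson(expr_a, expr_b, top_n_variable=None):
    """Pearson r for each shared cell line, computed across shared genes.

    Both inputs are assumed to be genes (rows) x cell lines (columns).
    """
    # Keep only genes and cell lines present in both matrices, matched by label.
    genes = expr_a.index.intersection(expr_b.index)
    lines = expr_a.columns.intersection(expr_b.columns)
    a = expr_a.loc[genes, lines]
    b = expr_b.loc[genes, lines]

    # Optionally restrict to the most variable genes (variance taken in expr_a).
    if top_n_variable is not None:
        top = a.var(axis=1).nlargest(top_n_variable).index
        a, b = a.loc[top], b.loc[top]

    # Column-wise correlation: one coefficient per cell line.
    return a.corrwith(b, axis=0)


# Toy check: the same profiles rescaled, with the gene order shuffled.
rng = np.random.default_rng(0)
toy_a = pd.DataFrame(
    rng.random((50, 3)),
    index=[f"g{i}" for i in range(50)],
    columns=["A", "B", "C"],
)
toy_b = (toy_a * 2 + 1).sample(frac=1, random_state=1)[["C", "A"]]
toy_r = per_cell_line_pearson(toy_a, toy_b)
```

Because the rows are matched by gene label, the shuffled order in `toy_b` does not affect the result.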

Could I please have any input on what could be happening here? I would really appreciate any insight.


Hi. Which file from the portal are you using? We report expression using log2(TPM+1). I think you’d need to do a conversion between that and FPKM in order to get correct correlations.
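
As a rough sketch of such a conversion, the snippet below puts an FPKM matrix on the log2(TPM+1) scale. It assumes the FPKM values are on a linear scale (not already log-transformed) and that the matrix is a pandas DataFrame with genes as rows and samples as columns. The function names are hypothetical. Note that the TPM values are renormalized over whichever genes are present in the matrix, so they may differ slightly from values computed on the full transcriptome.

```python
import numpy as np
import pandas as pd


def fpkm_to_log2_tpm(fpkm):
    """Convert linear-scale FPKM (genes x samples) to log2(TPM + 1)."""
    # Within a sample, TPM is FPKM rescaled so the column sums to one million.
    tpm = fpkm / fpkm.sum(axis=0) * 1e6
    return np.log2(tpm + 1)


def log2_tpm_to_tpm(log_tpm):
    """Invert log2(TPM + 1) to recover linear-scale TPM."""
    return np.power(2.0, log_tpm) - 1


# Toy example with made-up numbers, for illustration only.
toy_fpkm = pd.DataFrame(
    {"line1": [10.0, 30.0, 60.0], "line2": [5.0, 5.0, 40.0]},
    index=["geneA", "geneB", "geneC"],
)
toy_log_tpm = fpkm_to_log2_tpm(toy_fpkm)
toy_tpm = log2_tpm_to_tpm(toy_log_tpm)
```

Once both datasets are on the same scale, the per-cell-line correlations can be recomputed on matched genes.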