Accessing raw sequencing data

Users have asked us whether they can access the raw sequencing data (BAM/FASTQ files) for our cell lines.

A subset of our raw sequencing data is available online as part of this CCLE publication. The data can be accessed under the accession number PRJNA523380 on the SRA website. Furthermore, most of these BAM files are available on the GDC legacy portal. You should be able to download them and convert them back to FASTQ if that is what you are looking for.
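For the BAM-to-FASTQ conversion mentioned above, here is a minimal Python sketch that wraps `samtools collate` and `samtools fastq`. It assumes paired-end reads, that samtools is installed and on your PATH, and that your samtools version compresses output when the file name ends in `.gz`. The function names and the `_R1`/`_R2` output naming are only illustrative, not something we provide.

```python
import subprocess
from pathlib import Path


def bam_to_fastq_commands(bam, out_dir):
    """Build the two samtools commands used to turn one BAM into paired FASTQ.

    Hypothetical helper: output names are derived from the BAM file name.
    """
    bam = Path(bam)
    out_dir = Path(out_dir)
    stem = bam.stem  # file name without the trailing ".bam"
    # collate groups mates together; -u = uncompressed, -O = write to stdout
    collate_cmd = ["samtools", "collate", "-u", "-O", str(bam)]
    # fastq splits read 1 / read 2; unpaired and singleton reads are discarded
    fastq_cmd = [
        "samtools", "fastq",
        "-1", str(out_dir / f"{stem}_R1.fastq.gz"),
        "-2", str(out_dir / f"{stem}_R2.fastq.gz"),
        "-0", "/dev/null",
        "-s", "/dev/null",
        "-n",  # keep read names as they are (no /1, /2 suffix)
        "-",   # read the collated stream from stdin
    ]
    return collate_cmd, fastq_cmd


def bam_to_fastq(bam, out_dir):
    """Run `samtools collate | samtools fastq` for a single BAM file."""
    collate_cmd, fastq_cmd = bam_to_fastq_commands(bam, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    collate = subprocess.Popen(collate_cmd, stdout=subprocess.PIPE)
    try:
        subprocess.run(fastq_cmd, stdin=collate.stdout, check=True)
    finally:
        collate.stdout.close()
    if collate.wait() != 0:
        raise RuntimeError(f"samtools collate failed for {bam}")
```

Calling `bam_to_fastq("sample.bam", "fastq_out")` would then write `sample_R1.fastq.gz` and `sample_R2.fastq.gz` into `fastq_out`. For single-end data the `-1`/`-2` split does not apply and the command would need adjusting.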

For the remaining cell lines, at the moment we are not able to share raw sequencing data due to regulations related to patient privacy and MTAs. We are currently working on establishing a protocol for sharing such data, and hope to have data available in appropriate repositories in the near future.

Hi,

Sorry to resurrect an old thread - I was just wondering if there was an update on a protocol for sharing the raw sequencing data (or even bam files) for the remaining cell lines? Thank you!

Best,
Alex

Hey Alex,

Being able to share BAM files of new lines requires very complex legal agreements. This is because the data becomes identifiable (e.g. you could potentially trace relatives of the donor).

However, we have recently set up another way to access our data. It is at a very early stage and should not contain any additional lines, but the data might be a bit more up to date and more complete.

You would need to use Google Cloud and Terra, though:
https://app.terra.bio/#workspaces/fccredits-silver-tan-7621/CCLE_v2/data

Once we are able to share more raw sequencing data, we will make an announcement.

Hope it helps!

Best,

Hi Jérémie,

Thank you for the information, that is very helpful! Indeed this approach gave me around 150 more cell lines than I had in the earlier CCLE analysis.

Interestingly, I found just one cell line that I had in the previous (dbGaP-based, ~2019) release that wasn’t in the updated sample sheet: HCC1588 (ACH-001078). The DepMap page does not list RNA-seq for this cell line, but in our database we do have the raw reads from it.

Not a huge discrepancy, but just thought I’d flag it - let me know if a separate thread would be more helpful for that.

Thanks!

This is interesting… thanks for reporting this issue. We will update that page as soon as we can!

Best,

Hi, just a quick update on this. I was able to download BAM files for 97 of the cell lines that weren’t in my previous dataset. However, 20 of the lines gave me a 403 access error:

“AccessDeniedException: 403 does not have storage.objects.list access to the Google Cloud Storage bucket.”

Not sure if these are intended to be off limits or not, but figured I’d flag them. The URLs for the bams for these 20 lines were:

gs://cclebams/rnasq_hg38/CDS-010xbm.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-1ZYAcf.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-b68uiO.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-BcQriE.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-d9o7Ib.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-GYZMRK.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-I3QMAf.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-IGrYp9.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-kv9MeK.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-LJhY0o.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-OrDVyD.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-OXLnHs.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-PCHxu4.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-q4Yj5g.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-s5UkkD.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-TIYRBY.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-vQfo3A.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-x4edjw.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-XDECnE.Aligned.sortedByCoord.out.bam
gs://cclebams/rnasq_hg38/CDS-xoxjMy.Aligned.sortedByCoord.out.bam
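In case it helps anyone re-check a list like the one above, here is a rough Python sketch of how the URLs could be split into readable and unreadable ones. It assumes `gsutil` is installed and authenticated, and it relies on `gsutil -q stat` exiting with a non-zero status when an object cannot be read. Note that this also happens when the object simply does not exist, so "unreadable" here does not necessarily mean a 403. The function names are just illustrative.

```python
import subprocess


def is_readable(url, run=subprocess.run):
    """Return True if `gsutil -q stat <url>` exits with status 0.

    `run` is injectable so the check can be exercised without gsutil.
    """
    result = run(["gsutil", "-q", "stat", url], capture_output=True, text=True)
    return result.returncode == 0


def split_by_access(urls, check=is_readable):
    """Split gs:// URLs into (readable, unreadable) lists, keeping input order."""
    readable, unreadable = [], []
    for url in urls:
        (readable if check(url) else unreadable).append(url)
    return readable, unreadable
```

For example, `readable, unreadable = split_by_access(bam_urls)` with `bam_urls` holding the `gs://` paths from the sample sheet would give the two groups without attempting any downloads.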

Are raw KO/KD data available?

Thanks

Hi, I would like to annotate the CCLE RNA-seq data with the recent transcript annotations (GENCODE 46). This is because the available TPM data were processed with 3-year-old transcript annotations (GENCODE 38, guessing from the depmap_omics repo), which miss updated annotations for about 6000 protein-coding genes. That’s about 25% of the coding genes whose transcript annotations are now outdated.

Therefore, I would need to re-process the BAM files with the recent transcript annotations. To do this, I wonder if there are any updates on ease of access to the data. I tried all the options available to an external user, but could not get any closer to the files or the cloud-based workflow. I guess the very complex legal agreements might still be a major limiting factor.

In this case, could you please let me know if the depmap_omics team has any plan to update the transcript annotations and make the updated data available on depmap.org?

Thank you.