Where can I find the raw genomics sequencing data?

jmmcfarl · April 25, 2022, 7:16pm

Many users have asked how they can access the raw genomics data. While we are still working on a solution for sharing access-controlled data, a large amount of genomics data from over 1000 cell lines that were part of the CCLE effort is publicly available (SRA accession number PRJNA523380). To facilitate reanalysis of these data, we have compiled them into a single workspace on the genomics cloud platform Terra.

Since the publication of the CCLE paper, we have generated more WES, WGS, and RNAseq data. Where available, we have added these data to the workspace, and in cases with multiple replicates we have prioritized the newer bam files. We have also newly generated unfiltered (germline and somatic) variant calls, which are included in the workspace as well (see below for more details). Note that the additional data in this workspace cover only the cell lines that were part of the CCLE paper. There may be minor differences between the files provided here and those available on the SRA, due to realignment, resequencing of a line, or deprecation of files after new QC.

The bam (and associated bai) files as well as unfiltered variant calls for the above-mentioned lines can now be accessed through the following Terra workspace:

https://app.terra.bio/#workspaces/fccredits-silver-tan-7621/CCLE_v2

The data can be found in the ‘sample’ table in the workspace. The columns are marked by the reference genome build (hg19/hg38) used for the alignment of the files, as well as the data type. Note that this workspace is not directly accessible for download or processing. To access the data, you will need to clone the workspace with your own billing account. If you are new to Terra and need a place to start, please check this documentation.

Unfiltered variant calls: The column ‘wgs_cnn_filtered_vcf’ in the workspace’s sample table (under the Data tab) contains the VCF files for unfiltered variant calls. These calls are based on HaplotypeCaller and the CNN variant filter, applied to bam files after alignment to the hg38 reference genome.

The alternative route:
If you are more familiar with the Google Cloud Platform and prefer to interact with the cloud storage directly, you can use the gsutil tool to access the data. Please note that the bucket is Requester Pays, so a billing account is required to access the data. For example, to copy the files to a different location you may use the following command (see here for a detailed explanation):

gsutil -u PROJECT_ID cp gs://BUCKET_NAME/OBJECT_NAME OBJECT_DESTINATION

where PROJECT_ID is the billing project associated with your billing account.

To create a billing account please refer to the Google Cloud documentation.
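
When many files are needed, one possible approach is to pull the gs:// paths out of a TSV export of the ‘sample’ table and hand them to gsutil on standard input. The sketch below is only an illustration: extract_gs_urls is a hypothetical helper, and the file name sample.tsv and the destination ./bams/ are placeholders. The copy command itself is left commented out because running it incurs Requester Pays charges.

```shell
# Print every gs:// path found in a tab-separated table, one per line, de-duplicated.
# The pattern stops at whitespace, quotes, commas and brackets so that
# list-valued cells such as ["gs://a","gs://b"] are split correctly.
extract_gs_urls() {
  grep -o 'gs://[^][:space:]",]*' "$1" | sort -u
}

# Hypothetical usage (sample.tsv is a TSV download of the workspace sample table):
#   PROJECT_ID is your own billing project; "cp -I" reads source paths from stdin.
# extract_gs_urls sample.tsv | grep '\.bam$' | gsutil -u PROJECT_ID -m cp -I ./bams/
```

Reviewing the list printed by extract_gs_urls before piping it to gsutil is a cheap way to estimate how much data will be transferred.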

fnigsch · April 20, 2023, 7:34pm

@jmmcfarl Thanks for this post. For the alternative route (via GCP): which storage buckets hold the various data sets? I looked for this information on the DepMap site, including this forum, but could not find anything.

simz · April 20, 2023, 7:41pm

Hi @fnigsch ,

Our data storage bucket has fine-grained access control. Public users do not have read access to the entire bucket, but are able to download some individual files that are made public in this bucket. The locations of these public files can be found in the Terra workspace above.

Thanks,
Simone

aksaw · May 17, 2023, 6:16pm

Hi,
Many thanks for creating a post about the availability of raw genomic sequencing data. As per the post, CCLE project data are publicly available on NCBI (SRA accession number [PRJNA523380]). I found processed data (e.g., OmicsExpressionTranscriptsTPMLogp1Profile.csv, OmicsExpressionTranscriptsExpectedCountProfile.csv, etc.) for the cell lines RBE_BILIARY_TRACT, EGI1_BILIARY_TRACT and TFK1_BILIARY_TRACT, but could not find their raw data on NCBI under accession number PRJNA523380. I would be grateful if you could let us know how to obtain those raw data.

simz · May 25, 2023, 7:29pm

Hi @aksaw,

Thanks for reaching out.

Currently we are only allowed to share raw sequencing data for cell lines included in the CCLE project, which is a subset of DepMap data. However, RBE_BILIARY_TRACT, EGI1_BILIARY_TRACT and TFK1_BILIARY_TRACT were not part of the CCLE project, which is why we can only release processed data for them. We are actively working on releasing more raw data in an access-controlled manner in the future through dbGaP, so please stay tuned!

Best,
Simone

Statue6877 · August 30, 2023, 1:34pm

Is there any way to access the public data in GCP in an authenticated manner, still using a billing account ID, rather than only using the billing ID? The sole GCP transfer method authorized by my security department is one that requires authentication.

Thanks in advance!

simz · September 5, 2023, 3:25pm

Hi @Statue6877, Can you elaborate on what you mean by authenticated? I’m not sure I understand the issue you’re having.

Thanks!
Simone

Statue6877 · September 5, 2023, 4:23pm

Hi Simone. So, my company only has one approved tool for transferring data to/from GCP. This tool requires the use of p12 key files.

Of course, the key files shouldn’t be necessary in this case, since the objects we want to retrieve are public. We should only need the billing account ID. And I know it works fine with just billing ID, since I was able to pull object metadata from my own GCP account.

Unfortunately, there is simply no way to make our data transfer tool work without p12 keys. I’m hoping it is possible for you to generate a p12 key pair that we can use, so we can retrieve the objects with our standard tool. Thanks for your help!

pmontgom · September 5, 2023, 9:12pm

Hello,

If I understand what you’re describing, I believe the lack of a p12 key isn’t actually a problem. You can create one yourself, because any p12 key associated with a service account will work. The key is only used to authenticate to Google, and since these files are all public, it doesn’t matter which user you are authenticated as.

However, I suspect you may run into a problem with our bucket being configured as “requester pays”. Because the BAM files are large and Google charges for data egress, there’s the potential for a non-trivial cost associated with users downloading this data, and thus we require a billing account ID so that Google can charge the downloader instead of us.

If the tool you’re using does not allow you to specify a billing account ID, then we may have a problem, because this is the only mechanism we have for the requester to pay for the transfer.

However, if your tool does allow you to specify a billing account ID, then you should be able to create a service account with the appropriate permissions on that billing account, create a p12 key for the account and then the transfer should work.
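
As a rough sketch of that last path, the gcloud commands below create a service account in your own billing project, allow it to bill requests to that project, and export a p12 key. The account name depmap-downloader, the key file name, and the choice of the roles/serviceusage.serviceUsageConsumer role are assumptions for illustration and should be checked against your organisation’s policy. The helper only prints the commands unless DRY_RUN=0 is set.

```shell
# Print each command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = your billing project ID, $2 = service account name (both placeholders).
make_p12_service_account() {
  sa_email="$2@$1.iam.gserviceaccount.com"
  # 1. Create the service account inside the billing project.
  run gcloud iam service-accounts create "$2" --project="$1"
  # 2. Let it charge Requester Pays requests to that project (assumed role).
  run gcloud projects add-iam-policy-binding "$1" \
    --member="serviceAccount:$sa_email" \
    --role="roles/serviceusage.serviceUsageConsumer"
  # 3. Export a p12 key for the transfer tool.
  run gcloud iam service-accounts keys create "$2.p12" \
    --iam-account="$sa_email" --key-file-type=p12
}

# Dry run: prints the three gcloud commands without contacting Google Cloud.
make_p12_service_account PROJECT_ID depmap-downloader
```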

Thanks,
Phil

Lukas_Simon · January 18, 2025, 11:31pm

Thanks for making the data available!

I cannot find the column ‘wgs_cnn_filtered_vcf’ in the Terra workspace.

simz · January 21, 2025, 3:13pm

Hi,

As a result of a recent update to our mutation pipeline, Mutect2 is now our main mutation caller. Accordingly, we removed the cnn_filtered_vcf columns from this workspace, and the unfiltered variant calls can now be found in the columns mutect2_vcf_wgs and mutect2_vcf_wes.

Thanks,
Simone

firegoby · February 12, 2025, 2:24am

Hello, I haven’t worked with Mutect VCFs before. I downloaded a WGS VCF file and hope to get all somatic mutations from it. Could you please share some info or commands for doing this?

https://gatk.broadinstitute.org/hc/en-us/articles/
Is it FilterMutectCalls?
Thank you

Diego · February 14, 2025, 1:16pm

Hi DepMap team!
I was looking for the raw genomic data for some cell lines that were recently added to DepMap. I found some of them in Terra, but there were others I could not find there (these seem to have been added to DepMap in the 24Q4 release). Do you know if these will be added soon? Or is there any other place where I could find raw genomic data for these cell lines?
Thank you very much in advance!
Diego.

simz · February 18, 2025, 4:54pm

Hi,

Please see our resources page for details on how to access DepMap sequencing data.

Thanks,
Simone

simz · February 18, 2025, 4:58pm

Hi,

The VCF files in the workspace contain both somatic and germline mutations. Since DepMap only has cell line data, we don’t have matched normals and there isn’t an easy way to call something somatic. We provide a somatic mutation file on the DepMap portal, in which we do our best to remove germline mutations by way of looking up their allele frequency in gnomAD. Please see our documentation for details on how we filter mutations.
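
To illustrate the general idea of allele-frequency filtering (this is not the DepMap pipeline itself), the sketch below drops VCF records whose population allele frequency exceeds a cutoff. It assumes the VCF has already been annotated with a population frequency in an INFO key; the key name popaf, the cutoff 0.001, and the file names are placeholders, so substitute whatever your annotation uses.

```shell
# Keep header lines, plus records whose population allele frequency is
# missing or at most the given cutoff. Reads a VCF on stdin, writes to stdout.
#   $1 = INFO key holding the allele frequency (hypothetical name)
#   $2 = maximum allele frequency to keep
filter_common_variants() {
  awk -F'\t' -v key="$1" -v max_af="$2" '
    /^#/ { print; next }                       # pass header lines through
    {
      af = ""
      n = split($8, fields, ";")               # column 8 is INFO
      for (i = 1; i <= n; i++) {
        if (index(fields[i], key "=") == 1) {
          af = substr(fields[i], length(key) + 2)
        }
      }
      # Simplification: for multi-allelic records only the first value is compared.
      if (af == "" || af + 0 <= max_af + 0) print
    }
  '
}

# Hypothetical usage:
# zcat sample.vcf.gz | filter_common_variants popaf 0.001 > rare_variants.vcf
```

Records without the annotation are kept here on the assumption that absence from the population database means the variant is rare; a stricter analysis might treat them separately.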

Thanks,
Simone

firegoby · February 18, 2025, 5:52pm

Thanks, Simone. However, I found that the VCF file on the DepMap portal only contains functional (e.g., missense) somatic mutations, and I’d like to get the other non-synonymous somatic mutations as well.
