Where can I find the raw genomics sequencing data?

Many users have asked how they can access the raw genomics data. While we are still working on a solution for sharing access-controlled data, a large amount of genomics data from over 1000 cell lines that were part of the CCLE effort is publicly available (SRA accession number PRJNA523380). To facilitate reanalysis, we have compiled these data into a single workspace on the genomics cloud platform Terra. Since the publication of the CCLE paper, we have also generated more WES, WGS, and RNAseq data. Where available, we have added these data to the workspace, and in cases with multiple replicates we have prioritized the newer bam files. We have also newly generated unfiltered (germline and somatic) variant calls, which are included in the workspace as well (see below for more details). Note that the additional data in this workspace cover only the cell lines that were part of the CCLE paper. There may be minor differences between the files provided here and those available on the SRA, due to realignment, resequencing of a line, or deprecation of files after new QC.

The bam (and associated bai) files as well as unfiltered variant calls for the above-mentioned lines can now be accessed through the following Terra workspace:

https://app.terra.bio/#workspaces/fccredits-silver-tan-7621/CCLE_v2

The data can be found in the ‘sample’ table in the workspace. The columns are marked by the reference genome build (hg19/hg38) used for the alignment of the files, as well as the data type. Note that this workspace is not directly accessible for download or processing. To access the data, you will need to clone the workspace with your own billing account. If you are new to Terra and need a place to start, please check this documentation.

Unfiltered variant calls: The column ‘wgs_cnn_filtered_vcf’ in the workspace’s ‘sample’ table (under the Data tab) contains the VCF files for the unfiltered variant calls. These calls were generated by applying HaplotypeCaller and the CNN variant filter to bam files aligned to the hg38 reference genome.

The alternative route:
If you are more familiar with the Google Cloud Platform and prefer to interact with the cloud storage directly, you can use the gsutil tool to access the data. Please note that the bucket is Requester Pays, so a billing account is required to access the data. For example, to copy the files to a different location you may use the following command (see here for a detailed explanation):

gsutil -u PROJECT_ID cp gs://BUCKET_NAME/OBJECT_NAME OBJECT_DESTINATION

where PROJECT_ID is the billing project associated with your billing account.

To create a billing account please refer to the Google Cloud documentation.
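
For anyone copying more than a handful of files, the command above can be wrapped in a small loop. The following is a minimal sketch, not an official DepMap script: the function name copy_requester_pays, the DRY_RUN switch, and the input file (one gs:// path per line, for example pasted from the workspace's sample table) are all hypothetical.

```shell
# Hypothetical helper: copy every gs:// URI listed in a text file,
# billing the transfer to your own project (Requester Pays).
#   $1 = billing project ID, $2 = file with one gs:// URI per line,
#   $3 = destination (local directory or another gs:// location)
copy_requester_pays() {
  project_id="$1"
  uri_list="$2"
  dest="$3"
  while IFS= read -r uri; do
    # Skip blank lines in the URI list.
    [ -z "$uri" ] && continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Dry run: only print the command that would be executed.
      echo "gsutil -u $project_id cp $uri $dest"
    else
      gsutil -u "$project_id" cp "$uri" "$dest"
    fi
  done < "$uri_list"
}
```

For example, running `DRY_RUN=1 copy_requester_pays PROJECT_ID uris.txt ./bams/` prints the commands so you can check them before incurring any egress charges; leave DRY_RUN unset to perform the actual copy.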

@jmmcfarl Thanks for this post. For the alternative route (via GCP): which storage buckets hold the various data sets? I looked for this information on the DepMap site, including this forum, but could not find anything.

Hi @fnigsch ,

Our data storage bucket has fine-grained access control. Public users do not have read access to the entire bucket, but are able to download some individual files that are made public in this bucket. The locations of these public files can be found in the Terra workspace above.

Thanks,
Simone

Hi,
Many thanks for creating a post about the availability of raw genomic sequencing data. As per the post, CCLE project data are publicly available on NCBI (SRA accession number PRJNA523380). I found that processed data (e.g., OmicsExpressionTranscriptsTPMLogp1Profile.csv, OmicsExpressionTranscriptsExpectedCountProfile.csv, etc.) are available for the cell lines RBE_BILIARY_TRACT, EGI1_BILIARY_TRACT and TFK1_BILIARY_TRACT, but I could not find their raw data on NCBI under the accession number PRJNA523380. I would be grateful if you could let us know how to obtain those raw data.

Hi @aksaw,

Thanks for reaching out.

Currently we are only allowed to share raw sequencing data for cell lines included in the CCLE project, which is a subset of DepMap data. However, RBE_BILIARY_TRACT, EGI1_BILIARY_TRACT and TFK1_BILIARY_TRACT were not part of the CCLE project, which is why we can only release processed data for them. We are actively working on releasing more raw data in an access-controlled manner in the future through dbGaP, so please stay tuned!

Best,
Simone

Is there any way to access the public data in GCP in an authenticated manner, still using a billing account ID, rather than only using the billing ID? The sole GCP transfer method authorized by my security department is one that requires authentication.

Thanks in advance!

Hi @Statue6877, Can you elaborate on what you mean by authenticated? I’m not sure I understand the issue you’re having.

Thanks!
Simone

Hi Simone. So, my company only has one approved tool for transferring data to/from GCP. This tool requires the use of p12 key files.

Of course, the key files shouldn’t be necessary in this case, since the objects we want to retrieve are public. We should only need the billing account ID. And I know it works fine with just billing ID, since I was able to pull object metadata from my own GCP account.

Unfortunately, there is simply no way to make our data transfer tool work without p12 keys. I’m hoping it is possible for you to generate a p12 key pair that we can use, so we can retrieve the objects with our standard tool. Thanks for your help!

Hello,

If I understand what you’re describing, I believe the lack of a p12 key isn’t actually a problem. You can create one yourself, because any p12 key associated with a service account would work. The key is only used to authenticate to Google, and since these files are all public, it doesn’t matter which user you are authenticated as.

However, I suspect you may run into a problem with our bucket being configured as “requester pays”. Because the BAM files are large and Google charges for data egress, there’s the potential for a non-trivial cost associated with users downloading this data, and thus we require a billing account ID so that Google can charge the downloader instead of us.

If the tool you’re using does not allow you to specify a billing account ID, then we may have a problem, because this is the only mechanism we have for the requester to pay for the transfer.

However, if your tool does allow you to specify a billing account ID, then you should be able to create a service account with the appropriate permissions on that billing account, create a p12 key for the account and then the transfer should work.
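
As a rough sketch of those steps using the gcloud CLI (not something we have tested against your transfer tool): the service account name, the project ID default, the run wrapper and the function name below are placeholders, and the role shown is an assumption about what a service account needs in order to bill requester-pays downloads to your project.

```shell
# Placeholders: override PROJECT_ID and SA_NAME with your own values.
PROJECT_ID="${PROJECT_ID:-my-billing-project}"
SA_NAME="${SA_NAME:-ccle-downloader}"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_p12_service_account() {
  # 1. Create the service account in your billing project.
  run gcloud iam service-accounts create "$SA_NAME" --project="$PROJECT_ID"
  # 2. Let it use the project for billing (assumed role, see lead-in).
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/serviceusage.serviceUsageConsumer"
  # 3. Generate a p12 key file for the transfer tool.
  run gcloud iam service-accounts keys create key.p12 \
    --iam-account="$SA_EMAIL" --key-file-type=p12
}
```

Calling `setup_p12_service_account` with DRY_RUN=1 set only prints the three commands, which is a safe way to review them with your security team first.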

Thanks,
Phil