Where can I find the raw genomics sequencing data?

Many users have asked how to access the raw genomics data. While we are still working on a solution for sharing access-controlled data, a large amount of genomics data from over 1000 cell lines that were part of the CCLE effort is publicly available (SRA accession number PRJNA523380). To facilitate reanalysis, we have compiled these data into a single workspace on the genomics cloud platform Terra. Since the publication of the CCLE paper, we have also generated more WES, WGS, and RNAseq data. Where available, we have added these data to the workspace, and in cases with multiple replicates we have prioritized the newer bam files. We have also newly generated unfiltered (germline and somatic) variant calls, which are included in the workspace as well (see below for more details). Note that the additional data in this workspace cover only the cell lines that were part of the CCLE paper. There may be minor differences between the files provided here and those available on the SRA, owing to realignment, resequencing of a line, or deprecation of files after new QC.

The bam (and associated bai) files as well as unfiltered variant calls for the above-mentioned lines can now be accessed through the following Terra workspace:

https://app.terra.bio/#workspaces/fccredits-silver-tan-7621/CCLE_v2

The data can be found in the ‘sample’ table in the workspace. The columns are marked by the reference genome build (hg19/hg38) used for the alignment of the files, as well as the data type. Note that this workspace is not directly accessible for download or processing. To access the data, you will need to clone the workspace with your own billing account. If you are new to Terra and need a place to start, please check this documentation.

Unfiltered variant calls: The column ‘wgs_cnn_filtered_vcf’ in the workspace ‘sample’ table (under the Data tab) contains the VCF files for the unfiltered variant calls. These calls were generated by applying HaplotypeCaller and the CNN variant filter to bam files aligned to the hg38 reference genome.
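If you have downloaded the ‘sample’ table from your cloned workspace as a tab-separated file, a small helper along these lines may be handy for listing the cloud paths stored in one column. This is only a sketch: the function name column_values and the file name sample.tsv are made up for illustration, and it assumes the file has a single header row with one sample per line.

```shell
#!/usr/bin/env bash
# Print the non-empty values of one named column from a tab-separated
# file that has a header row.
# Usage: column_values FILE COLUMN_NAME
# (column_values is a hypothetical helper, not part of Terra or gsutil.)
column_values() {
  local file="$1" column="$2"
  awk -F'\t' -v col="$column" '
    NR == 1 {
      # Locate the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    # Skip samples that have no file in this column.
    $idx != "" { print $idx }
  ' "$file"
}
```

For example, column_values sample.tsv wgs_cnn_filtered_vcf > vcf_urls.txt would write one path per line for every sample that has a VCF in that column.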

The alternative route:
For users who are more familiar with the Google Cloud Platform and prefer to interact with the cloud storage directly, the gsutil tool can be used to access the data. Please note that the bucket is Requester Pays, so a billing account is required to access the data. For example, to copy files to a different location, you may use the following command (see here for a detailed explanation):

gsutil -u PROJECT_ID cp gs://BUCKET_NAME/OBJECT_NAME OBJECT_DESTINATION

where PROJECT_ID is the billing project associated with your billing account.

To create a billing account please refer to the Google Cloud documentation.
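To copy several files at once, the gsutil command above can be wrapped in a short loop. The sketch below is only an illustration: the function names build_copy_cmd and copy_all and the DRY_RUN variable are invented here, and the script prints the commands by default so that you can check them before any billed transfer takes place.

```shell
#!/usr/bin/env bash
# Build the gsutil command for copying one object from a Requester Pays
# bucket. The -u option names the project that is billed for the request.
build_copy_cmd() {
  local project="$1" source_url="$2" dest="$3"
  printf 'gsutil -u %s cp %s %s\n' "$project" "$source_url" "$dest"
}

# Read gs:// URLs (one per line) from stdin and copy each one to DEST.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
# Usage: copy_all PROJECT_ID DEST < urls.txt
copy_all() {
  local project="$1" dest="$2" url
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue          # ignore blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      build_copy_cmd "$project" "$url" "$dest"
    else
      # Redirect stdin so gsutil cannot consume the remaining URL list.
      gsutil -u "$project" cp "$url" "$dest" < /dev/null
    fi
  done
}
```

For example, DRY_RUN=0 copy_all PROJECT_ID ./data < urls.txt would run the copies for real, where urls.txt is a file you have prepared with one gs:// path per line.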
