Hello,
I’ve got a list of read ids corresponding to a CCLE line RNA-seq fastq file uploaded to SRA. I would like to extract the alignments for these reads from available files that I’ve found on AWS open data. However, I see that the read ids in these bam files are completely different, e.g. “C1EHHACXX130117”, whereas SRA read ids look like “SRR8616111.130117”. Have these alignments been generated from the same original fastq files? If so, is there any way to find these reads in bam files? Thank you in advance!
Sergey
Hi, Sergey. Could you tell me which sample ID/file (in both data sets) you’re looking at?
To be specific, I’m looking for NCIH2228 cell line data. SRA id is SRR8616111, so fastq is here ( SRA Archive: NCBI ) and read ids there are regular SRA ones - SRR8616111.8853031. The RNA-Seq alignment file for this cell line that I’ve found on AWS open data is s3://depmap-omics-ccle/data/rna/bam/G28616.NCI-H2228.1.bam .
According to this wiki page, since at least 2018, SRA has discarded and replaced the read names as part of their data loading process when BAMs are submitted. Even if they’re providing FASTA/FASTQ for a particular accession number, I’m pretty confident that it was a BAM that we originally submitted to them as part of the CCLE study. So for any raw data under the SRP186687 study, we can’t rely on having consistent read IDs.
Thanks a lot, Devin, things are much clearer to me now. Do you know if by any chance the original fastq files used for BAM generation can still be found somewhere?
The data actually originates from the Broad Institute Genomics Platform as BAM/CRAM files, and we keep them in those formats to save space. With few exceptions (a very small number of older RNA-seq BAMs), the BAM/CRAMs retain unmapped reads, so they should be as good as FASTQs and are easily converted to that format with samtools.
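For anyone who wants to script the samtools conversion mentioned above, here is a minimal sketch in Python. It assumes the BAM has already been downloaded locally, that the library is paired-end, and that a reasonably recent samtools is on the PATH (older releases may require a temp-file prefix argument for `collate`). The function names and output file naming are made up for illustration; they are not part of any CCLE or DepMap tooling.

```python
import shutil
import subprocess


def build_bam_to_fastq_commands(bam_path, out_prefix):
    """Return argument lists for `samtools collate | samtools fastq`.

    Assumes paired-end data; out_prefix is a hypothetical naming choice.
    """
    # collate groups mates of a pair next to each other;
    # -u = uncompressed output, -O = write to stdout.
    collate = ["samtools", "collate", "-u", "-O", bam_path]
    # fastq reads the collated stream from stdin ("-").
    # -1/-2 receive the two mates, -0 and -s receive reads that are
    # neither/both READ1 and READ2 or singletons (discarded here),
    # -n leaves read names unchanged (no /1 or /2 suffix appended).
    fastq = [
        "samtools", "fastq",
        "-1", f"{out_prefix}_R1.fastq.gz",
        "-2", f"{out_prefix}_R2.fastq.gz",
        "-0", "/dev/null",
        "-s", "/dev/null",
        "-n",
        "-",
    ]
    return collate, fastq


def bam_to_fastq(bam_path, out_prefix):
    """Run the two-step pipeline; raises if samtools is missing or fails."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools not found on PATH")
    collate_cmd, fastq_cmd = build_bam_to_fastq_commands(bam_path, out_prefix)
    # Pipe collate's stdout straight into fastq's stdin.
    collate = subprocess.Popen(collate_cmd, stdout=subprocess.PIPE)
    fastq = subprocess.run(fastq_cmd, stdin=collate.stdout)
    collate.stdout.close()
    if collate.wait() != 0 or fastq.returncode != 0:
        raise RuntimeError("samtools pipeline failed")
```

With a local copy of the file, a call would look something like `bam_to_fastq("G28616.NCI-H2228.1.bam", "NCIH2228")`. The resulting FASTQ records carry the original read names from the BAM, not the SRA-assigned ones.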