I am working with a single-cell RNA-seq experiment where two cell lines, HT29 (COAD) and KP4 (Pancreas), were multiplexed together before sequencing. The data has been aligned to the GRCh38 reference genome, and now I need genotype information to run demuxlet for sample demultiplexing.
I am looking for VCF files (or compatible genotype files from which I can generate a VCF) for these two cell lines.
What I have found so far: WGS data (but only for HT29, not KP4) here, and HC data for both HT29 and KP4 (but unfortunately, based on this info, it looks like the HC files were generated against hg19).
Alternatively, I was thinking of using one of the following files: OmicsSomaticMutationsMAFProfile.maf or OmicsSomaticMutationsProfile.csv (the latter being MAF-like rather than a true MAF, and therefore perhaps less conveniently convertible to VCF).
I would greatly appreciate any suggestion to solve this problem.
OmicsSomaticMutationsMAFProfile.maf should be a reasonable start. It has the same variants as OmicsSomaticMutationsProfile.csv, except that, as you said, it is a true MAF and has fewer of the annotation columns available in OmicsSomaticMutationsProfile.csv. The caveat is that, to generate this file, we have already filtered out the vast majority of germline and non-coding mutations (see our documentation for details). Because demuxlet can only use variants that are covered by reads, having fewer sites may or may not have a significant impact on demultiplexing.
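As a rough illustration of the MAF-to-VCF step, here is a minimal Python sketch. It assumes the file uses the standard MAF column names (Chromosome, Start_Position, Reference_Allele, Tumor_Seq_Allele2, Tumor_Sample_Barcode), which you should check against the actual header. The function name maf_snvs_to_vcf is made up for this example. It keeps SNVs only, because MAF writes indels with "-" and a VCF needs an anchoring reference base that would have to be looked up in the genome. It also assumes every listed variant is heterozygous (0/1) in the line that carries it and homozygous reference (0/0) in the other line, which is a simplification you may want to revisit.

```python
import csv
import io


def maf_snvs_to_vcf(maf_text, sample_col="Tumor_Sample_Barcode"):
    """Turn the SNV rows of a tab-separated MAF into a multi-sample VCF string.

    Assumptions (not taken from the DepMap documentation):
    - standard MAF column names are present;
    - a listed variant is 0/1 in its sample and 0/0 in every other sample.
    """
    reader = csv.DictReader(
        (line for line in io.StringIO(maf_text) if not line.startswith("#")),
        delimiter="\t",
    )
    bases = set("ACGT")
    sites = {}      # (chrom, pos, ref, alt) -> set of samples carrying alt
    samples = set()
    for row in reader:
        samples.add(row[sample_col])
        ref, alt = row["Reference_Allele"], row["Tumor_Seq_Allele2"]
        if ref not in bases or alt not in bases:
            continue  # skip indels and multi-base substitutions
        key = (row["Chromosome"], int(row["Start_Position"]), ref, alt)
        sites.setdefault(key, set()).add(row[sample_col])

    sample_order = sorted(samples)
    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
        + "\t".join(sample_order),
    ]
    # Plain lexicographic contig sort; reorder to match the BAM if needed.
    for (chrom, pos, ref, alt), carriers in sorted(sites.items()):
        gts = ["0/1" if s in carriers else "0/0" for s in sample_order]
        lines.append(
            "\t".join([chrom, str(pos), ".", ref, alt, ".", "PASS", ".", "GT", *gts])
        )
    return "\n".join(lines) + "\n"
```

Before running demuxlet on the result, you would filter the MAF to the HT29 and KP4 rows. You would also confirm that the coordinates are on GRCh38 and that the contig names and order match your BAM (for example "chr1" versus "1").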
If you do decide to use this file, we recommend pseudobulking your scRNA-seq data after demultiplexing and comparing the expression of each assigned group to DepMap’s bulk expression for the corresponding cell line to confirm that the demultiplexing worked. To do this you might want to restrict the correlation analysis to the most variable genes instead of all genes.
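The check described above could be sketched roughly as follows in Python with NumPy. The function names, the log-CPM normalization, and the default of 2000 genes are all illustrative assumptions, not something prescribed by DepMap. The sketch also assumes the pseudobulk and bulk matrices share the same gene order. DepMap's bulk values are on a different scale from log-CPM, so a rank-based correlation may be a safer choice than the Pearson correlation used here.

```python
import numpy as np


def log_cpm(counts):
    """Library-size normalize each row to counts per million, then log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=-1, keepdims=True)
    return np.log1p(counts / np.maximum(totals, 1.0) * 1e6)


def pseudobulk(cell_by_gene, labels):
    """Sum raw counts over the cells sharing a label (e.g. demuxlet's call)."""
    cell_by_gene = np.asarray(cell_by_gene)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    summed = np.vstack([cell_by_gene[labels == n].sum(axis=0) for n in names])
    return names, log_cpm(summed)


def correlate_with_bulk(pseudo, bulk, n_top=2000):
    """Pearson correlation of each pseudobulk profile with each bulk profile,
    restricted to the genes that vary most across the bulk profiles."""
    pseudo = np.asarray(pseudo, dtype=float)
    bulk = np.asarray(bulk, dtype=float)
    top = np.argsort(bulk.var(axis=0))[::-1][:n_top]
    n_p = pseudo.shape[0]
    corr = np.corrcoef(np.vstack([pseudo[:, top], bulk[:, top]]))
    return corr[:n_p, n_p:]  # rows: pseudobulk groups, columns: bulk lines
```

If the demultiplexing worked, the group assigned to HT29 should correlate more strongly with the HT29 bulk profile than with KP4, and vice versa. With only two bulk profiles the variance ranking is coarse, so selecting variable genes from a larger panel of cell lines may give a more informative gene set.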
As for the other route, we are currently resequencing CCLE cell lines with WGS, and the WGS profile for KP4 will likely be made available in the next update, hopefully within a month. Please stay tuned and refer to our resources page for details on how to access our sequencing data.