Download of co-dependencies

Continuing the discussion from Download of Top Co-dependencies Pearson correlation coefficients possible?:

I have the same question. @aviad sent us here but folks in my lab can’t find the file. If it is in the supplement of a publication or on GEO, can you send a link? thanks, Patrick

My guess is that @Aviad was trying to point you to the instructions I wrote on how to download correlates one at a time.

When you say you have the same question, do you mean you’re interested in getting more of the top co-dependencies, or you’re interested in a bulk download of co-dependencies like @Dietrich was asking about at the end of the thread?

We don’t currently expose the co-deps that we’ve computed as a downloadable file. Just the top 100 correlates for all the profiles we compute correlations for come out to ~15 GB, and they’re stored in an internal format designed to facilitate querying. I notice you are both asking specifically for co-dependencies, so perhaps that could be something we provide as a download.

Or would API access to export a set of genes be more useful?

This seems like a recurring request, and so I’m wondering what the mechanism should be for sharing this.

Hello,

I’m one of the people in @paddisonp's lab who is trying to download your pre-computed associations. Thanks for the reply. We are mainly interested in a bulk download of the correlations between CRISPR (Avana) Gene Effect scores (CERES) for many genes (if possible, even all genes that were in this CRISPR library) and gene expression data. The top 100 gene expression correlations for each gene would be fine, although if we are able to download more that would be better.

On a related note, would it be possible to also include the p-values in this bulk download? I noticed that the p-value for a given association is shown when one clicks to expand the “Linear regression” section, but the p-values are currently not included in the downloads.

Thanks so much!
Pia

Both would be ok. API access would be more flexible. But also the possibility of a bulk download as you described would be great.

An API sounds like the only real option. Flat files that big will be pretty difficult to parse and extract information from without specialized tools. Additionally, for certain genes you may want to go much deeper than the top 100 genes/features. Could you make an API that exposes:

  1. Identifying all the features which are significantly associated with a given feature. With the ability to change the filtering criteria:
    • P-value filter
    • q-value (corrected p-value) filter
    • feature types included (dependencies, RNA expression, mutation, etc.)
  2. A way to use number 1 but for a bunch of genes using a uniform set of filters to generate a network.

The results would come out as a csv with:
geneDep1,feature1,stat(to give direction of relationship),p-value,q-value
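To make the request concrete, here is a minimal sketch of the filtering step in Python. It is not portal code: the function names, the row keys and the default cutoffs are all made up for illustration, and it assumes the q-value is a Benjamini-Hochberg correction computed over every row passed in (before the feature type filter is applied).

```python
import csv
import io


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def filter_associations(rows, max_p=0.05, max_q=0.1, feature_types=None):
    """Keep rows passing the p-value, q-value and feature type filters.

    rows: list of dicts with the (hypothetical) keys
    gene, feature, feature_type, stat, p_value.
    """
    q_values = benjamini_hochberg([row["p_value"] for row in rows])
    kept = []
    for row, q in zip(rows, q_values):
        if feature_types is not None and row["feature_type"] not in feature_types:
            continue
        if row["p_value"] <= max_p and q <= max_q:
            kept.append({**row, "q_value": q})
    return kept


def to_csv(rows):
    """Render filtered rows in the column order proposed above."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["geneDep1", "feature1", "stat", "p-value", "q-value"])
    for row in rows:
        writer.writerow(
            [row["gene"], row["feature"], row["stat"], row["p_value"], row["q_value"]]
        )
    return buf.getvalue()
```

Running the same filter over a list of genes with one set of cutoffs and concatenating the kept rows would give the edge list for the network in number 2.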

Does this sound like a plan @pmontgom and @aviad?

I agree, that sounds like a good goal. What I’m now thinking about is how to get to a path to deliver something like that.

We plan our development roadmap quarterly, and while we can often squeeze in bug fixes and small changes when needed, adding this as described wouldn’t be small. As a result, this is in the queue to be scheduled, and does not yet have any ETA.

Things that I think would be easier given where we are today:

  1. We have the top 100 correlations between several datasets stored in a database. We could add a simple API which will fetch the top 100 correlates for a set of genes. However, you won’t get more than the top 100 and you won’t get a p-value because it’s simply not already stored in the DB.

  2. I could share the python code that the portal uses to compute large tables of correlations from our published files. This would hopefully allow one to compute any correlations you want, but it would require one to be comfortable with python.

The original suggestion sounds like a capability that I think people would find useful, but it’ll require making a few changes to implement, and therefore won’t be something that we’ll be able to get to for a while.

Would one of the two “easy” options I listed be a worthwhile short-term substitute?

Thanks,
Phil

Hi Phil,

Thanks so much for the options. Could you please share the python code? Chris Plaisier said he will be able to use that.

Thanks!
Pia

Sure, I posted some code at https://gist.github.com/pgm/ac2ac4c664ef81200ce49133cc4cee02

This code is modified from the portal’s codebase and will compute the top N correlates for a given gene effect matrix downloaded from the portal.

Running the following would compute the top 10 co-dependencies:

python scripts/correlation_from_csv.py Achilles_gene_effect.csv Achilles_gene_effect.csv --limit 10 out.csv

As run above it will correlate the same matrix against itself, but one could also use the same script to correlate expression against gene effects, or any other matrix which is in the format that we provide in the DepMap downloads.
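For anyone who wants to see the idea before opening the gist, here is a rough standard-library sketch of a top-N correlation. It is not the gist’s code: the function names are invented for illustration, it assumes the CSV has one row per cell line with the ID in the first column and one column per gene, it assumes missing values appear as empty cells or "NA", and it ranks by absolute Pearson correlation, which may differ from what the portal does. For the full matrices you would want numpy/pandas or the script above, since pure Python will be slow.

```python
import csv
import math


def read_matrix(lines):
    """Parse CSV lines into {column name: list of floats or None per row}."""
    reader = csv.reader(lines)
    header = next(reader)
    columns = {name: [] for name in header[1:]}
    for row in reader:
        for name, cell in zip(header[1:], row[1:]):
            columns[name].append(None if cell in ("", "NA") else float(cell))
    return columns


def pearson(xs, ys):
    """Pearson r over rows where both values are present, else None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 3:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def top_correlates(matrix, gene, limit=10):
    """Return up to `limit` (name, r) pairs with the largest |r| against `gene`."""
    scored = []
    for name, values in matrix.items():
        if name == gene:
            continue  # skip the trivial self-correlation
        r = pearson(matrix[gene], values)
        if r is not None:
            scored.append((name, r))
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:limit]
```

Usage would look something like `with open("Achilles_gene_effect.csv", newline="") as f: matrix = read_matrix(f)` followed by `top_correlates(matrix, some_gene, limit=10)`, where `some_gene` is a column name copied from the file’s header.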

thanks,
Phil

Thank you so much, Phil!

Pia