How to download the data with wget or curl?

Dear Team,

I am trying to download the full dataset with a command-line tool such as wget or curl. Is there any way to do this rather than manually downloading the files one by one and then uploading them to a Linux server?

Thanks.

Shicheng

CRISPR_gene_effect.csv
CRISPR_gene_dependency.csv
CCLE_expression.csv
CCLE_gene_cn.csv
CCLE_wes_gene_cn.csv
CCLE_mutations.csv
sample_info.csv
Achilles_metadata.csv
Achilles_gene_effect.csv
Achilles_gene_effect_uncorrected.csv
Achilles_gene_dependency.csv
Achilles_common_essentials.csv
Achilles_guide_efficacy.csv
Achilles_cell_line_efficacy.csv
Achilles_cell_line_growth_rate.csv
CRISPR_dataset_sources.csv
CRISPR_gene_effect.csv
CRISPR_gene_dependency.csv
CRISPR_common_essentials.csv
common_essentials.csv
nonessentials.csv
Achilles_raw_readcounts.csv
Achilles_raw_readcounts_failures.csv
Achilles_logfold_change.csv
Achilles_logfold_change_failures.csv
Achilles_guide_map.csv
Achilles_replicate_map.csv
Achilles_replicate_QC_report_failing.csv
Achilles_dropped_guides.csv
Achilles_high_variance_genes.csv
CCLE_RNAseq_reads.csv
CCLE_expression_full.csv
CCLE_expression.csv
CCLE_expression_transcripts_expected_count.csv
CCLE_expression_proteincoding_genes_expected_count.csv
CCLE_RNAseq_transcripts.csv
CCLE_segment_cn.csv
CCLE_wes_segment_cn.csv
CCLE_gene_cn.csv
CCLE_wes_gene_cn.csv
CCLE_fusions.csv
CCLE_fusions_unfiltered.csv
CCLE_mutations.csv
CCLE_mutations_bool_hotspot.csv
CCLE_mutations_bool_damaging.csv
CCLE_mutations_bool_nonconserving.csv
CCLE_mutations_bool_otherconserving.csv


Yes, it’s possible; you can work out the download URLs using the Network tab in Google Chrome’s Developer Tools.

Here is some Python 3.10 code (written on 2/2/2024) that shows how to download a few files:

import asyncio
import os
import urllib.parse

import httpx  # ==0.26.0

THIS_DIR = os.path.dirname(__file__)
DEPMAP_FIGSHARE_FILES_BASE_URL = (
    "https://ndownloader.figshare.com/files/"
)
DEPMAP_FIGSHARE_DATASET_IDS: dict[str, int] = {
    "Model_v2.csv": 43746708
}
DEPMAP_API_BASE_URL = "https://depmap.org/portal/api"
DEPMAP_CUSTOM_DATASET_IDS: dict[str, str] = {
    "Expression_Public_23Q4": "expression",
    "Proteomics": "proteomics",
}

def _dump_to_file(
    response: httpx.Response, target_file: str | os.PathLike
) -> None:
    with open(target_file, "wb") as f:
        for chunk in response.iter_bytes(chunk_size=8192):
            f.write(chunk)


async def download_file(
    depmap_id: int | str,
    target_file: str | os.PathLike,
    download_timeout: float | None = 60 * 10.0,
) -> None:
    """Download a DepMap file given its ID to a local file."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"{DEPMAP_FIGSHARE_FILES_BASE_URL}{depmap_id}",
            follow_redirects=True,
            timeout=download_timeout,
        )
        response.raise_for_status()
    await asyncio.to_thread(_dump_to_file, response, target_file)


async def download_custom_dataset(
    depmap_id: str,
    target_dir: str | os.PathLike = THIS_DIR,
    add_cell_line_metadata: bool = False,
    preparation_wait_interval: float = 10.0,
    download_timeout: float | None = 60 * 10.0,
) -> None:
    """
    Download a DepMap custom dataset given its ID to a local directory.

    Args:
        depmap_id: DepMap ID for the custom dataset to download.
        target_dir: Target directory to download the file to.
        add_cell_line_metadata: Set True to include cell line metadata
            columns in the CSV (e.g. DepMap ID, cell line display name).
        preparation_wait_interval: Sleep interval (sec) while DepMap
            prepares the download.
        download_timeout: Optional timeout (sec) on the file download.
    """
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{DEPMAP_API_BASE_URL}/download/custom",
            json={
                "datasetId": depmap_id,
                "dropEmpty": True,
                "addCellLineMetadata": add_cell_line_metadata,
            },
            timeout=15.0,
        )
        response.raise_for_status()
        task_id: str = response.json()["id"]
        while True:
            response = await client.get(
                f"{DEPMAP_API_BASE_URL}/task/{task_id}", timeout=15.0
            )
            response.raise_for_status()
            response_data = response.json()
            match response_data["state"]:
                case "PROGRESS":
                    await asyncio.sleep(preparation_wait_interval)
                case "SUCCESS":
                    download_url: str = response_data["result"][
                        "downloadUrl"
                    ]
                    break
                case _:
                    raise NotImplementedError(
                        f"Unexpected state {response_data['state']} for"
                        f" task {task_id}."
                    )
        query_params: dict[str, str] = dict(
            q.split("=", maxsplit=1)
            for q in urllib.parse.urlparse(download_url).query.split(
                "&"
            )
        )
        target_file = os.path.join(
            target_dir, query_params.pop("name")
        )
        response = await client.get(
            download_url,
            timeout=download_timeout,
            params=query_params,
        )
        response.raise_for_status()
    await asyncio.to_thread(_dump_to_file, response, target_file)
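As an aside, the manual query-string parsing above (splitting on "&" and "=") can be done in one step with the standard library's urllib.parse.parse_qsl, which also handles percent-decoding. A small sketch, using a made-up download URL for illustration:

```python
import urllib.parse

# Hypothetical download URL of the same shape the DepMap task API returns;
# the "name" and "token" parameters here are made up for illustration.
download_url = (
    "https://example.org/download?name=Expression_Public_23Q4.csv&token=abc123"
)

# parse_qsl splits and percent-decodes each key=value pair, replacing the
# manual split("&") / split("=", maxsplit=1) loop in the code above.
query_params = dict(
    urllib.parse.parse_qsl(urllib.parse.urlparse(download_url).query)
)

target_name = query_params.pop("name")
print(target_name)  # the remaining query_params can be re-sent with the GET
```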

Another approach, which also requires a little scripting, is to download the files directly from Figshare.

To find a release on Figshare, you can click “View full release details” and you should be shown a screen where we list how to cite the dataset:

Clicking the link will take you to the dataset release on Figshare, and Figshare has a nice API in addition to its UI for downloading files.

So, after navigating to Figshare, we can see from the URL we’re directed to that the article ID for the DepMap 23Q4 release is “24667905” and that we’re currently on version “2”:

https://plus.figshare.com/articles/dataset/DepMap_23Q4_Public/24667905/2

Now we can use that information to request all URLs for all files:

$ curl https://api.figshare.com/v2/articles/24667905/versions/2

{"url_public_html": 
   "https://plus.figshare.com/articles/dataset/DepMap_23Q4_Public/24667905/2", 
   "files": [
      {"id": 43347678, 
       "name": "README.txt", 
       "size": 29103, 
       "is_link_only": false, 
       "download_url": "https://ndownloader.figshare.com/files/43347678", ...

You can then loop through the files field of this response and download each download_url to a file with the given name.

Here’s an example in Python doing this:

import subprocess

import requests

article_id = "24667905"
version = "2"

# Fetch the list of files for this article version from the Figshare API.
article = requests.get(
    f"https://api.figshare.com/v2/articles/{article_id}/versions/{version}"
).json()
for file in article["files"]:
    print(f'downloading {file["name"]} from {file["download_url"]}')
    # -L follows redirects, since the download URL redirects to the file host.
    subprocess.run(
        ["curl", "-L", file["download_url"], "-o", file["name"]], check=True
    )
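If you’d rather stay entirely on the command line, the same loop can be sketched with curl and jq alone (both assumed installed; this is an untested sketch of the same API call, not an official recipe):

```shell
article_id=24667905
version=2

# Ask the Figshare API for this article version, then extract one
# "download_url name" pair per file with jq.
curl -s "https://api.figshare.com/v2/articles/${article_id}/versions/${version}" \
  | jq -r '.files[] | "\(.download_url) \(.name)"' \
  | while read -r url name; do
      echo "downloading ${name}"
      # -L follows redirects to the actual file host.
      curl -L -o "${name}" "${url}"
    done
```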

Thank you!! I think this solves my problem exactly!

Yes, I forgot you can also just click the “download all” button once you’re at Figshare. 🙂
