Extracting AUC Data for Drug-Cell Line Combinations from CTRP CTD^2 Dataset

I am using the data “CTRP CTD^2 All Files”, contained in CTRPv2.0_2015_ctd2_ExpandedDataset.zip.

I am trying to figure out for each combination of drug and cell line what the AUC was.

How can I find this information?

I tried v20.data.curves_post_qc.txt… It contains a column area_under_curve, but I do not understand how to match this to CCLE names or drugs.

According to the readme, v20.data.curves_post_qc.txt contains AUC sensitivity scores and other information “for each cancer cell line and each compound”. That is the information I seek. However, the file does not seem to contain any information about cell lines. How can I know in which cell line each combination of master_cpd_id and area_under_curve was measured?

There is a column called experiment_id, but it is unclear what exactly it denotes. At first I assumed that an experiment involves exactly one master_ccl_id. If that were true, I could identify the cell line this way. But according to v20.meta.per_experiment.txt, experiment_id does not map uniquely onto master_ccl_id. If it did map uniquely, I could further link master_ccl_id to ccl_name (in v20.meta.per_cell_line.txt).

I just want to arrive at entries that would look like this:

ACH-000879 sertindole 0.908

That would tell me that sertindole had an AUC of 0.908 in the ACH-000879 cell line.

I do not know what this style of cell line identifier is called… it’s the same one used in PRISM.

It would be awesome if I could get some advice on how to put the information together, thank you!

Hi there,

I am looking into processing this dataset as well, but haven’t gotten to it yet. For reference, there is this repo from one of the authors:

which looks like it does the mapping you are looking for, but in MATLAB.

At some point I will likely process it in R so if you do make some progress on this please do share how you went about it!

Thank you Jermiah for the information; however, I cannot make sense of this MATLAB code. My question is not so much about implementation anyway (I prefer to program in Python nowadays) but about concepts. I feel I have described the data structures and issues above in detail… I wish someone from DepMap / CTRP who compiled these data would answer.

This dataset was generated outside of the DepMap project and imported into the portal. We did this years ago; the developer who did it has since left the project, and I don’t recall the details myself. The GitHub repo that @jermiah_joseph referred to was written by people involved in the CTD^2 project that generated this data, so they are likely to know more than the DepMap project does. You may want to reach out to the authors of that package for details.

Looking at our old script that was used to transform these files, it looks like we were doing what you describe (mapping experiment_id → master_ccl_id → ccl_name), but you’re saying that experiment_id does not map to a single master_ccl_id?

I just pulled the data and it looks to me like every experiment_id resolves to a unique master_ccl_id.
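For reference, that chain of lookups can be sketched in pandas roughly as follows. This is a minimal sketch, not the actual DepMap script: the function name auc_per_cell_line and the toy tables are made up for illustration, and only the column names come from the files discussed in this thread. The validate="many_to_one" argument makes the merge fail loudly if an experiment_id ever resolves to more than one master_ccl_id.

```python
import pandas as pd


def auc_per_cell_line(curves, per_experiment, per_cell_line):
    """Attach ccl_name to each (master_cpd_id, area_under_curve) row."""
    # An experiment_id may appear on several rows of the per-experiment table,
    # so reduce it to distinct (experiment_id, master_ccl_id) pairs first.
    exp_to_ccl = per_experiment[["experiment_id", "master_ccl_id"]].drop_duplicates()
    ccl_names = per_cell_line[["master_ccl_id", "ccl_name"]].drop_duplicates()

    # validate="many_to_one" raises MergeError if a key on the right is not unique.
    merged = curves.merge(exp_to_ccl, on="experiment_id", how="left",
                          validate="many_to_one")
    merged = merged.merge(ccl_names, on="master_ccl_id", how="left",
                          validate="many_to_one")
    return merged[["ccl_name", "master_cpd_id", "area_under_curve"]]


# Tiny made-up tables, only to show the shape of the join (not real CTRP values).
curves = pd.DataFrame({
    "experiment_id": [1, 1, 2],
    "master_cpd_id": [101, 102, 101],
    "area_under_curve": [10.5, 12.0, 9.25],
})
per_experiment = pd.DataFrame({
    "experiment_id": [1, 1, 2],
    "master_ccl_id": [7, 7, 8],
})
per_cell_line = pd.DataFrame({
    "master_ccl_id": [7, 8],
    "ccl_name": ["CELLA", "CELLB"],
})

result = auc_per_cell_line(curves, per_experiment, per_cell_line)
print(result)
```

With the real files, the three inputs would presumably be loaded with something like pd.read_csv("v20.data.curves_post_qc.txt", sep="\t"), assuming the files are tab-delimited. Compound names could be attached with one more merge on master_cpd_id against the per-compound metadata file, if the download includes one.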

Concretely I did:

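import pandas as pd
per_experiment = pd.read_csv("v20.meta.per_experiment.txt", sep="\t")  # assumes a tab-delimited file
count_per_group = per_experiment[["experiment_id", "master_ccl_id"]].drop_duplicates().groupby("experiment_id").count()
len(count_per_group[count_per_group["master_ccl_id"] > 1])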

And I get zero experiments which have multiple distinct master_ccl_ids. If you’re seeing something different, can you provide an example of an experiment which has a non-unique mapping so I can understand where my logic is wrong?
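If it helps with finding a counterexample, here is a minimal sketch that lists the offending experiment_ids directly. The helper name ambiguous_experiments and the toy rows are made up for illustration. One possible source of confusion (a guess on my part) is that an experiment_id can repeat across rows of the per-experiment table while still pointing at the same master_ccl_id, which is different from a non-unique mapping.

```python
import pandas as pd


def ambiguous_experiments(per_experiment):
    """Return experiment_ids linked to more than one distinct master_ccl_id."""
    n_ccl = per_experiment.groupby("experiment_id")["master_ccl_id"].nunique()
    return n_ccl[n_ccl > 1].index.tolist()


# Made-up rows: experiment 1 repeats with the same cell line (harmless),
# experiment 3 is linked to two different cell lines (a real ambiguity).
toy = pd.DataFrame({
    "experiment_id": [1, 1, 2, 3, 3],
    "master_ccl_id": [7, 7, 8, 9, 10],
})

# Number of rows whose experiment_id already appeared on an earlier row.
repeated_rows = int(toy["experiment_id"].duplicated().sum())
ambiguous = ambiguous_experiments(toy)
print(repeated_rows, ambiguous)
```

Running ambiguous_experiments on the real per_experiment table should return an empty list if the mapping is unique, and otherwise gives concrete experiment_ids to look at.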

Thanks,
Phil