According to the readme, v20.data.curves_post_qc.txt contains AUC sensitivity scores and other information “for each cancer cell line and each compound”. That is the information I am after. However, the file does not actually seem to contain any cell line information. How can I find out in which cell line each combination of master_cpd_id and area_under_curve was measured?
There is a column called experiment_id, but it is unclear what exactly it denotes. At first I assumed that an experiment involves exactly one master_ccl_id. If that were true, I could use it to identify the cell line. But according to v20.meta.per_experiment.txt, experiment_id does not map uniquely onto master_ccl_id. If it did map uniquely, I could further link master_ccl_id to ccl_name (listed in v20.meta.per_cell_line.txt).
I just want to arrive at entries that would look like this:
ACH-000879 sertindole 0.908
That would tell me that sertindole had an AUC of 0.908 in the ACH-000879 cell line.
I do not know what the cell line identifier format in the first field is called… it’s the same one used in PRISM.
It would be awesome if I could get some advice on how to put the information together, thank you!
Thank you Jermiah for the information, but I cannot make sense of this MATLAB code. My question is not so much about implementation anyway (I for one prefer to program in Python nowadays), but about concepts. I feel I have described the data structures and issues above in detail… I wish someone from DepMap / CTRP who compiled these data would answer.
This dataset was generated outside of the DepMap project and we imported it into the portal, but we did this years ago, the developer who did it has since left the project, and I don’t recall the details myself. The GitHub repo that @jermiah_joseph referred to was written by people who were involved in the CTD^2 project which generated this data, so they are likely to know more than the DepMap project does. You may want to consider reaching out to the authors of that package for details.
Looking at our old script that was used to transform these files, it looks like we were doing what you describe (mapping experiment_id → master_ccl_id → ccl_name), but you’re saying that experiment_id does not map to a single master_ccl_id?
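For reference, here is a minimal pandas sketch of that chain of joins. It is not the original DepMap script, just an illustration under some assumptions: that the files are tab-delimited, that v20.meta.per_experiment.txt may list the same experiment_id on more than one row (which would be harmless as long as master_ccl_id is the same on each), and that compound names live in a cpd_name column of v20.meta.per_compound.txt. The function name annotate_curves and the toy values are made up for the example.

```python
import pandas as pd


def annotate_curves(curves, per_experiment, per_cell_line, per_compound):
    """Attach ccl_name and cpd_name to each AUC row (hypothetical helper)."""
    # Keep only the columns needed for the mapping and collapse repeated
    # rows, so that each experiment_id appears once per distinct cell line.
    exp_to_ccl = per_experiment[["experiment_id", "master_ccl_id"]].drop_duplicates()
    ccl_names = per_cell_line[["master_ccl_id", "ccl_name"]].drop_duplicates()
    cpd_names = per_compound[["master_cpd_id", "cpd_name"]].drop_duplicates()

    # validate="many_to_one" makes pandas raise a MergeError if a key on the
    # right-hand side is not unique, i.e. if an experiment_id really did
    # point at more than one master_ccl_id.
    out = curves.merge(exp_to_ccl, on="experiment_id", how="left", validate="many_to_one")
    out = out.merge(ccl_names, on="master_ccl_id", how="left", validate="many_to_one")
    out = out.merge(cpd_names, on="master_cpd_id", how="left", validate="many_to_one")
    return out[["ccl_name", "cpd_name", "area_under_curve"]]


# Toy tables with invented values; the real ones would be loaded with e.g.
# pd.read_csv("v20.data.curves_post_qc.txt", sep="\t")  (assumed tab-delimited)
curves = pd.DataFrame({
    "experiment_id": [1, 1, 2],
    "master_cpd_id": [10, 11, 10],
    "area_under_curve": [0.9, 0.5, 0.7],
})
per_experiment = pd.DataFrame({
    "experiment_id": [1, 1, 2],          # experiment 1 is listed twice
    "run_id": ["r1", "r2", "r3"],
    "master_ccl_id": [100, 100, 200],    # but with the same cell line
})
per_cell_line = pd.DataFrame({
    "master_ccl_id": [100, 200],
    "ccl_name": ["CELL_A", "CELL_B"],
})
per_compound = pd.DataFrame({
    "master_cpd_id": [10, 11],
    "cpd_name": ["cpd_x", "cpd_y"],
})

result = annotate_curves(curves, per_experiment, per_cell_line, per_compound)
print(result)
```

Note that this ends at ccl_name. Getting from ccl_name to an ACH-style identifier would need a separate mapping table, which is not covered here.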
I just pulled the data and it looks to me like every experiment_id resolves to a unique master_ccl_id.
And I get zero experiments which have multiple distinct master_ccl_ids. If you’re seeing something different, can you provide an example of an experiment which has a non-unique mapping so I can understand where my logic is wrong?
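In case it helps to compare results, this is roughly the kind of check I mean, sketched in pandas. The function name ambiguous_experiments and the toy values are made up for illustration; on the real data you would pass in the table read from v20.meta.per_experiment.txt (assumed to be tab-delimited).

```python
import pandas as pd


def ambiguous_experiments(per_experiment):
    """Return experiment_ids that map to more than one distinct master_ccl_id."""
    # Count distinct cell lines per experiment; repeated rows with the same
    # master_ccl_id count only once.
    n_ccl = per_experiment.groupby("experiment_id")["master_ccl_id"].nunique()
    return n_ccl[n_ccl > 1]


# Toy table with invented values: experiment 1 is listed twice with the same
# cell line (fine), experiment 3 is listed with two different cell lines.
toy = pd.DataFrame({
    "experiment_id": [1, 1, 2, 3, 3],
    "master_ccl_id": [100, 100, 200, 300, 301],
})

ambiguous = ambiguous_experiments(toy)
print(ambiguous)
```

An empty result means every experiment_id resolves to a single master_ccl_id, even if the experiment_id itself appears on several rows of the file.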