Hello Community,
I'm not sure whether this exact question has been asked before. After going through some posts I could not find an answer, which is why I'm asking here.
The main question is: How many compounds have been tested, and how many cell lines are included in the dataset? Or, put differently: How do you count the compounds and cell lines?
When I read the description provided for the data (specifically the file ‘Repurposing_Public_24Q2_Extended_Primary_Data_Matrix.csv’), it says that 906 cell lines were tested against (if I'm understanding correctly) 4518 compounds from the primary PRISM screen, 1280 from REP1M, and 234 from REP300. In total this would make 6032 compounds. I am using the ‘Repurposing_Public_24Q2_Extended_Primary_Compound_List.csv’ file to get the metadata of the compounds (the names are of main interest).
I understand that some compounds have multiple BRD IDs because, for example, they were sourced from different vendors.
I counted the names for all compounds and found 6504 drug names, which still does not match the provided numbers. (I even retrieved the SMILES and InChIs but always ended up with more than 6032 compounds.)
Further, the Repurposing_Public_24Q2_Extended_Primary_Data_Matrix file contains 920 unique cell line identifiers. How does that match the 906 cell lines stated in the description? Could you also explain how this number is calculated?
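The kind of counting described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the exact script used and not the way the dataset authors count: the column names `name` and `broad_id`, the helper names, and the toy rows are hypothetical placeholders and do not reflect the actual headers or contents of the compound list.

```python
import csv
from collections import defaultdict


def read_rows(path):
    """Load a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_unique(rows, column):
    """Count distinct non-empty values in `column`.

    Values are compared after stripping whitespace and lowercasing,
    so 'Aspirin' and 'aspirin ' are treated as the same entry.
    """
    values = set()
    for row in rows:
        value = (row.get(column) or "").strip().lower()
        if value:
            values.add(value)
    return len(values)


def ids_per_name(rows, name_col, id_col):
    """Map each normalized name to the set of IDs recorded for it."""
    groups = defaultdict(set)
    for row in rows:
        name = (row.get(name_col) or "").strip().lower()
        if name:
            groups[name].add((row.get(id_col) or "").strip())
    return groups


# Made-up rows standing in for the compound list (hypothetical columns and values).
toy_rows = [
    {"broad_id": "ID-1", "name": "Aspirin"},
    {"broad_id": "ID-2", "name": "aspirin "},
    {"broad_id": "ID-3", "name": "Ibuprofen"},
]

print("distinct names:", count_unique(toy_rows, "name"))
print("distinct IDs:", count_unique(toy_rows, "broad_id"))

# Names that map to more than one ID, e.g. the same drug from several vendors.
duplicated = {n: ids for n, ids in ids_per_name(toy_rows, "name", "broad_id").items() if len(ids) > 1}
print("names with several IDs:", sorted(duplicated))
```

With the real files, `read_rows` would be pointed at the compound list and the column arguments replaced with the actual header names. The same `count_unique` helper could be applied to whichever column holds the cell line identifiers, once it is clear how the data matrix is laid out.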
I'm trying to understand the data as well as possible.
Thank you very much for reading and answering my question.
Kind regards,
Selina