Supplement WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Neural Information Processing Systems 

SIDER is a dataset for predicting side effects from small-molecule structure. It contains 27 classification tasks, corresponding to the 27 system organ classes of the MedDRA classification [1]. A closer look at the system-organ-class level on the MedDRA website reveals the statement: "System Organ Classes (SOCs) which are groupings by aetiology (e.g. [...] In addition, there is a SOC to contain issues pertaining to products and one to contain social circumstances." Indeed, two of the 27 tasks are named "Social circumstances" and "Product issues", which correspond to the two SOCs described in this statement. Predicting such labels from molecular structure alone is futile, so these tasks do not serve the purpose of a benchmarking dataset. Another problematic example in MoleculeNet is the PCBA dataset, originally used in [44]. As the original paper itself states, "It should be noted that we did not perform any preprocessing of our datasets, such as removing potential experimental artifacts". As demonstrated in the main text, removing experimental artifacts is an important step of the data processing pipeline. Further issues with MoleculeNet are discussed in [52].

For Therapeutics Data Commons (TDC) [24], we applied the filters from our pipeline to its small molecule-related tasks and found issues with them. The promiscuity filter was not applied because of its long running time.
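To make the SIDER observation concrete, the following minimal sketch shows one way the two tasks that are unrelated to molecular structure could be excluded before benchmarking. Only the two task names come from the text; the function name, the row layout, the `smiles` key, and the toy label values are assumptions made for illustration, and this is not the pipeline used in the paper.

```python
# Task names quoted in the text; every other identifier below is hypothetical.
NON_STRUCTURAL_TASKS = {"Social circumstances", "Product issues"}


def drop_non_structural_tasks(rows, excluded=NON_STRUCTURAL_TASKS):
    """Return copies of the rows without the excluded task columns."""
    return [
        {key: value for key, value in row.items() if key not in excluded}
        for row in rows
    ]


# Toy records standing in for SIDER rows (label values are made up).
toy_rows = [
    {"smiles": "CCO", "Hepatobiliary disorders": 1,
     "Product issues": 0, "Social circumstances": 0},
    {"smiles": "CC(=O)O", "Hepatobiliary disorders": 0,
     "Product issues": 1, "Social circumstances": 0},
]

cleaned = drop_non_structural_tasks(toy_rows)
remaining_tasks = [key for key in cleaned[0] if key != "smiles"]
print(remaining_tasks)
```

Applied to the full SIDER table, the same filter would leave 25 of the 27 tasks, assuming the two column names match the task names given above.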
