A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design
Neural Information Processing Systems
AI-driven discovery can greatly reduce design time and enhance the effectiveness of new therapeutics. Models that rely on simulators explore broad design spaces but risk violating implicit constraints because they lack experimental priors. For example, in a new analysis across diverse models on the GuacaMol benchmark using supervised classifiers, over 60% of the proposed molecules had a high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines that discover therapeutic entities in relevant paragraphs and summarize the information as concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts and appropriate entity representations (i.e.
Jun-17-2026, 03:48:01 GMT
- Country:
- North America > United States (1.00)
- Genre:
- Overview (0.92)
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Technology: