Representation of Molecules via Algebraic Data Types : Advancing Beyond SMILES & SELFIES
Goldstein, Oliver, March, Samuel
–arXiv.org Artificial Intelligence
We introduce a novel molecular representation through Algebraic Data Types (ADTs) - composite data structures formed through the combination of simpler types that obey algebraic laws. By explicitly considering how the datatype of a representation constrains the operations which may be performed, we ensure meaningful inference can be performed over generative models (programs with sample} and score operations). This stands in contrast to string-based representations where string-type operations may only indirectly correspond to chemical and physical molecular properties, and at worst produce nonsensical output. The ADT presented implements the Dietz representation for molecular constitution via multigraphs and bonding systems, and uses atomic coordinate data to represent 3D information and stereochemical features. This creates a general digital molecular representation which surpasses the limitations of the string-based representations and the 2D-graph based models on which they are based. In addition, we present novel support for quantum information through representation of shells, subshells, and orbitals, greatly expanding the representational scope beyond current approaches, for instance in Molecular Orbital theory. The framework's capabilities are demonstrated through key applications: Bayesian probabilistic programming is demonstrated through integration with LazyPPL, a lazy probabilistic programming library; molecules are made instances of a group under rotation, necessary for geometric learning techniques which exploit the invariance of molecular properties under different representations; and the framework's flexibility is demonstrated through an extension to model chemical reactions. After critiquing previous representations, we provide an open-source solution in Haskell - a type-safe, purely functional programming language.
arXiv.org Artificial Intelligence
Feb-7-2025
- Country:
- Europe > United Kingdom
- England > Oxfordshire > Oxford (0.27)
- North America > United States (0.92)
- Europe > United Kingdom
- Genre:
- Research Report (0.64)
- Industry:
- Energy > Oil & Gas (0.93)
- Health & Medicine
- Pharmaceuticals & Biotechnology (1.00)
- Therapeutic Area (1.00)
- Materials (0.69)
- Technology: