Collaborating Authors

 Butler, Thomas


Multi-scale Sinusoidal Embeddings Enable Learning on High Resolution Mass Spectrometry Data

arXiv.org Artificial Intelligence

Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural product drug discovery, and many other applications. The primary window into the composition of small molecule mixtures is tandem mass spectrometry (MS2), which produces high-sensitivity data with parts-per-million resolution. We adopt multi-scale sinusoidal embeddings of the mass data in MS2, designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state-of-the-art model for spectral library search, the standard task for initial evaluation of MS2 data. We vary the resolution of the input spectra directly by using different floating point representations of the MS2 data, and show that the resulting sinusoidal embeddings are able to learn from the high-resolution portion of the input MS2 data. We apply dimensionality reduction to the embeddings that result from different resolution input masses to show the essential role multi-scale sinusoidal embeddings play in learning from MS2 data. Metabolomics is the study of the small molecule (up to about 1,000 Daltons) contents of complex biological samples. Tandem Mass Spectrometry (MS/MS), in conjunction with chromatography, is one of the most commonly used tools in metabolomics.
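
As a rough illustration of the idea in the abstract above, a multi-scale sinusoidal embedding can be sketched as sine and cosine pairs evaluated at geometrically spaced wavelengths, so that short wavelengths respond to tiny mass differences and long wavelengths capture coarse position. This is a minimal sketch, not the paper's implementation: the function name `sinusoidal_embedding`, the embedding size, and the wavelength range are all assumptions chosen for illustration.

```python
import math


def sinusoidal_embedding(mz, dim=8, min_wavelength=0.001, max_wavelength=10000.0):
    """Embed a scalar m/z value as sin/cos pairs at geometrically spaced wavelengths.

    All defaults are illustrative assumptions, not values taken from the paper.
    """
    if dim < 2 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    n_freqs = dim // 2
    ratio = max_wavelength / min_wavelength
    embedding = []
    for i in range(n_freqs):
        # Interpolate wavelengths geometrically between the finest and coarsest scale.
        exponent = i / (n_freqs - 1) if n_freqs > 1 else 0.0
        wavelength = min_wavelength * ratio ** exponent
        angle = 2.0 * math.pi * mz / wavelength
        embedding.append(math.sin(angle))
        embedding.append(math.cos(angle))
    return embedding


# Two masses that differ only in the fourth decimal place.
fine_a = sinusoidal_embedding(180.0634)
fine_b = sinusoidal_embedding(180.0639)

# With only coarse wavelengths, the same two masses are nearly indistinguishable.
coarse_a = sinusoidal_embedding(180.0634, dim=4, min_wavelength=100.0)
coarse_b = sinusoidal_embedding(180.0639, dim=4, min_wavelength=100.0)
```

In this sketch the shortest wavelengths separate the two masses clearly, while the coarse-only variant barely moves, which is the intuition behind using several scales at once.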


Efficiently predicting high resolution mass spectra with graph neural networks

arXiv.org Artificial Intelligence

The identification of unknown small molecules in complex chemical mixtures is a primary challenge in many areas of chemical and biological science. The standard high-throughput approach to small molecule identification is tandem mass spectrometry (MS/MS), with diverse applications including metabolomics [1], drug discovery [2], clinical diagnostics [3], forensics [4], and environmental monitoring [5]. The key bottleneck in MS/MS is structural elucidation: given a mass spectrum, we must determine the 2D structure of the molecule it represents. This problem is far from solved, and adversely impacts all areas of science that use MS/MS. Typically only 2–4% of spectra are identified in untargeted metabolomics experiments [6], and a recent competition saw no more than 30% accuracy [7]. Because MS/MS is a lossy measurement, and existing training sets are small, direct prediction of structures from spectra is particularly challenging. Therefore the most common approach is spectral library search, which casts the problem as information retrieval [8]: an observed spectrum is queried against a library of spectra with known structures. This provides an informative prior, and has the advantage of easy interpretability as the entire space of solutions is known.
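
The retrieval framing described above can be sketched with a simple baseline: bin each spectrum by mass, score the query against every library entry with cosine similarity, and return the best matches. This is only an illustrative sketch, not the method of the paper; the function names, the 1.0 bin width, and the toy library entries (`compound_a`, `compound_b`) are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumption)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[math.floor(mz / bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse binned spectra stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def library_search(query_peaks, library, bin_width=1.0, top_k=3):
    """Rank library entries by similarity to the query spectrum."""
    query = bin_spectrum(query_peaks, bin_width)
    scored = [
        (cosine_similarity(query, bin_spectrum(peaks, bin_width)), name)
        for name, peaks in library.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical toy library: each entry is a list of (m/z, intensity) peaks.
library = {
    "compound_a": [(100.2, 1.0), (150.4, 0.5)],
    "compound_b": [(200.3, 1.0), (250.6, 0.8)],
}
query = [(100.3, 0.9), (150.5, 0.6)]
results = library_search(query, library, top_k=2)
```

Because every candidate comes from the library, each returned match is a known structure, which is the interpretability advantage the abstract mentions. Coarse binning like this discards fine mass resolution, so it should be read as a baseline only.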


Hidden Biases in Unreliable News Detection Datasets

arXiv.org Artificial Intelligence

Automatic unreliable news detection is a research problem with great potential impact. Recently, several papers have shown promising results on large-scale news datasets with models that only use the article itself without resorting to any fact-checking mechanism or retrieving any supporting evidence. In this work, we take a closer look at these datasets. While they all provide valuable resources for future research, we observe a number of problems that may lead to results that do not generalize in more realistic settings. Specifically, we show that selection bias during data collection leads to undesired artifacts in the datasets. In addition, while most systems train and predict at the level of individual articles, overlapping article sources in the training and evaluation data can provide a strong confounding factor that models can exploit. In the presence of this confounding factor, the models can achieve good performance by directly memorizing the site-label mapping instead of modeling the real task of unreliable news detection. We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap. Using the observations and experimental results, we provide practical suggestions on how to create more reliable datasets for the unreliable news detection task. We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
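
The clean split recommended above can be sketched by assigning whole source sites, rather than individual articles, to either the training or the evaluation set. This is a minimal sketch under assumptions: the `source` field name, the `split_by_source` helper, and the toy articles are hypothetical, and it covers only the site split, not the date split the abstract also suggests.

```python
import random


def split_by_source(articles, test_fraction=0.2, seed=0):
    """Split articles so that no source site appears in both train and test."""
    sources = sorted({article["source"] for article in articles})
    rng = random.Random(seed)
    rng.shuffle(sources)
    # Hold out at least one whole site for evaluation.
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])
    train = [a for a in articles if a["source"] not in test_sources]
    test = [a for a in articles if a["source"] in test_sources]
    return train, test


# Hypothetical toy data: three articles from each of five sites.
articles = [
    {"source": f"site{i}.example", "text": f"article {j}", "label": i % 2}
    for i in range(5)
    for j in range(3)
]
train, test = split_by_source(articles, test_fraction=0.2, seed=0)
train_sources = {a["source"] for a in train}
test_sources = {a["source"] for a in test}
```

With a split like this, a model that only memorizes the site-label mapping gains nothing at evaluation time, since none of the held-out sites were seen during training.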