
Collaborating Authors

 Huber, Florian


MassSpecGym: A benchmark for the discovery and identification of molecules

arXiv.org Artificial Intelligence

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
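To make the retrieval challenge concrete, the short sketch below scores ranked candidate lists by top-k accuracy, one natural metric for this kind of task. It is an illustration only, not the MassSpecGym API; the function name and data layout are assumptions.

# Illustrative sketch (not the MassSpecGym API): scoring a molecule-retrieval
# task by top-k accuracy, where each query spectrum comes with a candidate set
# and the index of the true molecule. Names and data layout are assumptions.

def top_k_accuracy(rankings, true_indices, k=1):
    """rankings: per-query lists of candidate indices sorted best-first;
    true_indices: index of the correct candidate for each query."""
    hits = sum(1 for ranked, truth in zip(rankings, true_indices)
               if truth in ranked[:k])
    return hits / len(true_indices)

# Example: 3 query spectra, 4 candidates each, model's ranked candidate lists.
rankings = [[2, 0, 1, 3], [1, 3, 0, 2], [0, 2, 3, 1]]
true_indices = [2, 0, 3]
print(top_k_accuracy(rankings, true_indices, k=1))  # 0.33...
print(top_k_accuracy(rankings, true_indices, k=3))  # 1.0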


Asymmetries in Financial Spillovers

arXiv.org Machine Learning

Financial shocks, such as the one observed during the global financial crisis, have important domestic and international consequences for macroeconomic aggregates (see, e.g., Dovern and van Roye, 2014; Ciccarelli et al., 2016; Prieto et al., 2016; Gerba et al., 2024). Policymakers in central banks and governmental institutions, who aim to smooth business cycles and thus alleviate the negative effects of adverse financial disruptions, need to understand how such shocks impact the economy and propagate internationally in order to implement policies in a forward-looking manner. The recent literature provides plenty of evidence on the domestic and international effects of US financial shocks (see Balke, 2000; Gilchrist and Zakrajšek, 2012; Cesa-Bianchi and Sokol, 2022). These papers find that financial shocks exert powerful effects on domestic output but also that US-based shocks spill over to foreign economies and trigger declines in international economic activity. Such effects might be subject to time variation (Abbate et al., 2016).


Bayesian Nonlinear Regression using Sums of Simple Functions

arXiv.org Machine Learning

This paper proposes a new Bayesian machine learning model that can be applied to large datasets arising in macroeconomics. Our framework sums over many simple two-component location mixtures. The transition between components is determined by a logistic function that depends on a single threshold variable and two hyperparameters. Each of these individual models accounts for only a minor portion of the variation in the endogenous variables, but many of them combined are capable of capturing arbitrary nonlinear conditional mean relations. Conjugate priors enable fast and efficient inference. In simulations, we show that our approach produces accurate point and density forecasts. In a real-data exercise, we forecast US macroeconomic aggregates and consider the nonlinear effects of financial shocks in a large-scale nonlinear VAR.
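A minimal sketch of the conditional-mean structure described above, assuming illustrative parameter names (mu0, mu1, c, s) rather than the paper's notation: each component switches between two levels through a logistic gate in a single threshold variable, and the components are summed.

import numpy as np

# Hedged sketch of the described structure: a sum over J simple two-component
# location functions, each switching between two levels via a logistic gate in
# one threshold variable. Parameter names and indices are assumptions.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sum_of_simple_functions(X, threshold_idx, mu0, mu1, c, s):
    """X: (n, p) covariates; component j uses column threshold_idx[j]."""
    f = np.zeros(X.shape[0])
    for j in range(len(threshold_idx)):
        gate = sigmoid((X[:, threshold_idx[j]] - c[j]) / s[j])
        f += mu0[j] + (mu1[j] - mu0[j]) * gate   # component j's small contribution
    return f

# Toy example: 200 observations, 5 covariates, 10 weak components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
J = 10
idx = rng.integers(0, 5, size=J)
f = sum_of_simple_functions(X, idx,
                            mu0=rng.normal(scale=0.1, size=J),
                            mu1=rng.normal(scale=0.1, size=J),
                            c=rng.normal(size=J), s=np.full(J, 0.5))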


Predictive Density Combination Using a Tree-Based Synthesis Function

arXiv.org Machine Learning

Bayesian predictive synthesis (BPS) provides a method for combining multiple predictive distributions based on agent/expert opinion analysis theory and encompasses a range of existing density forecast pooling methods. The key ingredient in BPS is a "synthesis" function. This is typically specified parametrically as a dynamic linear regression. In this paper, we develop a nonparametric treatment of the synthesis function using regression trees. We show the advantages of our tree-based approach in two macroeconomic forecasting applications. The first uses density forecasts for GDP growth from the euro area's Survey of Professional Forecasters. The second combines density forecasts of US inflation produced by many regression models involving different predictors. Both applications demonstrate the benefits -- in terms of improved forecast accuracy and interpretability -- of modeling the synthesis function nonparametrically.
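The toy below is a greatly simplified, non-Bayesian analogue of a tree-based synthesis function: a regression tree maps the individual forecasters' point predictions to the realized outcome and is then used to pool new forecasts. The paper itself works with full predictive densities inside the Bayesian predictive synthesis framework; all names and numbers here are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simplified stand-in for a nonparametric synthesis function: a regression
# tree learns how to map the experts' forecasts for period t to the realized
# outcome, then pools the latest forecasts. Not the paper's Bayesian sampler.
rng = np.random.default_rng(1)
T, n_agents = 120, 6
agent_forecasts = rng.normal(size=(T, n_agents))             # columns = experts
outcome = agent_forecasts.mean(axis=1) + 0.3 * rng.normal(size=T)

synth = DecisionTreeRegressor(max_depth=3).fit(agent_forecasts[:-1], outcome[:-1])
pooled_forecast = synth.predict(agent_forecasts[-1:])         # combined forecast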


Grouping Shapley Value Feature Importances of Random Forests for Explainable Yield Prediction

arXiv.org Artificial Intelligence

Explainability in yield prediction helps us fully explore the potential of machine learning models that are already able to achieve high accuracy for a variety of yield prediction scenarios. The data included for the prediction of yields are intricate and the models are often difficult to understand. However, understanding the models can be simplified by using natural groupings of the input features. Grouping can be achieved, for example, by the time the features are captured or by the sensor used to do so. The state-of-the-art for interpreting machine learning models is currently defined by the game-theoretic approach of Shapley values. To handle groups of features, the calculated Shapley values are typically added together, ignoring the theoretical limitations of this approach. We explain the concept of Shapley values directly computed for predefined groups of features and introduce an algorithm to compute them efficiently on tree structures. We provide a blueprint for designing swarm plots that combine many local explanations for global understanding. Extensive evaluation of two different yield prediction problems shows the worth of our approach and demonstrates how we can enable a better understanding of yield prediction models in the future, ultimately leading to mutual enrichment of research and application.
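As a generic illustration of what a Shapley value for a predefined group of features is, the brute-force sketch below treats each group as a single player and uses a marginal-expectation value function over a background sample. It is not the paper's efficient tree-based algorithm; the function names and the toy model are assumptions.

import numpy as np
from itertools import combinations
from math import factorial

# Brute-force group Shapley values: each predefined feature group is one
# "player"; the value of a coalition is the mean prediction when the
# coalition's features are fixed to x and the rest are averaged over a
# background sample. Exponential in the number of groups, so only a sketch.

def group_shapley(model_predict, x, background, groups):
    p = len(groups)
    phi = np.zeros(p)
    def value(coalition):
        Xb = background.copy()
        for g in coalition:
            Xb[:, groups[g]] = x[groups[g]]
        return model_predict(Xb).mean()
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            for S in combinations(others, size):
                w = factorial(size) * factorial(p - size - 1) / factorial(p)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Toy usage with a linear "model" and two feature groups (e.g., per sensor).
rng = np.random.default_rng(2)
background = rng.normal(size=(100, 4))
x = np.array([1.0, 2.0, -1.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3])]
predict = lambda X: X @ np.array([0.5, -0.2, 1.0, 0.3])
print(group_shapley(predict, x, background, groups))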


Enhanced Bayesian Neural Networks for Macroeconomics and Finance

arXiv.org Machine Learning

In recent decades, statistical agencies, governmental institutions and central banks increasingly collect vast datasets. Practitioners and academics rely on these datasets to form forecasts about the future, efficiently tailor policies or improve decisions at the corporate level. However, this abundance of data also gives rise to the curse of dimensionality and questions related to separating signal (i.e., extracting information from important covariates) from noise (i.e., covariates which do not convey meaningful information) are key for carrying out precise inference. Fortunately, the recent literature on statistical and econometric modeling in high dimensions using regularization-based techniques offers a range of solutions (see, e.g., Carvalho et al., 2010; Bhattacharya and Dunson, 2011; Griffin and Brown, 2013; Belmonte et al., 2014; Huber et al., 2021). One key shortcoming, however, is that these models often assume linearity between a given response variable (or in general a vector of responses) and a possibly huge panel of covariates. The reason for this is simplicity in estimation and interpretation. Apart from these very general reasons, allowing for arbitrary functional relations in the conditional mean introduces substantial conceptual challenges.


Extreme Gradient Boosting for Yield Estimation compared with Deep Learning Approaches

arXiv.org Artificial Intelligence

Accurate prediction of crop yield before harvest is of great importance for crop logistics, market planning, and food distribution around the world. Yield prediction requires monitoring of phenological and climatic characteristics over extended time periods to model the complex relations involved in crop development. Remote sensing images provided by various satellites orbiting the Earth are a cheap and reliable way to obtain data for yield prediction. The field of yield prediction is currently dominated by Deep Learning approaches. While the accuracies reached with those approaches are promising, the needed amounts of data and the "black-box" nature can restrict the application of Deep Learning methods. We address these limitations by proposing a pipeline that processes remote sensing images into feature-based representations, which allow the employment of Extreme Gradient Boosting (XGBoost) for yield prediction. A comparative evaluation of soybean yield prediction within the United States shows promising prediction accuracies compared to state-of-the-art yield prediction systems based on Deep Learning. Feature importances expose the near-infrared spectrum of light as an important feature within our models. The reported results hint at the capabilities of XGBoost for yield prediction and encourage future experiments on other crops in regions all around the world.
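A hedged sketch of the general idea, under assumed array shapes and hyperparameters: compress the imagery into per-band, per-timestep histogram features and fit a gradient-boosted regressor on them. This is not the paper's exact pipeline.

import numpy as np
from xgboost import XGBRegressor

# Assumed layout: images is (n_fields, n_timesteps, n_bands, H, W) with
# reflectances in [0, 1]; each (timestep, band) image is summarized by a
# histogram, and the concatenated histograms feed a boosted-tree regressor.

def histogram_features(images, bins=16):
    n, t, b = images.shape[:3]
    feats = np.empty((n, t * b * bins))
    for i in range(n):
        hists = [np.histogram(images[i, ti, bi], bins=bins,
                              range=(0, 1), density=True)[0]
                 for ti in range(t) for bi in range(b)]
        feats[i] = np.concatenate(hists)
    return feats

rng = np.random.default_rng(3)
images = rng.uniform(size=(50, 8, 4, 32, 32))      # 50 fields, 8 dates, 4 bands
yields = rng.normal(loc=3.0, scale=0.5, size=50)   # synthetic yields

X = histogram_features(images)
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, yields)
print(model.feature_importances_[:5])              # inspect feature importances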


Approximate Bayesian inference and forecasting in huge-dimensional multi-country VARs

arXiv.org Machine Learning

The Panel Vector Autoregressive (PVAR) model is a popular tool for macroeconomic forecasting and structural analysis in multi-country applications since it allows for spillovers between countries in a very flexible fashion. However, this flexibility means that the number of parameters to be estimated can be enormous, leading to over-parameterization concerns. Bayesian global-local shrinkage priors, such as the Horseshoe prior used in this paper, can overcome these concerns, but they require the use of Markov Chain Monte Carlo (MCMC) methods, rendering them computationally infeasible in high dimensions. In this paper, we develop computationally efficient Bayesian methods for estimating PVARs using an integrated rotated Gaussian approximation (IRGA). This exploits the fact that whereas own-country information is often important in PVARs, information on other countries is often unimportant. Using an IRGA, we split the posterior into two parts: one involving own-country coefficients, the other involving other-country coefficients. Fast methods such as approximate message passing or variational Bayes can be used on the latter and, conditional on these, the former are estimated with precision using MCMC methods. In a forecasting exercise involving PVARs with up to 18 variables for each of 38 countries, we demonstrate that our methods produce good forecasts quickly.
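The toy below loosely illustrates the two-block logic only: the small own-country block is drawn from a conjugate conditional posterior, while the large other-country block is handled with a fast ridge-type Gaussian approximation and then held fixed. It omits the rotation that defines the actual IRGA and all PVAR-specific structure; every name and number is an assumption.

import numpy as np

# Toy two-block split: y = X_own @ beta_own + X_other @ beta_other + eps.
rng = np.random.default_rng(4)
n, p_own, p_other = 200, 5, 50
X_own = rng.normal(size=(n, p_own))
X_other = rng.normal(size=(n, p_other))
y = (X_own @ rng.normal(size=p_own)
     + 0.05 * X_other @ rng.normal(size=p_other)
     + rng.normal(scale=0.5, size=n))

# Step 1: cheap ridge-type approximation for the large other-country block.
lam = 10.0
beta_other_hat = np.linalg.solve(X_other.T @ X_other + lam * np.eye(p_other),
                                 X_other.T @ y)
y_tilde = y - X_other @ beta_other_hat

# Step 2: exact draws from the conjugate conditional posterior of the small
# own-country block (normal prior, error variance fixed for simplicity).
sigma2, tau2 = 0.25, 1.0
V = np.linalg.inv(X_own.T @ X_own / sigma2 + np.eye(p_own) / tau2)
m = V @ (X_own.T @ y_tilde / sigma2)
draws = rng.multivariate_normal(m, V, size=1000)
print(draws.mean(axis=0))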


Nowcasting in a Pandemic using Non-Parametric Mixed Frequency VARs

arXiv.org Machine Learning

This paper develops Bayesian econometric methods for posterior and predictive inference in a non-parametric mixed frequency VAR using additive regression trees. We argue that regression tree models are ideally suited for macroeconomic nowcasting in the face of the extreme observations produced by the pandemic due to their flexibility and ability to model outliers. In a nowcasting application involving four major countries in the European Union, we find substantial improvements in nowcasting performance relative to a linear mixed frequency VAR. A detailed examination of the predictive densities in the first six months of 2020 shows where these improvements are achieved.
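A rough sketch of the mixed-frequency setup, under assumed data shapes: the three monthly observations of each indicator within a quarter, plus lagged GDP, are used to nowcast current-quarter GDP growth, with a random forest standing in for the paper's Bayesian additive regression trees.

import numpy as np
from sklearn.ensemble import RandomForestRegressor  # stand-in for BART, not the paper's sampler

# Assumed shapes: monthly is (n_quarters, 3 months, n_indicators); the current
# quarter's monthly observations plus last quarter's GDP nowcast current GDP.
rng = np.random.default_rng(5)
n_q, n_ind = 80, 4
monthly = rng.normal(size=(n_q, 3, n_ind))
gdp = 0.5 * monthly[:, :, 0].mean(axis=1) + rng.normal(scale=0.3, size=n_q)

X = np.column_stack([monthly.reshape(n_q, -1)[1:],   # intra-quarter monthly obs
                     gdp[:-1]])                      # lagged quarterly GDP
y = gdp[1:]

model = RandomForestRegressor(n_estimators=300).fit(X[:-1], y[:-1])
nowcast = model.predict(X[-1:])                      # nowcast for the last quarter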


Inference in Bayesian Additive Vector Autoregressive Tree Models

arXiv.org Machine Learning

Vector autoregressive (VAR) models assume linearity between the endogenous variables and their lags. This linearity assumption might be overly restrictive and could have a deleterious impact on forecasting accuracy. As a solution, we propose combining VAR with Bayesian additive regression tree (BART) models. The resulting Bayesian additive vector autoregressive tree (BAVART) model is capable of capturing arbitrary non-linear relations between the endogenous variables and the covariates without much input from the researcher. Since controlling for heteroscedasticity is key for producing precise density forecasts, our model allows for stochastic volatility in the errors. Using synthetic and real data, we demonstrate the advantages of our methods. For Eurozone data, we show that our nonparametric approach improves upon commonly used forecasting models and that it produces impulse responses to an uncertainty shock that are consistent with established findings in the literature.
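The sketch below shows only the model structure: each equation of the VAR gets a nonparametric conditional mean in the lagged endogenous variables, here fitted with gradient boosting as a stand-in for the paper's Bayesian additive regression trees. Stochastic volatility and the full Bayesian treatment are omitted; names and the lag length are assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for BART

def fit_tree_var(Y, n_lags=2):
    """Y: (T, k) endogenous variables; fits one nonparametric equation per variable,
    with all variables' lags as predictors."""
    T, k = Y.shape
    X = np.column_stack([Y[n_lags - l - 1:T - l - 1] for l in range(n_lags)])
    models = [GradientBoostingRegressor().fit(X, Y[n_lags:, j]) for j in range(k)]
    return models, X

rng = np.random.default_rng(6)
Y = rng.normal(size=(200, 3))
models, X = fit_tree_var(Y, n_lags=2)

# One-step-ahead point forecast from the most recent lags (same column order as X).
x_next = np.concatenate([Y[-1], Y[-2]])
forecast = np.array([m.predict(x_next.reshape(1, -1))[0] for m in models])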