Regression
Review for NeurIPS paper: Counterfactual Prediction for Bundle Treatment
Additional Feedback: As mentioned above, I think this method is very nice, but should be framed differently. In particular, the issue being addressed is not confounding _bias_; it is sample inefficiency when estimating the regression model f_{\theta_p}. This distinction is important in the causal inference literature, because a bias does not disappear with sample size. However, in this context, under the unconfoundedness assumption, if the model f_{\theta_p} is sufficiently flexible, it will converge to the same true counterfactual model in the large sample limit regardless of how the data are weighted (this is consistent with the experiments in the paper). In other words, the population risks E_{cf} and E_f^w are minimized at the same function.
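The claim that the weighted and unweighted population risks share a minimizer can be sketched as follows. This is only an illustration under assumptions not stated in the review: squared loss, strictly positive weights $w(x,t)$, and a model class flexible enough to contain the conditional mean. The symbols $w$, $m$, and $f$ are introduced here for the sketch and are not the paper's notation.

```latex
\begin{align*}
% Conditional mean of the outcome; the second equality uses
% unconfoundedness (together with consistency).
m(x,t) &:= \mathbb{E}[Y \mid X = x, T = t] = \mathbb{E}[Y(t) \mid X = x], \\
% Decomposition of the weighted risk; the cross term vanishes because
% \mathbb{E}[Y - m(X,T) \mid X, T] = 0.
\mathbb{E}\big[w(X,T)\,(Y - f(X,T))^2\big]
  &= \mathbb{E}\big[w(X,T)\,(f(X,T) - m(X,T))^2\big]
   + \mathbb{E}\big[w(X,T)\,\mathrm{Var}(Y \mid X, T)\big].
\end{align*}
```

The second term does not depend on $f$, and the first is non-negative and equals zero at $f = m$, so any strictly positive weighting yields the same population minimizer. Under these assumptions the weights can change finite-sample efficiency but not the limit.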
Coherent Local Explanations for Mathematical Optimization
Otto, Daan, Kurtz, Jannis, Birbil, S. Ilker
The surge of explainable artificial intelligence methods seeks to enhance transparency and explainability in machine learning models. At the same time, there is a growing demand for explaining decisions taken through complex algorithms used in mathematical optimization. However, current explanation methods do not take into account the structure of the underlying optimization problem, leading to unreliable outcomes. In response to this need, we introduce Coherent Local Explanations for Mathematical Optimization (CLEMO). CLEMO provides explanations for multiple components of optimization models, the objective value and decision variables, which are coherent with the underlying model structure. Our sampling-based procedure can provide explanations for the behavior of exact and heuristic solution algorithms. The effectiveness of CLEMO is illustrated by experiments for the shortest path problem, the knapsack problem, and the vehicle routing problem.
Efficient distributional regression trees learning algorithms for calibrated non-parametric probabilistic forecasts
Duchemin, Quentin, Obozinski, Guillaume
The prospect of developing trustworthy AI for critical applications in science and engineering requires machine learning techniques that are capable of estimating their own uncertainty. In the context of regression, instead of estimating a conditional mean, this can be achieved by producing a predictive interval for the output, or even by learning a model of the conditional probability $p(y|x)$ of an output $y$ given input features $x$. While this can be done under parametric assumptions with, e.g., generalized linear models, these assumptions are typically too strong, and non-parametric models offer flexible alternatives. In particular, for scalar outputs, directly learning a model of the conditional cumulative distribution function of $y$ given $x$ can lead to more precise probabilistic estimates, and the use of proper scoring rules such as the weighted interval score (WIS) and the continuous ranked probability score (CRPS) leads to better coverage and calibration properties. This paper introduces novel algorithms for learning probabilistic regression trees for the WIS or CRPS loss functions. These algorithms are made computationally efficient thanks to an appropriate use of known data structures, namely min-max heaps, weight-balanced binary trees and Fenwick trees. Through numerical experiments, we demonstrate that the performance of our methods is competitive with alternative approaches. Additionally, our methods benefit from the inherent interpretability and explainability of trees. As a by-product, we show how our trees can be used in the context of conformal prediction and explain why they are particularly well-suited for achieving group-conditional coverage guarantees.
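To make the CRPS concrete, here is a minimal sketch that scores an empirical predictive distribution (for instance, the training targets that fall in one leaf of a tree) against an observed value. It uses the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'| for X, X' drawn independently from F. The function name `crps_empirical` is hypothetical, and this naive quadratic-time version is not the paper's algorithm, which relies on the specialized data structures named above.

```python
def crps_empirical(samples, y):
    """CRPS of the empirical distribution of `samples` at observation `y`.

    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| with X, X' independent
    draws from the empirical distribution. Naive O(n^2) illustration only.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("samples must be non-empty")
    # Mean absolute distance between the forecast samples and the observation.
    term_obs = sum(abs(x - y) for x in samples) / n
    # Half the mean absolute distance between all ordered pairs of samples.
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term_obs - term_spread


# A one-point forecast reduces to the absolute error.
point_score = crps_empirical([0.0], 1.0)
# A two-point forecast that brackets the observation.
spread_score = crps_empirical([0.0, 2.0], 1.0)
```

Lower values are better. With a single sample the score is the absolute error, which is why the CRPS is often described as a generalization of it to distributional forecasts.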
Probing Internal Representations of Multi-Word Verbs in Large Language Models
Kissane, Hassane, Schilling, Achim, Krauss, Patrick
This study investigates the internal representations of verb-particle combinations, called multi-word verbs, within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic properties at different neural network layers. Using the BERT architecture, we analyze the representations of its layers for two different verb-particle constructions: phrasal verbs like 'give up' and prepositional verbs like 'look at'. Our methodology includes training probing classifiers on the internal representations to classify these categories at both word and sentence levels. The results indicate that the model's middle layers achieve the highest classification accuracies. To further analyze the nature of these distinctions, we conduct a data separability test using the Generalized Discrimination Value (GDV). While GDV results show weak linear separability between the two verb types, probing classifiers still achieve high accuracy, suggesting that representations of these linguistic categories may be non-linearly separable. This aligns with previous research indicating that linguistic distinctions in neural networks are not always encoded in a linearly separable manner. These findings computationally support usage-based claims on the representation of verb-particle constructions and highlight the complex interaction between neural network architectures and linguistic structures.
Review for NeurIPS paper: A convex optimization formulation for multivariate regression
Weaknesses: The major weaknesses of the paper are listed below: 1. There are some potential inaccuracies in the description of the algorithm. For example, in Section 3.1, the first equalities in the two lines of equations after line 210 should be \approx instead, right? And does the notation p_{\tau_B} ' denote the sub-gradient of p_{\tau_B}? In general, some more explanations about the linearization here would be helpful.
Review for NeurIPS paper: A convex optimization formulation for multivariate regression
This paper proposes a new parametrization of the multivariate linear regression problem. It shows that under this new parametrization, it is easier to employ sparsity-inducing penalty terms on the inverse covariance matrix. The paper suggests a sequential relaxation algorithm. The reviewers noted the novelty of the approach and numerous strengths. The simulation experiments (in the supplementary material) explore the method in the context of several connectivity scenarios. However, one weakness is the limited exploration of the performance of the model on real data scenarios.
A Classification System Approach in Predicting Chinese Censorship
Prodani, Matt, Ze, Tianchu, Hu, Yushen
This paper uses a classifier to predict whether a Weibo post would be censored on the Chinese internet. Through randomized sampling from \citeauthor{Fu2021} and Chinese tokenizing strategies, we constructed a cleaned Chinese phrase dataset with binary censorship markings. Utilizing various probability-based information retrieval methods on the data, we were able to derive four logistic regression models for classification. Furthermore, we experimented with pre-trained transformers to perform similar classification tasks. After evaluating both the macro-F1 and ROC-AUC metrics, we concluded that the fine-tuned BERT model exceeds other strategies in performance.
Quantifying Correlations of Machine Learning Models
Li, Yuanyuan, Sarna, Neeraj, Lin, Yang
Machine learning models are being extensively used in safety-critical applications where errors from these models could cause harm to the user. Such risks are amplified when multiple machine learning models, which are deployed concurrently, interact and make errors simultaneously. This paper explores three scenarios where error correlations between multiple models arise, resulting in such aggregated risks. Using real-world data, we simulate these scenarios and quantify the correlations in errors of different models. Our findings indicate that aggregated risks are substantial, particularly when models share similar algorithms, training datasets, or foundational models. Overall, we observe that correlations across models are pervasive and likely to intensify with increased reliance on foundational models and widely used public datasets, highlighting the need for effective mitigation strategies to address these challenges.
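One simple way to quantify how strongly two classifiers err together is the Pearson correlation of their per-example error indicators. The sketch below is an assumption made for illustration: the abstract does not say which correlation measure the paper uses, and the names `error_indicators` and `pearson` are hypothetical.

```python
from math import sqrt


def error_indicators(y_true, y_pred):
    """Return 1 where the prediction is wrong and 0 where it is right."""
    return [int(t != p) for t, p in zip(y_true, y_pred)]


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences.

    Returns NaN when either sequence is constant, since the correlation
    is undefined in that case.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / sqrt(var_a * var_b)


# Toy labels and predictions (made up for illustration, not from the paper).
y_true = [1, 0, 1, 0]
pred_a = [1, 0, 0, 1]  # wrong on the last two examples
pred_b = [1, 0, 0, 1]  # makes the same mistakes as pred_a
pred_c = [0, 1, 1, 0]  # wrong exactly where pred_a is right

same_errors = pearson(error_indicators(y_true, pred_a),
                      error_indicators(y_true, pred_b))
opposite_errors = pearson(error_indicators(y_true, pred_a),
                          error_indicators(y_true, pred_c))
```

A value near 1 means the two models tend to fail on the same inputs, which is the situation in which aggregated risk is highest; a value near 0 means their failures are close to independent.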
SyMANTIC: An Efficient Symbolic Regression Method for Interpretable and Parsimonious Model Discovery in Science and Beyond
Muthyala, Madhav R., Sorourifar, Farshud, Peng, You, Paulson, Joel A.
Symbolic regression (SR) is an emerging branch of machine learning focused on discovering simple and interpretable mathematical expressions from data. Although a wide variety of SR methods have been developed, they often face challenges such as high computational cost, poor scalability with respect to the number of input dimensions, fragility to noise, and an inability to balance accuracy and complexity. This work introduces SyMANTIC, a novel SR algorithm that addresses these challenges. SyMANTIC efficiently identifies (potentially several) low-dimensional descriptors from a large set of candidates (from $\sim 10^5$ to $\sim 10^{10}$ or more) through a unique combination of mutual information-based feature selection, adaptive feature expansion, and recursively applied $\ell_0$-based sparse regression. In addition, it employs an information-theoretic measure to produce an approximate set of Pareto-optimal equations, each offering the best-found accuracy for a given complexity. Furthermore, our open-source implementation of SyMANTIC, built on the PyTorch ecosystem, facilitates easy installation and GPU acceleration. We demonstrate the effectiveness of SyMANTIC across a range of problems, including synthetic examples, scientific benchmarks, real-world material property predictions, and chaotic dynamical system identification from small datasets. Extensive comparisons show that SyMANTIC uncovers similar or more accurate models at a fraction of the cost of existing SR methods.
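The idea of a Pareto-optimal set of equations, each with the best-found accuracy for its complexity, can be illustrated with a generic non-dominated filter over (complexity, error) pairs. This is a sketch only: it is not SyMANTIC's implementation, it does not use the information-theoretic complexity measure mentioned above, and the function name `pareto_front` and the candidate values are hypothetical.

```python
def pareto_front(models):
    """Return the non-dominated models, sorted by increasing complexity.

    `models` is a list of (name, complexity, error) tuples. A model is kept
    only if its error is strictly lower than that of every model with equal
    or lower complexity.
    """
    front = []
    best_error = float("inf")
    # Sort by complexity, breaking ties by error, so that a single pass
    # with a running minimum of the error is enough.
    for name, complexity, error in sorted(models, key=lambda m: (m[1], m[2])):
        if error < best_error:
            front.append((name, complexity, error))
            best_error = error
    return front


# Made-up candidate expressions: (name, complexity, error).
candidates = [
    ("a", 1, 0.9),
    ("b", 2, 0.5),
    ("c", 3, 0.6),  # dominated by "b": more complex and less accurate
    ("d", 4, 0.1),
    ("e", 2, 0.7),  # dominated by "b": same complexity, less accurate
]
front = pareto_front(candidates)
front_names = [name for name, _, _ in front]
```

Each surviving entry is a point where adding complexity actually buys accuracy, which is what lets a user pick a trade-off rather than a single "best" expression.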
Type 2 Tobit Sample Selection Models with Bayesian Additive Regression Trees
This paper introduces Type 2 Tobit Bayesian Additive Regression Trees (TOBART-2). BART can produce accurate individual-specific treatment effect estimates. However, in practice, estimates are often biased by sample selection. We extend the Type 2 Tobit sample selection model to account for nonlinearities and model uncertainty by including sums of trees in both the selection and outcome equations. A Dirichlet Process Mixture distribution for the error terms allows for departure from the assumption of bivariate normally distributed errors. Soft trees and a Dirichlet prior on splitting probabilities improve modeling of smooth and sparse data generating processes. We include a simulation study and an application to the RAND Health Insurance Experiment data set.
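For orientation, the structure of a Type 2 Tobit sample selection model with tree-based mean functions can be sketched as below. The notation ($z_i^*$, $y_i^*$, $f_z$, $f_y$, $g$, $\rho$, $\sigma$) is assumed here for illustration and may differ from the paper's; the bivariate normal error line shows the classical assumption that the abstract says is relaxed by the Dirichlet Process Mixture.

```latex
\begin{align*}
% Latent selection and outcome equations with sum-of-trees means.
z_i^* &= f_z(w_i) + \xi_i, & f_z(w) &= \sum_{j=1}^{m_z} g\big(w;\, T_j^{(z)}, M_j^{(z)}\big), \\
y_i^* &= f_y(x_i) + \eta_i, & f_y(x) &= \sum_{j=1}^{m_y} g\big(x;\, T_j^{(y)}, M_j^{(y)}\big), \\
% The outcome is observed only for selected units.
y_i &= y_i^* \ \text{is observed only if } z_i^* > 0, \\
% Classical error assumption, with the selection variance normalized to 1.
\begin{pmatrix} \xi_i \\ \eta_i \end{pmatrix}
  &\sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{pmatrix} \right).
\end{align*}
```

When $\rho \neq 0$, the selected sample is not representative of the outcome equation, which is the source of the selection bias that the model is meant to correct.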