Automatic Outlier Rectification via Optimal Transport

Neural Information Processing Systems

In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data.


Reviews: Mixed vine copulas as joint models of spike counts and local field potentials

Neural Information Processing Systems

The development of flexible methods to model the joint distribution of continuous and discrete random variables is an important problem with many application areas, one of which, as the authors note, is neuroscience. Copulas which allow for both discrete and continuous random variables are one means of approaching this problem, and the development of general and computationally tractable methods for fitting and performing inference with such models is of broad interest. The paper makes multiple methodological contributions, which I find valuable. The proposed family of models seems flexible and likely useful in practice. While others have previously proposed pair copula constructions as well as efficient algorithms for sampling from discrete copulas, the development of pair copula constructions and associated efficient algorithms for sampling and inference for mixed discrete and continuous data is valuable.


Automatic Outlier Rectification via Optimal Transport

Blanchet, Jose, Li, Jiajin, Pelger, Markus, Zanotti, Greg

arXiv.org Machine Learning

In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data. However, this approach does not inform outlier removal with the estimation task, leaving room for improvement. To address this limitation, we propose an automatic outlier rectification mechanism that integrates rectification and estimation within a joint optimization framework. We take the first step to utilize an optimal transport distance with a concave cost function to construct a rectification set in the space of probability distributions. Then, we select the best distribution within the rectification set to perform the estimation task. Notably, the concave cost function we introduce in this paper is the key to making our estimator effectively identify outliers during the optimization process. We discuss the fundamental differences between our estimator and optimal transport-based distributionally robust optimization estimators. Finally, we demonstrate the effectiveness and superiority of our approach over conventional approaches in extensive simulation and empirical analyses for mean estimation, least absolute regression, and the fitting of option implied volatility surfaces.
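The conventional two-stage baseline that the abstract argues against is easy to sketch. The snippet below is a toy illustration of that pipeline (z-score detection, then mean estimation on the cleaned data), not the paper's joint optimal-transport estimator; the data and the 3-sigma threshold are hypothetical.

```python
import random
import statistics

random.seed(0)
# Toy data: inliers around 1.0 plus a few large outliers.
data = [random.gauss(1.0, 0.1) for _ in range(100)] + [10.0, 12.0, 15.0]

# Conventional two-stage procedure: stage 1 detects outliers with a
# z-score rule, stage 2 estimates the mean on the cleaned data.
mu, sigma = statistics.mean(data), statistics.stdev(data)
cleaned = [x for x in data if abs(x - mu) / sigma < 3.0]
two_stage_mean = statistics.mean(cleaned)

print(round(statistics.mean(data), 2), round(two_stage_mean, 2))
```

The raw mean is dragged upward by the three outliers, while the two-stage estimate recovers the inlier mean; the paper's point is that the detection rule here never sees the estimation objective, which the joint framework is designed to fix.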


X Hacking: The Threat of Misguided AutoML

Sharma, Rahul, Redyuk, Sergey, Mukherjee, Sumantrak, Sipka, Andrea, Vollmer, Sebastian, Selby, David

arXiv.org Artificial Intelligence

Machine learning models are increasingly used to make decisions that affect human lives, society and the environment, in areas such as medical diagnosis, criminal justice and public policy. However, these models are often complex and opaque--especially with the increasing ubiquity of deep learning and generative AI--making it difficult to understand how and why they produce certain predictions. Explainable AI (XAI) is a field of research that aims to provide interpretable and transparent explanations for the outputs of machine learning models. The growing demand for model interpretability, along with a trend for 'data-driven' decisions, has the unexpected side-effect of creating an increased incentive for abuse and manipulation. Data analysts may have a vested interest or be pressured to present a certain explanation for a model's predictions, whether to confirm a pre-specified conclusion, to conceal a hidden agenda, or to avoid ethical scrutiny. In this paper, we introduce the concept of explanation hacking or X-hacking, a form of p-hacking applied to XAI metrics. X-hacking refers to the practice of deliberately searching for and selecting models that produce a desired explanation while maintaining 'acceptable' predictive performance, according to some benchmark. Unlike fairwashing attacks, X-hacking does not involve manipulating the model architecture or its explanations; rather it explores plausible combinations of analysis decisions.
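As a toy illustration of the search the abstract describes (not the paper's AutoML pipeline), the sketch below enumerates feature subsets, keeps those whose score clears a hypothetical benchmark, and then cherry-picks a subset whose top-ranked "explanation" is the analyst's desired feature. All variable names, the correlation-based importance proxy, and the thresholds are invented for the example.

```python
import itertools
import random
import statistics

random.seed(1)

# Toy data: y depends strongly on x1, more weakly on x2; x3 is pure noise.
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a + b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

def corr(u, v):
    mu, mv = statistics.mean(u), statistics.mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

features = {"x1": x1, "x2": x2, "x3": x3}
desired, benchmark = "x2", 0.3  # the explanation we "want" and the score bar

# Enumerate analysis decisions (feature subsets); keep any whose score is
# 'acceptable', then select those whose top explanation is the desired one.
candidates = []
for r in (1, 2, 3):
    for subset in itertools.combinations(features, r):
        ranked = sorted(subset, key=lambda f: -abs(corr(features[f], y)))
        score = abs(corr(features[ranked[0]], y))  # crude stand-in for accuracy
        if score >= benchmark:
            candidates.append((subset, ranked[0], score))

hacked = [c for c in candidates if c[1] == desired]
print(hacked)
```

Simply dropping x1 from the analysis makes x2 the top-ranked explanation while the score stays above the bar: a defensible-looking but cherry-picked result of exactly the kind the paper calls X-hacking.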


Harnessing Data Augmentation to Quantify Uncertainty in the Early Estimation of Single-Photon Source Quality

Kedziora, David Jacob, Musiał, Anna, Rudno-Rudziński, Wojciech, Gabrys, Bogdan

arXiv.org Artificial Intelligence

Novel methods for rapidly estimating single-photon source (SPS) quality have been promoted in recent literature to address the expensive and time-consuming nature of experimental validation via intensity interferometry. However, the frequent lack of uncertainty discussions and reproducible details raises concerns about their reliability. This study investigates the use of data augmentation, a machine learning technique, to supplement experimental data with bootstrapped samples and quantify the uncertainty of such estimates. Eight datasets obtained from measurements involving a single InGaAs/GaAs epitaxial quantum dot serve as a proof-of-principle example. Analysis of one of the SPS quality metrics derived from efficient histogram fitting of the synthetic samples, i.e. the probability of multi-photon emission events, reveals significant uncertainty contributed by stochastic variability in the Poisson processes that describe detection rates. Ignoring this source of error risks severe overconfidence in both early quality estimates and claims for state-of-the-art SPS devices. Additionally, this study finds that standard least-squares fitting is comparable to using a Poisson likelihood, and expanding averages show some promise for early estimation. Also, reducing background counts improves fitting accuracy but does not address the Poisson-process variability. Ultimately, data augmentation demonstrates its value in supplementing physical experiments; its benefit here is to emphasise the need for a cautious assessment of SPS quality.
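A minimal sketch of the parametric-bootstrap idea follows; the counts, peak layout, and metric are illustrative assumptions, not the paper's datasets. Each observed histogram count is treated as the rate of a fresh Poisson draw, and the spread of a g2(0)-like metric across resamples quantifies the detection-rate variability the abstract warns about.

```python
import math
import random
import statistics

random.seed(2)

def poisson(lam):
    # Knuth's Poisson sampler (the Python stdlib has no Poisson draw).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Hypothetical coincidence histogram: six side peaks (~400 counts each)
# and a suppressed centre peak (~20 counts) from an antibunched source.
side_counts = [poisson(400) for _ in range(6)]
centre_count = poisson(20)

def metric(centre, sides):
    # Proxy for the multi-photon emission probability: centre / mean side peak.
    return centre / statistics.mean(sides)

# Parametric bootstrap: resample every bin from Poisson(observed count)
# and recompute the metric each time.
boot = sorted(
    metric(poisson(centre_count), [poisson(s) for s in side_counts])
    for _ in range(500)
)
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"95% interval for the metric: [{lo:.3f}, {hi:.3f}]")
```

Even with several hundred counts per side peak, the interval is wide relative to the point estimate; reporting only the point estimate invites exactly the overconfidence the study describes.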


Ellipsoid fitting with the Cayley transform

Melikechi, Omar, Dunson, David B.

arXiv.org Machine Learning

We introduce Cayley transform ellipsoid fitting (CTEF), an algorithm that uses the Cayley transform to fit ellipsoids to noisy data in any dimension. Unlike many ellipsoid fitting methods, CTEF is ellipsoid specific, meaning it always returns elliptic solutions, and can fit arbitrary ellipsoids. It also significantly outperforms other fitting methods when data are not uniformly distributed over the surface of an ellipsoid. Inspired by growing calls for interpretable and reproducible methods in machine learning, we apply CTEF to dimension reduction, data visualization, and clustering in the context of cell cycle and circadian rhythm data and several classical toy examples. Since CTEF captures global curvature, it extracts nonlinear features in data that other machine learning methods fail to identify. For example, on the clustering examples CTEF outperforms 10 popular algorithms.
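The Cayley transform at the heart of CTEF maps a skew-symmetric matrix S to a rotation R = (I - S)(I + S)^(-1), which lets the method parametrise ellipsoid orientation without trigonometric angles. A minimal 2x2 sketch (the paper works in arbitrary dimension, and the value of s here is arbitrary):

```python
# Any 2x2 skew-symmetric matrix has the form [[0, s], [-s, 0]].
s = 0.7
S = [[0.0, s], [-s, 0.0]]
I = [[1.0, 0.0], [0.0, 1.0]]

def add(A, B): return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
def sub(A, B): return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
def inv(A):
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

# Cayley transform: skew-symmetric S -> rotation R.
R = mul(sub(I, S), inv(add(I, S)))

# Sanity checks: R is orthogonal (R^T R = I) with determinant +1.
Rt = [[R[0][0], R[1][0]], [R[0][1], R[1][1]]]
RtR = mul(Rt, R)
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
print(RtR, det)
```

Because the eigenvalues of a skew-symmetric S are purely imaginary, (I + S) is always invertible, so every S yields a valid rotation; this is one reason the parametrisation can stay ellipsoid-specific.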


Token-Level Fitting Issues of Seq2seq Models

Bao, Guangsheng, Teng, Zhiyang, Zhang, Yue

arXiv.org Artificial Intelligence

Sequence-to-sequence (seq2seq) models have been widely used for natural language processing, computer vision, and other deep learning tasks. We find that seq2seq models trained with early-stopping suffer from issues at the token level. In particular, while some tokens in the vocabulary demonstrate overfitting, others underfit when training is stopped. Experiments show that the phenomena are pervasive in different models, even in fine-tuned large pretrained-models. We identify three major factors that influence token-level fitting, which include token frequency, parts-of-speech, and prediction discrepancy. Further, we find that external factors such as language, model size, domain, data scale, and pretraining can also influence the fitting of tokens.
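A sketch of the kind of token-level diagnostic the abstract implies; the per-token losses and both thresholds below are invented for illustration, not taken from the paper's experiments.

```python
# Hypothetical per-token mean losses from a seq2seq model at early stopping.
train_loss = {"the": 0.4, "quantum": 2.0, "cat": 0.6, "entropy": 2.4, "run": 1.1}
valid_loss = {"the": 0.5, "quantum": 4.1, "cat": 0.7, "entropy": 2.6, "run": 1.2}

def classify(tok, fit_gap=1.0, high_loss=2.0):
    gap = valid_loss[tok] - train_loss[tok]
    if gap > fit_gap:
        return "overfit"   # memorised in training, fails on validation
    if valid_loss[tok] > high_loss:
        return "underfit"  # loss still high everywhere when training stopped
    return "fitted"

report = {tok: classify(tok) for tok in train_loss}
print(report)
```

Stratifying such a report by token frequency or part of speech would reproduce, in miniature, the kind of analysis of fitting factors the paper performs.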


Autocorrelations Decay in Texts and Applicability Limits of Language Models

Mikhaylovskiy, Nikolay, Churilov, Ilya

arXiv.org Artificial Intelligence

To avoid any terminological doubt, when we write "models of the language", we refer to any models that explain some linguistic phenomena, while "language models" refer to probabilistic language models as defined in Subsection 2.3 Probabilistic Language Models. While not long ago probabilistic language models were just models that assign probabilities to sequences of words [4], now they are the cornerstone of any task in computational linguistics through few-shot learning [6], prompt engineering [38] or fine-tuning [13]. On the other hand, current language models fail to catch long-range dependencies in the text consistently. For example, text generation with a maximum likelihood target leads to rapid text degeneration, and consistent text generation requires probabilistic sampling and other tricks [22]. Large language models such as GPT-3 [6] push the boundary of "short text" rather far (specifically, to 2048 tokens), but do not remove the problem. Our contributions in this work are the following: We explain how the laws of autocorrelations decay in texts are related to the applicability of language models to long texts; We pioneer the use of pretrained word vectors for autocorrelation computations, which allows us to study the widest range of autocorrelation distances; We show that the autocorrelations in literary texts decay according to power laws for all these distances; We show that distributional semantics typically provides coherent autocorrelation decay exponents for texts translated to multiple languages, unlike earlier flawed approaches; We show that the behavior of autocorrelations decay in generated texts is quantitatively and often qualitatively different from the literary texts.
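The word-vector autocorrelation computation can be sketched as the mean cosine similarity between token vectors a fixed distance apart. The snippet below uses a synthetic AR(1) vector sequence as a stand-in for pretrained embeddings of a real text; the dimension, length, and decay rate are arbitrary assumptions.

```python
import math
import random

random.seed(3)

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Stand-in "word vectors": an AR(1) sequence in R^8, so nearby tokens are
# correlated and distant ones are not (mimicking topical drift in a text).
dim, n, rho = 8, 500, 0.9
vecs = [[random.gauss(0, 1) for _ in range(dim)]]
for _ in range(n - 1):
    prev = vecs[-1]
    vecs.append([rho * p + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
                 for p in prev])

def autocorr(vectors, d):
    # Mean cosine similarity between word vectors d positions apart.
    sims = [cos(vectors[i], vectors[i + d]) for i in range(len(vectors) - d)]
    return sum(sims) / len(sims)

print([round(autocorr(vecs, d), 3) for d in (1, 5, 20, 100)])
```

On this synthetic sequence the correlation decays towards zero roughly exponentially with distance; the paper's finding is that in literary texts the decay instead follows a power law, which is what constrains language models on long texts.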


Learn Linear Regression For Machine Learning

#artificialintelligence

Machine learning allows an algorithm to become more accurate at predicting outcomes without being explicitly programmed to do so. Prediction is just one of the things ML can do; there is much more to it, and as you go deeper you'll learn all about it. You can read my machine learning posts here. So far we've done a lot with data: we've handled missing values and string data, and we'll cover much more in future posts.
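The simplest place to start is univariate linear regression, which has a closed-form least-squares solution and needs no library at all. The numbers below are made up so that the fitted line should come out near y = 2x:

```python
# Ordinary least squares for y ~ slope*x + intercept, via the closed-form
# normal equations: slope = cov(x, y) / var(x), intercept = mean residual.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
print(round(slope, 2), round(intercept, 2))
```

Once the closed form makes sense, moving to multiple features is the same idea with matrices, which is where libraries start to earn their keep.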