Collaborating Authors

 Burg, Gerrit J. J. van den


Aligning Black-box Language Models with Human Judgments

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs offer an efficient solution for continuous, automated evaluation. However, since the systems that are built and improved with these judgments are ultimately designed for human use, it is crucial that LLM judgments align closely with human evaluators to ensure such systems remain human-centered. On the other hand, aligning LLM judgments with human evaluators is challenging due to individual variability and biases in human judgments. We propose a simple yet effective framework to align LLM judgments with individual human evaluators or their aggregated judgments, without retraining or fine-tuning the LLM. Our approach learns a linear mapping between the LLM's outputs and human judgments, achieving over 142% average improvement in agreement across 29 tasks with only a small number of calibration examples used for training. Notably, our method works in zero-shot and few-shot settings, exceeds inter-human agreement on four out of six tasks, and enables smaller LLMs to achieve performance comparable to that of larger models.
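The core idea of learning a linear map from LLM outputs to human judgments can be sketched with ordinary least squares. This is a minimal illustration, not the paper's implementation; the function names and toy scores are invented for the example.

```python
# Hypothetical sketch: calibrating LLM judge scores to human ratings with a
# least-squares linear map y ≈ a*x + b. Names and data are illustrative only.
import numpy as np

def fit_linear_calibration(llm_scores, human_scores):
    """Fit y ≈ a * x + b by ordinary least squares."""
    x = np.asarray(llm_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def apply_calibration(llm_scores, a, b):
    return a * np.asarray(llm_scores, dtype=float) + b

# A few calibration examples where the LLM systematically over-scores by ~1.
llm = [4.0, 3.0, 5.0, 2.0]
human = [3.0, 2.0, 4.0, 1.0]
a, b = fit_linear_calibration(llm, human)
print(apply_calibration([4.0], a, b))  # maps the LLM's 4.0 to ~3.0
```

Because only two parameters are fit, a handful of calibration examples suffices, which matches the few-shot setting the abstract describes.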


Efficient Pointwise-Pairwise Learning-to-Rank for News Recommendation

arXiv.org Artificial Intelligence

News recommendation is a challenging task that involves personalization based on the interaction history and preferences of each user. Recent works have leveraged the power of pretrained language models (PLMs) to directly rank news items by using inference approaches that predominantly fall into three categories: pointwise, pairwise, and listwise learning-to-rank. While pointwise methods offer linear inference complexity, they fail to capture crucial comparative information between items that is more effective for ranking tasks. Conversely, pairwise and listwise approaches excel at incorporating these comparisons but suffer from practical limitations: pairwise approaches are either computationally expensive or lack theoretical guarantees, and listwise methods often perform poorly in practice. In this paper, we propose a novel framework for PLM-based news recommendation that integrates both pointwise relevance prediction and pairwise comparisons in a scalable manner. We present a rigorous theoretical analysis of our framework, establishing conditions under which our approach guarantees improved performance. Extensive experiments show that our approach outperforms the state-of-the-art methods on the MIND and Adressa news recommendation datasets.
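One common way to combine the two regimes, shown here only as an illustration of the trade-off the abstract describes (this is not the paper's algorithm), is to rank everything with cheap pointwise scores and then spend the expensive pairwise comparisons only on the top of the list.

```python
# Illustrative hybrid ranker (not the paper's method): pointwise scores give
# a full ranking in O(n log n); pairwise comparisons, which are costlier but
# more informative, refine only the top_k slice with O(top_k^2) calls.
def rank_hybrid(items, pointwise_score, pairwise_prefer, top_k=3):
    ranked = sorted(items, key=pointwise_score, reverse=True)
    head, tail = ranked[:top_k], ranked[top_k:]
    # Bubble-sort the head using the pairwise preference oracle.
    for i in range(len(head)):
        for j in range(len(head) - 1 - i):
            if pairwise_prefer(head[j + 1], head[j]):
                head[j], head[j + 1] = head[j + 1], head[j]
    return head + tail

# Toy example: pointwise scores are noisy; the pairwise oracle is reliable.
true_rank = {"a": 3, "b": 2, "c": 1, "d": 0}
noisy = {"a": 0.7, "b": 0.9, "c": 0.2, "d": 0.1}
out = rank_hybrid(list("abcd"), noisy.get,
                  lambda x, y: true_rank[x] > true_rank[y])
print(out)  # ['a', 'b', 'c', 'd']
```

The sketch shows why the combination scales: pairwise cost is confined to a constant-size head, keeping overall inference near-linear in the number of candidates.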


On Memorization in Probabilistic Deep Generative Models

arXiv.org Machine Learning

In the last few years, there have been incredible successes in generative modeling through the development of deep learning techniques such as variational autoencoders (VAEs) [1, 2], generative adversarial networks (GANs) [3], normalizing flows [4, 5], and diffusion networks [6], among others. The goal of generative modeling is to learn the data distribution of a given data set, which has numerous applications such as creating realistic synthetic data, correcting data corruption, and detecting anomalies. Novel architectures for generative modeling are typically evaluated on how well a complex, high-dimensional data distribution can be learned by the model and how realistic the samples from the model are. An important question in the evaluation of generative models is to what extent observations from the training data are memorized by the learning algorithm. A common technique to assess memorization in deep generative models is to look for nearest neighbors. Typically, several samples are drawn from a trained model and compared to their nearest neighbors in the training set. There are several problems with this approach. First, it has been well established that when using the Euclidean metric this test can be easily fooled by taking an image from the training set and shifting it by a few pixels [7]. For this reason, nearest neighbors in the feature space of a secondary model are sometimes used, as well as cropping and/or downsampling before identifying nearest neighbors (e.g.
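The nearest-neighbor memorization check described above, the very test the passage notes can be fooled, can be sketched as follows; the data and function names are invented for the example.

```python
# Sketch of the nearest-neighbor memorization check: compare each generated
# sample to its closest training example under the Euclidean metric.
import numpy as np

def nearest_neighbor_distances(samples, train):
    """Return (distance, index) of each sample's nearest training example."""
    samples = np.asarray(samples, dtype=float)
    train = np.asarray(train, dtype=float)
    # Pairwise squared distances via broadcasting: shape (n_samples, n_train).
    d2 = ((samples[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return np.sqrt(d2[np.arange(len(samples)), idx]), idx

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 8))
# A "memorized" sample: a training point with a tiny perturbation.
memorized = train[7] + 1e-3
dists, idx = nearest_neighbor_distances([memorized], train)
print(idx[0], dists[0])  # recovers index 7 with a very small distance
```

As the abstract points out, a small pixel shift inflates this Euclidean distance without making the sample any less memorized, which is why feature-space distances or cropping are sometimes used instead.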


Fast Meta-Learning for Adaptive Hierarchical Classifier Design

arXiv.org Machine Learning

We propose a new splitting criterion for a meta-learning approach to multiclass classifier design that adaptively merges the classes into a tree-structured hierarchy of increasingly difficult binary classification problems. The classification tree is constructed from empirical estimates of the Henze-Penrose bounds on the pairwise Bayes misclassification rates that rank the binary subproblems in terms of difficulty of classification. The proposed empirical estimates of the Bayes error rate are computed from the minimal spanning tree (MST) of the samples from each pair of classes. Moreover, a meta-learning technique is presented for quantifying the one-vs-rest Bayes error rate for each individual class from a single MST on the entire dataset. Extensive simulations on benchmark datasets show that the proposed hierarchical method can often be learned much faster than competing methods, while achieving competitive accuracy.
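The MST-based quantity underlying the Henze-Penrose bounds can be illustrated with the Friedman-Rafsky statistic: pool the samples from two classes, build the Euclidean MST, and count edges whose endpoints come from different classes. This is a hedged sketch of the idea only; the exact estimator and the constants in the Bayes-error bounds follow the paper, not this code.

```python
# Sketch of the Friedman-Rafsky cross-edge count on the pooled Euclidean MST.
# Well-separated classes yield few cross-class edges; overlapping classes
# yield many, which is what makes the count informative about Bayes error.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def cross_class_mst_edges(x0, x1):
    pooled = np.vstack([x0, x1])
    labels = np.r_[np.zeros(len(x0)), np.ones(len(x1))]
    # Dense pairwise Euclidean distance matrix of the pooled sample.
    diff = pooled[:, None, :] - pooled[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    mst = minimum_spanning_tree(dist).tocoo()
    # Count MST edges whose endpoints carry different class labels.
    return int((labels[mst.row] != labels[mst.col]).sum())

rng = np.random.default_rng(0)
overlapping = cross_class_mst_edges(rng.normal(0, 1, (50, 2)),
                                    rng.normal(0.1, 1, (50, 2)))
separated = cross_class_mst_edges(rng.normal(0, 1, (50, 2)),
                                  rng.normal(8, 1, (50, 2)))
print(overlapping > separated)  # overlap produces more cross-class edges
```

Ranking class pairs by such counts is what lets the method order the binary subproblems by difficulty when building the classification tree.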