Goto

Collaborating Authors

 compl


Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs

Fan, Dongyang, Sabolčec, Vinko, Ansaripour, Matin, Tarun, Ayush Kumar, Jaggi, Martin, Bosselut, Antoine, Schlag, Imanol

arXiv.org Artificial Intelligence

The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions. Our website is available at https://data-compliance.github.io/.


CFIRE: A General Method for Combining Local Explanations

Müller, Sebastian, Toborek, Vanessa, Horváth, Tamás, Bauckhage, Christian

arXiv.org Artificial Intelligence

We propose a novel eXplainable AI algorithm to compute faithful, easy-to-understand, and complete global decision rules from local explanations for tabular data by combining XAI methods with closed frequent itemset mining. Our method can be used with any local explainer that indicates which dimensions are important for a given sample for a given black-box decision. This property allows our algorithm to choose among different local explainers, addressing the disagreement problem, \ie the observation that no single explanation method consistently outperforms others across models and datasets. Unlike usual experimental methodology, our evaluation also accounts for the Rashomon effect in model explainability. To this end, we demonstrate the robustness of our approach in finding suitable rules for nearly all of the 700 black-box models we considered across 14 benchmark datasets. The results also show that our method exhibits improved runtime, high precision and F1-score while generating compact and complete rules.


Complete Approximations of Incomplete Queries

Corman, Julien, Nutt, Werner, Savković, Ognjen

arXiv.org Artificial Intelligence

This paper studies the completeness of conjunctive queries over a partially complete database and the approximation of incomplete queries. Given a query and a set of completeness rules (a special kind of tuple generating dependencies) that specify which parts of the database are complete, we investigate whether the query can be fully answered, as if all data were available. If not, we explore reformulating the query into either Maximal Complete Specializations (MCSs) or the (unique up to equivalence) Minimal Complete Generalization (MCG) that can be fully answered, that is, the best complete approximations of the query from below or above in the sense of query containment. We show that the MSG can be characterized as the least fixed-point of a monotonic operator in a preorder. Then, we show that an MCS can be computed by recursive backward application of completeness rules. We study the complexity of both problems and discuss implementation techniques that rely on an ASP and Prolog engines, respectively.

  Country: Europe > Italy > Trentino-Alto Adige/Südtirol > South Tyrol (0.04)
  Genre: Research Report (0.84)
  Industry: Education (0.46)

Repetition Improves Language Model Embeddings

Springer, Jacob Mitchell, Kotha, Suhas, Fried, Daniel, Neubig, Graham, Raghunathan, Aditi

arXiv.org Artificial Intelligence

Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data.


MRI Scan Synthesis Methods based on Clustering and Pix2Pix

Baldini, Giulia, Schmidt, Melanie, Zäske, Charlotte, Caldeira, Liliana L.

arXiv.org Artificial Intelligence

We consider a missing data problem in the context of automatic segmentation methods for Magnetic Resonance Imaging (MRI) brain scans. Usually, automated MRI scan segmentation is based on multiple scans (e.g., T1-weighted, T2-weighted, T1CE, FLAIR). However, quite often a scan is blurry, missing or otherwise unusable. We investigate the question whether a missing scan can be synthesized. We exemplify that this is in principle possible by synthesizing a T2-weighted scan from a given T1-weighted scan. Our first aim is to compute a picture that resembles the missing scan closely, measured by average mean squared error (MSE). We develop/use several methods for this, including a random baseline approach, a clustering-based method and pixel-to-pixel translation method by (Pix2Pix) which is based on conditional GANs. The lowest MSE is achieved by our clustering-based method. Our second aim is to compare the methods with respect to the affect that using the synthesized scan has on the segmentation process. For this, we use a DeepMedic model trained with the four input scan modalities named above. We replace the T2-weighted scan by the synthesized picture and evaluate the segmentations with respect to the tumor identification, using Dice scores as numerical evaluation. The evaluation shows that the segmentation works well with synthesized scans (in particular, with Pix2Pix methods) in many cases.


Differentiable Constrained Imitation Learning for Robot Motion Planning and Control

Diehl, Christopher, Adamek, Janis, Krüger, Martin, Hoffmann, Frank, Bertram, Torsten

arXiv.org Artificial Intelligence

Motion planning and control are crucial components of robotics applications like automated driving. Here, spatio-temporal hard constraints like system dynamics and safety boundaries (e.g., obstacles) restrict the robot's motions. Direct methods from optimal control solve a constrained optimization problem. However, in many applications finding a proper cost function is inherently difficult because of the weighting of partially conflicting objectives. On the other hand, Imitation Learning (IL) methods such as Behavior Cloning (BC) provide an intuitive framework for learning decision-making from offline demonstrations and constitute a promising avenue for planning and control in complex robot applications. Prior work primarily relied on soft constraint approaches, which use additional auxiliary loss terms describing the constraints. However, catastrophic safety-critical failures might occur in out-of-distribution (OOD) scenarios. This work integrates the flexibility of IL with hard constraint handling in optimal control. Our approach constitutes a general framework for constraint robotic motion planning and control, as well as traffic agent simulation, whereas we focus on mobile robot and automated driving applications. Hard constraints are integrated into the learning problem in a differentiable manner, via explicit completion and gradient-based correction. Simulated experiments of mobile robot navigation and automated driving provide evidence for the performance of the proposed method.


Federated Learning Hyper-Parameter Tuning from a System Perspective

Zhang, Huanle, Fu, Lei, Zhang, Mi, Hu, Pengfei, Cheng, Xiuzhen, Mohapatra, Prasant, Liu, Xin

arXiv.org Artificial Intelligence

Federated learning (FL) is a distributed model training paradigm that preserves clients' data privacy. It has gained tremendous attention from both academia and industry. FL hyper-parameters (e.g., the number of selected clients and the number of training passes) significantly affect the training overhead in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters imposes a heavy burden on FL practitioners because applications have different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements in FL training. FedTune iteratively adjusts FL hyper-parameters during FL training and can be easily integrated into existing FL systems. Through extensive evaluations of FedTune for diverse applications and FL aggregation algorithms, we show that FedTune is lightweight and effective, achieving 8.48%-26.75% system overhead reduction compared to using fixed FL hyper-parameters. This paper assists FL practitioners in designing high-performance FL training solutions. The source code of FedTune is available at https://github.com/DataSysTech/FedTune.


FedTune: Automatic Tuning of Federated Learning Hyper-Parameters from System Perspective

Zhang, Huanle, Zhang, Mi, Liu, Xin, Mohapatra, Prasant, DeLucia, Michael

arXiv.org Artificial Intelligence

Federated learning (FL) hyper-parameters significantly affect the training overheads in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters puts a high burden on FL practitioners since various applications prefer different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements of FL training. FedTune is lightweight and flexible, achieving 8.48%-26.75% improvement for different datasets compared to fixed FL hyper-parameters.


A Notion of Individual Fairness for Clustering

Kleindessner, Matthäus, Awasthi, Pranjal, Morgenstern, Jamie

arXiv.org Machine Learning

A common distinction in fair machine learning, in particular in fair classification, is between group fairness and individual fairness. In the context of clustering, group fairness has been studied extensively in recent years; however, individual fairness for clustering has hardly been explored. In this paper, we propose a natural notion of individual fairness for clustering. Our notion asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. We study several questions related to our proposed notion of individual fairness. On the negative side, we show that deciding whether a given data set allows for such an individually fair clustering in general is NP-hard. On the positive side, for the special case of a data set lying on the real line, we propose an efficient dynamic programming approach to find an individually fair clustering. For general data sets, we investigate heuristics aimed at minimizing the number of individual fairness violations and compare them to standard clustering approaches on real data sets.


Vicious Circle Principle and Logic Programs with Aggregates

Gelfond, Michael, Zhang, Yuanlin

arXiv.org Artificial Intelligence

The paper presents a knowledge representation language $\mathcal{A}log$ which extends ASP with aggregates. The goal is to have a language based on simple syntax and clear intuitive and mathematical semantics. We give some properties of $\mathcal{A}log$, an algorithm for computing its answer sets, and comparison with other approaches.