inst
sudoLLM: On Multi-role Alignment of Language Models
Saha, Soumadeep, Chaturvedi, Akshay, Mahapatra, Joy, Garain, Utpal
User authorization-based access privileges are a key feature in many safety-critical systems, but have not been extensively studied in the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, resistance to prefix-based jailbreaking attacks, and ``fails-closed''. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (8 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government (0.93)
Proceedings of the 2025 XCSP3 Competition
Audemard, Gilles, Lecoutre, Christophe, Lonca, Emmanuel
Competition 2025, following those published in 2022 [2], 2023 [3], and 2024 [4]. The website containing all detailed results of this international competition is available at: https://www.cril.univ-artois.fr/XCSP25 The organization of this 2025 competition involved the following tasks: adjusting general details (dates, tracks, .. . These instances can be found in this archive. Some (usually minor) differences may exist when compiling the models presented in this document and those that can be found in this archive. Remember that the complete description, Version 3.2, of the format (XCSP For the 2025 competition, 33 problems have been selected. They are succinctly presented in Table 1.1. For each problem, the type of the involved (global) constraints is indicated. At this point, do note that making a good selection of problems/instances is a difficult task. When table is followed by (), it means that starred tables are involved. It is always interesting to see how constraint solvers behave when the instances of a problem become harder and harder. This is what we call the scaling behavior of solvers.
- Europe > Austria > Styria > Graz (0.04)
- North America > United States > Kansas (0.04)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- (42 more...)
- Information Technology > Security & Privacy (0.46)
- Leisure & Entertainment > Sports (0.45)
Are Humans as Brittle as Large Language Models?
Li, Jiahui, Papay, Sean, Klinger, Roman
The output of large language models (LLMs) is unstable, due both to non-determinism of the decoding process as well as to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to prompt changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (14 more...)
Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Zhang, Kuan, Chai, Chengliang, Xu, Jingzhe, Zhang, Chi, Han, Han, Yuan, Ye, Wang, Guoren, Cao, Lei
Recent studies indicate that deep neural networks degrade in generalization performance under noisy supervision. Existing methods focus on isolating clean subsets or correcting noisy labels, facing limitations such as high computational costs, heavy hyperparameter tuning process, and coarse-grained optimization. To address these challenges, we propose a novel two-stage noisy learning framework that enables instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning. To obtain stable and accurate information about noise modeling, we introduce a simple yet effective metric, termed wrong event, which dynamically models the cleanliness and difficulty of individual samples while maintaining computational costs. Our framework first collects wrong event information and builds a strong base model. Then we perform noise-robust training on the base model, using a probabilistic model to handle the wrong event information of samples. Experiments on five synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in computational time and improves model scalability.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Arizona (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
Li, Yanshu, Yang, Jianjiang, Yun, Tian, Feng, Pinyuan, Huang, Jinfa, Tang, Ruixiang
Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.
IQNN-CS: Interpretable Quantum Neural Network for Credit Scoring
Khan, Abdul Samad, Innan, Nouhaila, Khalique, Aeysha, Shafique, Muhammad
Credit scoring is a high-stakes task in financial services, where model decisions directly impact individuals' access to credit and are subject to strict regulatory scrutiny. While Quantum Machine Learning (QML) offers new computational capabilities, its black-box nature poses challenges for adoption in domains that demand transparency and trust. In this work, we present IQNN-CS, an interpretable quantum neural network framework designed for multiclass credit risk classification. The architecture combines a variational QNN with a suite of post-hoc explanation techniques tailored for structured data. To address the lack of structured interpretability in QML, we introduce Inter-Class Attribution Alignment (ICAA), a novel metric that quantifies attribution divergence across predicted classes, revealing how the model distinguishes between credit risk categories. Evaluated on two real-world credit datasets, IQNN-CS demonstrates stable training dynamics, competitive predictive performance, and enhanced interpretability. Our results highlight a practical path toward transparent and accountable QML models for financial decision-making.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New York (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Asia > Pakistan > Punjab > Lahore Division > Lahore (0.04)
Generalist vs Specialist Time Series Foundation Models: Investigating Potential Emergent Behaviors in Assessing Human Health Using PPG Signals
Kataria, Saurabh, Wu, Yi, Chen, Zhaoliang, Kwak, Hyunjung Gloria, Xu, Yuhao, Panchumarthi, Lovely Yeswanth, Xiao, Ran, Lu, Jiaying, Ermis, Ayca, Zhao, Anni, Yan, Runze, Federov, Alex, Liu, Zewen, Wu, Xu, Jin, Wei, Yang, Carl, Grunwell, Jocelyn, Brown, Stephanie R., Shah, Amit, Jabaley, Craig, Buchman, Tim, Bhavani, Sivasubramanium V, Lee, Randall J., Hu, Xiao
Foundation models are large-scale machine learning models that are pre-trained on massive amounts of data and can be adapted for various downstream tasks. They have been extensively applied to tasks in Natural Language Processing and Computer Vision with models such as GPT, BERT, and CLIP. They are now also increasingly gaining attention in time-series analysis, particularly for physiological sensing. However, most time series foundation models are specialist models - with data in pre-training and testing of the same type, such as Electrocardiogram, Electroencephalogram, and Photoplethysmogram (PPG). Recent works, such as MOMENT, train a generalist time series foundation model with data from multiple domains, such as weather, traffic, and electricity. This paper aims to conduct a comprehensive benchmarking study to compare the performance of generalist and specialist models, with a focus on PPG signals. Through an extensive suite of total 51 tasks covering cardiac state assessment, laboratory value estimation, and cross-modal inference, we comprehensively evaluate both models across seven dimensions, including win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability. These metrics jointly capture not only the models' capability but also their adaptability, robustness, and efficiency under different fine-tuning strategies, providing a holistic understanding of their strengths and limitations for diverse downstream scenarios. In a full-tuning scenario, we demonstrate that the specialist model achieves a 27% higher win score. Finally, we provide further analysis on generalization, fairness, attention visualizations, and the importance of training data choice.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Czechia > South Moravian Region > Brno (0.04)
- (6 more...)
- Health & Medicine > Therapeutic Area > Hematology (1.00)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.87)
BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models
Srivastava, Gaurav, Hussain, Aafiya, Bi, Zhenyu, Roy, Swastik, Pitre, Priya, Lu, Meng, Ziyadi, Morteza, Wang, Xuan
Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a decline of 16.81%, 28.05%, and 47.59% accuracy on the hard suite. Our leaderboard is publicly available at https://ctrl-gaurav.github.io/BeyondBench/
- Europe > Austria > Vienna (0.13)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- North America > United States > Virginia (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors
Sadallah, Abdelrahman, Baumgärtner, Tim, Gurevych, Iryna, Briscoe, Ted
Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Singapore (0.04)
- (12 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Automated Triaging and Transfer Learning of Incident Learning Safety Reports Using Large Language Representational Models
Beidler, Peter, Nguyen, Mark, Lybarger, Kevin, Holmberg, Ola, Ford, Eric, Kang, John
PURPOSE: Incident reports are an important tool for safety and quality improvement in healthcare, but manual review is time-consuming and requires subject matter expertise. Here we present a natural language processing (NLP) screening tool to detect high-severity incident reports in radiation oncology across two institutions. METHODS AND MATERIALS: We used two text datasets to train and evaluate our NLP models: 7,094 reports from our institution (Inst.), and 571 from IAEA SAFRON (SF), all of which had severity scores labeled by clinical content experts. We trained and evaluated two types of models: baseline support vector machines (SVM) and BlueBERT which is a large language model pretrained on PubMed abstracts and hospitalized patient data. We assessed for generalizability of our model in two ways. First, we evaluated models trained using Inst.-train on SF-test. Second, we trained a BlueBERT_TRANSFER model that was first fine-tuned on Inst.-train then on SF-train before testing on SF-test set. To further analyze model performance, we also examined a subset of 59 reports from our Inst. dataset, which were manually edited for clarity. RESULTS Classification performance on the Inst. test achieved AUROC 0.82 using SVM and 0.81 using BlueBERT. Without cross-institution transfer learning, performance on the SF test was limited to an AUROC of 0.42 using SVM and 0.56 using BlueBERT. BlueBERT_TRANSFER, which was fine-tuned on both datasets, improved the performance on SF test to AUROC 0.78. Performance of SVM, and BlueBERT_TRANSFER models on the manually curated Inst. reports (AUROC 0.85 and 0.74) was similar to human performance (AUROC 0.81). CONCLUSION: In summary, we successfully developed cross-institution NLP models on incident report text from radiation oncology centers. These models were able to detect high-severity reports similarly to humans on a curated dataset.
- Oceania > Australia (0.04)
- North America > United States > Virginia (0.04)
- North America > Canada (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.94)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Energy > Power Industry > Utilities > Nuclear (0.34)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)