Feature-Function Curvature Analysis: A Geometric Framework for Explaining Differentiable Models

Najafi, Hamed, Luo, Dongsheng, Liu, Jason

arXiv.org Artificial Intelligence

Explainable AI (XAI) is critical for building trust in complex machine learning models, yet mainstream attribution methods often provide an incomplete, static picture of a model's final state. By collapsing a feature's role into a single score, they are confounded by non-linearity and interactions. To address this, we introduce Feature-Function Curvature Analysis (FFCA), a novel framework that analyzes the geometry of a model's learned function. FFCA produces a 4-dimensional signature for each feature, quantifying its: (1) Impact, (2) Volatility, (3) Non-linearity, and (4) Interaction. Crucially, we extend this framework into Dynamic Archetype Analysis, which tracks the evolution of these signatures throughout the training process. This temporal view moves beyond explaining what a model learned to revealing how it learns. We provide the first direct, empirical evidence of hierarchical learning, showing that models consistently learn simple linear effects before complex interactions. Furthermore, this dynamic analysis provides novel, practical diagnostics for identifying insufficient model capacity and predicting the onset of overfitting. Our comprehensive experiments demonstrate that FFCA, through its static and dynamic components, provides the essential geometric context that transforms model explanation from simple quantification to a nuanced, trustworthy analysis of the entire learning process.
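The four-number signature described above can be sketched with finite differences: average gradient magnitude for Impact, gradient spread for Volatility, second-derivative magnitude for Non-linearity, and mixed-partial mass for Interaction. These estimators are illustrative assumptions, not the paper's actual FFCA formulation, and `ffca_signature` is a hypothetical name.

```python
import numpy as np

def ffca_signature(f, X, j, eps=1e-4):
    """Illustrative 4-number signature for feature j of a scalar-output
    model f over samples X. Uses central finite differences; the real
    FFCA estimators may differ."""
    grads, curvs, inters = [], [], []
    for x in X:
        e_j = np.zeros_like(x); e_j[j] = eps
        g = (f(x + e_j) - f(x - e_j)) / (2 * eps)          # df/dx_j
        h = (f(x + e_j) - 2 * f(x) + f(x - e_j)) / eps**2  # d^2f/dx_j^2
        grads.append(g); curvs.append(h)
        cross = 0.0                      # mixed partials -> interaction proxy
        for k in range(len(x)):
            if k == j:
                continue
            e_k = np.zeros_like(x); e_k[k] = eps
            m = (f(x + e_j + e_k) - f(x + e_j - e_k)
                 - f(x - e_j + e_k) + f(x - e_j - e_k)) / (4 * eps**2)
            cross += abs(m)
        inters.append(cross)
    grads = np.array(grads)
    return {
        "impact":       float(np.mean(np.abs(grads))),  # average sensitivity
        "volatility":   float(np.std(grads)),           # spread of sensitivity
        "nonlinearity": float(np.mean(np.abs(curvs))),  # average curvature
        "interaction":  float(np.mean(inters)),         # mixed-partial mass
    }
```

For a purely linear feature (e.g., `3*x0`) the signature collapses to Impact only, while a bilinear term such as `x1*x2` shows up in the Interaction component, matching the intuition the abstract describes.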


AutoCode: LLMs as Problem Setters for Competitive Programming

Zhou, Shang, Zheng, Zihan, Liu, Kaiyuan, Shen, Zeyu, Cheng, Zerui, Chen, Zexing, He, Hansen, Yao, Jianzhu, Mao, Huanzhi, Mang, Qiuyang, Fu, Tianfu, Li, Beichen, Li, Dongruixuan, Chai, Wenhao, Liu, Zhuang, Korolova, Aleksandra, Henderson, Peter, Jaques, Natasha, Viswanath, Pramod, Xie, Saining, Shang, Jingbo

arXiv.org Artificial Intelligence

Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.
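The cross-verification filter mentioned above can be sketched as follows: run the reference and brute-force solutions on every generated test input and discard the problem on any disagreement. This is a minimal sketch of the filtering idea with stand-in callables; the real AutoCode pipeline executes full programs under a judge.

```python
def cross_verify(reference, brute_force, test_inputs):
    """Keep a generated problem only if the reference and brute-force
    solutions agree on every test input; report the first disagreement."""
    for case in test_inputs:
        if reference(case) != brute_force(case):
            return False, case   # disagreement -> malformed problem
    return True, None

# Example pair of solutions for maximum-subarray-sum:
def brute_max_sub(a):
    # O(n^2) enumeration of all non-empty subarrays
    return max(sum(a[i:j]) for i in range(len(a))
               for j in range(i + 1, len(a) + 1))

def kadane(a):
    # O(n) reference solution
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best
```

A buggy "solution" (say, one that just sums the whole array) is caught the moment any generated case exposes the disagreement.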



SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Lu, Quanfeng, Ma, Zhantao, Zhong, Shuai, Wang, Jin, Yu, Dahai, Ng, Michael K., Luo, Ping

arXiv.org Artificial Intelligence

The rapid advancement of large vision-language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
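The staged, interleaved schedule described above — one agent updated per stage while the others stay frozen — can be sketched as a plain loop. `update_one` is a stand-in for any single-agent RL step (e.g., a PPO update for the Navigator or Interactor); the structure, not the learner, is the point.

```python
def swirl_train(agents, update_one, rounds=3):
    """Staged interleaved training (sketch): multi-agent RL is reformulated
    as a sequence of single-agent updates. In each stage exactly one agent
    trains while the rest are held fixed, which is what gives SWIRL-style
    schedules their stability."""
    for _ in range(rounds):
        for i in range(len(agents)):
            frozen = [a for k, a in enumerate(agents) if k != i]
            agents[i] = update_one(agents[i], frozen)  # only agent i changes
    return agents
```

With two agents (Navigator, Interactor) each round is two stages, so the cross-round improvement guarantees in the paper apply between full sweeps of this outer loop.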


KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Yu, Zhuohao, Gao, Chang, Yao, Wenjin, Wang, Yidong, Ye, Wei, Wang, Jindong, Xie, Xing, Zhang, Yue, Zhang, Shikun

arXiv.org Artificial Intelligence

Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered "interactor" role for the first time to achieve dynamic, contamination-resilient evaluation. Starting with a question from a conventional LLM benchmark involving domain-specific knowledge, KIEval uses dynamically generated, multi-round, knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates deep comprehension and the ability to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination contributes nothing to, and may even harm, models' real-world applicability and understanding, and that existing contamination detection methods for LLMs can only identify contamination introduced during pre-training, not during supervised fine-tuning.
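The multi-round interaction above can be sketched as a simple dialogue loop: the interactor reads the transcript so far and generates a knowledge-focused follow-up, so that a memorized benchmark answer is not enough to keep scoring well. `interactor` and `candidate` are stand-in callables, not the paper's actual prompts or models.

```python
def kieval_dialogue(interactor, candidate, seed_question, rounds=3):
    """Sketch of a KIEval-style interactive evaluation: starting from a
    benchmark seed question, an LLM 'interactor' probes deeper each round
    based on the transcript so far. Returns the (question, answer) log,
    which a separate judge would then score."""
    transcript = []
    question = seed_question
    for _ in range(rounds):
        answer = candidate(question, transcript)  # model under test replies
        transcript.append((question, answer))
        question = interactor(transcript)         # follow-up probes deeper
    return transcript
```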


Interactively Learning Social Media Representations Improves News Source Factuality Detection

Mehta, Nikhil, Goldwasser, Dan

arXiv.org Artificial Intelligence

The rise of social media has enabled the widespread propagation of fake news, text published with an intent to spread misinformation and sway beliefs. Rapidly detecting fake news, especially as new events arise, is important to prevent misinformation. While prior works have tackled this problem with supervised learning systems, automatically modeling the complexities of the social media landscape that enables the spread of fake news is challenging. At the same time, having humans fact-check all news is not scalable. Thus, in this paper, we propose to approach this problem interactively, where humans interact with an automated system to help it learn a better social media representation. On real-world events, our experiments show performance improvements in detecting the factuality of news sources, even after only a few human interactions.
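One way to picture the human-in-the-loop idea above is an uncertainty-driven query loop: ask a human only about the sources the model is least confident about, and fold the answer back into the representation. The confidence heuristic, the dictionary-based "graph", and all names here are illustrative assumptions, not the paper's actual architecture.

```python
def interactive_update(graph, classifier_confidence, ask_human, threshold=0.6):
    """Human-in-the-loop sketch: query a human fact-checker only about
    low-confidence sources, then write the answer back into the source
    representation so later training can use it."""
    queried = {}
    for source, conf in classifier_confidence.items():
        if conf < threshold:                  # uncertain -> worth a question
            queried[source] = ask_human(source)
            graph[source]["factual"] = queried[source]
    return graph, queried
```

Because only uncertain sources trigger a question, the human effort stays small — consistent with the abstract's claim of gains "after only a few human interactions".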


Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Yang, Qian, Li, Yunxin, Hu, Baotian, Ma, Lin, Ding, Yuxing, Zhang, Min

arXiv.org Artificial Intelligence

Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence explaining the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image, ignoring the higher-level semantic alignment between phrases (chunks) and visual contents that is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference, so the generated explanations are less faithful to the visual-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (CSI), a relation inferrer, and a Lexical Constraint-aware Generator (LeCG). Specifically, CSI exploits the sentence structure inherent in language together with various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG uses lexical constraints to explicitly incorporate the words or chunks focused on by the relation inferrer into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and the results indicate that CALeC significantly outperforms competing models in inference accuracy and in the quality of generated explanations.
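The lexical-constraint idea in LeCG can be illustrated with a post-hoc filter: among candidate explanations, prefer those containing every word or chunk the relation inferrer attended to. The real LeCG enforces constraints during decoding; this simplification only shows the selection criterion.

```python
def lexically_constrained(candidates, constraints):
    """Sketch of the lexical-constraint idea: keep candidate explanations
    that mention every attended word/chunk, falling back to the full
    candidate list if none satisfies all constraints."""
    satisfying = [c for c in candidates
                  if all(term.lower() in c.lower() for term in constraints)]
    return satisfying or candidates   # graceful fallback
```

Tying the surviving explanation to the inferrer's attended chunks is what the abstract means by making explanations more faithful to the reasoning that produced the entailment label.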


On the Linguistic and Computational Requirements for Creating Face-to-Face Multimodal Human-Machine Interaction

Ranhel, João, de Lima, Cacilda Vilela

arXiv.org Artificial Intelligence

In this study, conversations between humans and avatars are analyzed linguistically, organizationally, and structurally, focusing on what is necessary to create face-to-face multimodal interfaces for machines. We video-recorded thirty-four human-avatar interactions, performed complete linguistic microanalysis on video excerpts, and marked all occurrences of multimodal actions and events. Statistical inference was applied to the data, allowing us to comprehend not only how often multimodal actions occur but also how multimodal events are distributed between the speaker (emitter) and the listener (recipient). We also observed the distribution of multimodal occurrences for each modality. The data show evidence that double-loop feedback is established during a face-to-face conversation. This led us to propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the frameworks used to describe human-machine multimodal interactions. Face-to-face interfaces require an additional control layer above the multimodal fusion layer. This layer has to organize the flow of conversation, integrate the social context into the interaction, and plan 'what' and 'how' to advance the interaction. This higher level is best understood if we incorporate insights from CA and ToM into the interface system.


AutoRec: An Automated Recommender System

Wang, Ting-Hsiang, Song, Qingquan, Han, Xiaotian, Liu, Zirui, Jin, Haifeng, Hu, Xia

arXiv.org Machine Learning

For example, NCF [8] takes user-item implicit feedback data as input for the rating prediction task, and DeepFM [6] leverages both numerical and categorical data for the CTR prediction task. However, a high degree of specialization comes at the expense of model adaptability and tuning complexity. As recommendation tasks evolve over time and additional types of data are collected, the originally apt model can either become obsolete or require tremendous tuning effort. So far, several pipelines for recommender systems, e.g., OpenRec [16] and SMORe [4], have tried to address the adaptability issue by providing modular base blocks that can be selected according to the context of recommendation. Nevertheless, neither determining which blocks to use nor tuning the model parameters is straightforward when facing new data and changing tasks. To bridge the gap, we present AutoRec, which aims to provide an end-to-end solution for automating model selection and hyperparameter tuning. While many AutoML libraries, such as Auto-Sklearn [5] and TPOT [12], have shown promising results in general-purpose machine learning tasks (e.g., regression and hyperparameter tuning) and