Overview
Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective
Wang, Qiaosi, Zhou, Xuhui, Sap, Maarten, Forlizzi, Jodi, Shen, Hong
The last couple of years have witnessed emerging research that appropriates Theory-of-Mind (ToM) tasks designed for humans to benchmark LLM's ToM capabilities as an indication of LLM's social intelligence. However, this approach has a number of limitations. Drawing on existing psychology and AI literature, we summarize the theoretical, methodological, and evaluation limitations by pointing out that certain issues are inherently present in the original ToM tasks used to evaluate human's ToM, which continues to persist and exacerbated when appropriated to benchmark LLM's ToM. Taking a human-computer interaction (HCI) perspective, these limitations prompt us to rethink the definition and criteria of ToM in ToM benchmarks in a more dynamic, interactional approach that accounts for user preferences, needs, and experiences with LLMs in such evaluations. We conclude by outlining potential opportunities and challenges towards this direction.
NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results
Fu, Yuqian, Qiu, Xingyu, Ren, Bin, Fu, Yanwei, Timofte, Radu, Sebe, Nicu, Yang, Ming-Hsuan, Van Gool, Luc, Zhang, Kaijin, Nong, Qingpeng, Dong, Xiugang, Gao, Hong, Zhou, Xiangsheng, Pan, Jiancheng, Liu, Yanxing, He, Xiao, Li, Jiahao, Sun, Yuze, Huang, Xiaomeng, Zhang, Zhenyu, Ma, Ran, Liu, Yuhan, Zhuang, Zijian, Yi, Shuai, Zou, Yixiong, Hong, Lingyi, Chen, Mingxi, Li, Runze, Sheng, Xingdong, Zhang, Wenqiang, Chen, Weisen, Yan, Yongxin, Chen, Xinguo, Shao, Yuanjie, Zuo, Zhengrong, Sang, Nong, Wu, Hao, Sun, Haoran, Hu, Shuming, Zhang, Yan, Shi, Zhiguang, Zhang, Yu, Chen, Chao, Wang, Tao, Feng, Da, Zhuo, Linhai, Lin, Ziming, Huang, Yali, Me, Jie, Yang, Yiming, Guo, Mi, Jiu, Mingyuan, Xu, Mingliang, Xiong, Maomao, Zhang, Qunshu, Cao, Xinyu, Yang, Yuqing, Sheng, Dianmo, Zhao, Xuanpu, Li, Zhiyu, Ding, Xuyang, Li, Wenqian
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registered participants, received submissions from 42 teams, and concluded with 13 teams making valid final submissions. Participants approached the task from diverse perspectives, proposing novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings. In this report, we present an overview of the 1st NTIRE 2025 CD-FSOD Challenge, highlighting the proposed solutions and summarizing the results submitted by the participants.
Explainable Artificial Intelligence techniques for interpretation of food datasets: a review
Arrighi, Leonardo, de Moraes, Ingrid Alves, Zullich, Marco, Simonato, Michele, Barbin, Douglas Fernandes, Junior, Sylvio Barbon
Artificial Intelligence (AI) has become essential for analyzing complex data and solving highly-challenging tasks. It is being applied across numerous disciplines beyond computer science, including Food Engineering, where there is a growing demand for accurate and trustworthy predictions to meet stringent food quality standards. However, this requires increasingly complex AI models, raising reliability concerns. In response, eXplainable AI (XAI) has emerged to provide insights into AI decision-making, aiding model interpretation by developers and users. Nevertheless, XAI remains underutilized in Food Engineering, limiting model reliability. For instance, in food quality control, AI models using spectral imaging can detect contaminants or assess freshness levels, but their opaque decision-making process hinders adoption. XAI techniques such as SHAP (Shapley Additive Explanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) can pinpoint which spectral wavelengths or image regions contribute most to a prediction, enhancing transparency and aiding quality control inspectors in verifying AI-generated assessments. This survey presents a taxonomy for classifying food quality research using XAI techniques, organized by data types and explanation methods, to guide researchers in choosing suitable approaches. We also highlight trends, challenges, and opportunities to encourage the adoption of XAI in Food Engineering.
LLM$\times$MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources
Wang, Haoyu, Fu, Yujia, Zhang, Zhu, Wang, Shuo, Ren, Zirui, Wang, Xiaorong, Li, Zhili, He, Chaoqun, An, Bo, Liu, Zhiyuan, Sun, Maosong
Long-form generation is crucial for a wide range of practical applications, typically categorized into short-to-long and long-to-long generation. While short-to-long generations have received considerable attention, generating long texts from extremely long resources remains relatively underexplored. The primary challenge in long-to-long generation lies in effectively integrating and analyzing relevant information from extensive inputs, which remains difficult for current large language models (LLMs). In this paper, we propose LLM$\times$MapReduce-V2, a novel test-time scaling strategy designed to enhance the ability of LLMs to process extremely long inputs. Drawing inspiration from convolutional neural networks, which iteratively integrate local features into higher-level global representations, LLM$\times$MapReduce-V2 utilizes stacked convolutional scaling layers to progressively expand the understanding of input materials. Both quantitative and qualitative experimental results demonstrate that our approach substantially enhances the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines. Both LLM$\times$MapReduce-V2 and SurveyEval are publicly available at https://github.com/thunlp/LLMxMapReduce .
Towards Interpretable Deep Generative Models via Causal Representation Learning
Moran, Gemma E., Aragam, Bryon
Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods' surprising performance is due in part to their ability to learn implicit "representations'' of complex, multi-modal data. Unfortunately, deep neural networks are notoriously black boxes that obscure these representations, making them difficult to interpret or analyze. To resolve these difficulties, one approach is to build new interpretable neural network models from the ground up. This is the goal of the emerging field of causal representation learning (CRL) that uses causality as a vector for building flexible, interpretable, and transferable generative AI. CRL can be seen as a culmination of three intrinsically statistical problems: (i) latent variable models such as factor analysis; (ii) causal graphical models with latent variables; and (iii) nonparametric statistics and deep learning. This paper reviews recent progress in CRL from a statistical perspective, focusing on connections to classical models and statistical and causal identifiablity results. This review also highlights key application areas, implementation strategies, and open statistical questions in CRL.
Hessian stability and convergence rates for entropic and Sinkhorn potentials via semiconcavity
Greco, Giacomo, Tamanini, Luca
In this paper we determine quantitative stability bounds fo r the Hessian of entropic potentials, i.e., the dual solution to the entropic optimal transport proble m. Up to authors' knowledge this is the first work addressing this second-orde r quantitative stability estimate in general unbounded settings. Our proof strategy relies on se miconcavity properties of entropic potentials and on the representation of entropic transport plans as laws of forward and backward diffusion processes, known as Schr odinger bridges. Moreov er, our approach allows to deduce a stochastic proof of quantitative stability entropic estim ates and integrated gradient estimates as well. Finally, as a direct consequence of these stability bounds, we deduce exponential convergence rates for gradient and Hessian of Sinkhorn iter ates along Sinkhorn's algorithm, a problem that was still open in unbounded settings. Our rates have a polynomial dependence on the regularization parameter.
ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams
Continuous generation of streaming data from diverse sources, such as online transactions and digital interactions, necessitates timely fraud detection. Traditional batch processing methods often struggle to capture the rapidly evolving patterns of fraudulent activities. This paper highlights the critical importance of processing streaming data for effective fraud detection. To address the inherent challenges of latency, scalability, and concept drift in streaming environments, we propose a robust online streaming fraud detection (ROSFD) framework. Our proposed framework comprises two key stages: (i) Stage One: Offline Model Initialization. In this initial stage, a model is built in offline settings using incremental learning principles to overcome the "cold-start" problem. (ii) Stage Two: Real-time Model Adaptation. In this dynamic stage, drift detection algorithms (viz.,, DDM, EDDM, and ADWIN) are employed to identify concept drift in the incoming data stream and incrementally train the model accordingly. This "train-only-when-required" strategy drastically reduces the number of retrains needed without significantly impacting the area under the receiver operating characteristic curve (AUC). Overall, ROSFD utilizing ADWIN as the drift detection method demonstrated the best performance among the employed methods. In terms of model efficacy, Adaptive Random Forest consistently outperformed other models, achieving the highest AUC in four out of five datasets.
The Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance
Mohanty, Anwesha, Parthasarathy, Venkatesh Balavadhani, Shahid, Arsalan
Multimodal Large Language Models (MLLMs) are set to transform how machines process and generate human-like responses by integrating diverse modalities such as text, images, and code. Yet, effectively harnessing their capabilities hinges on optimal prompt engineering. We present a comprehensive experimental evaluation of seven prompt engineering methods applied to 13 open-source MLLMs over 24 tasks spanning Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. Our approach stratifies models by parameter count into Small (<4B), Medium (4B-10B), and Large (>10B) categories and compares prompting techniques including Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought. While Large MLLMs excel in structured tasks such as code generation, achieving accuracies up to 96.88% under Few-Shot prompting, all models struggle with complex reasoning and abstract understanding, often yielding accuracies below 60% and high hallucination rates. Structured reasoning prompts frequently increased hallucination up to 75% in small models and led to longer response times (over 20 seconds in Large MLLMs), while simpler prompting methods provided more concise and efficient outputs. No single prompting method uniformly optimises all task types. Instead, adaptive strategies combining example-based guidance with selective structured reasoning are essential to enhance robustness, efficiency, and factual accuracy. Our findings offer practical recommendations for prompt engineering and support more reliable deployment of MLLMs across applications including AI-assisted coding, knowledge retrieval, and multimodal content understanding.
Continual learning for rotating machinery fault diagnosis with cross-domain environmental and operational variations
Risca, Diogo, Lourenço, Afonso, Marreiros, Goreti
Although numerous machine learning models exist to detect issues like rolling bearing strain and deformation, typically caused by improper mounting, overloading, or poor lubrication, these models often struggle to isolate faults from the noise of real-world operational and environmental variability. Conditions such as variable loads, high temperatures, stress, and rotational speeds can mask early signs of failure, making reliable detection challenging. To address these limitations, this work proposes a continual deep learning approach capable of learning across domains that share underlying structure over time. This approach goes beyond traditional accuracy metrics by addressing four second-order challenges: catastrophic forgetting (where new learning overwrites past knowledge), lack of plasticity (where models fail to adapt to new data), forward transfer (using past knowledge to improve future learning), and backward transfer (refining past knowledge with insights from new domains). The method comprises a feature generator and domain-specific classifiers, allowing capacity to grow as new domains emerge with minimal interference, while an experience replay mechanism selectively revisits prior domains to mitigate forgetting. Moreover, nonlinear dependencies across domains are exploited by prioritizing replay from those with the highest prior errors, refining models based on most informative past experiences. Experiments show high average domain accuracy (up to 88.96%), with forgetting measures as low as .0027 across non-stationary class-incremental environments.
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
Gigant, Théo, Guinaudeau, Camille, Dufaux, Frédéric
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input against the raw video, and that a structured representation from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.