Overview
Keeping Medical AI Healthy and Trustworthy: A Review of Detection and Correction Methods for System Degradation
Guan, Hao, Bates, David, Zhou, Li
Artificial intelligence (AI) is increasingly integrated into modern healthcare, offering powerful support for clinical decision-making. However, in real-world settings, AI systems may experience performance degradation over time, due to factors such as shifting data distributions, changes in patient characteristics, evolving clinical protocols, and variations in data quality. These factors can compromise model reliability, posing safety concerns and increasing the likelihood of inaccurate predictions or adverse outcomes. This review presents a forward-looking perspective on monitoring and maintaining the "health" of AI systems in healthcare. We highlight the urgent need for continuous performance monitoring, early degradation detection, and effective self-correction mechanisms. The paper begins by reviewing common causes of performance degradation at both data and model levels. We then summarize key techniques for detecting data and model drift, followed by an in-depth look at root cause analysis. Correction strategies are further reviewed, ranging from model retraining to test-time adaptation. Our survey spans both traditional machine learning models and state-of-the-art large language models (LLMs), offering insights into their strengths and limitations. Finally, we discuss ongoing technical challenges and propose future research directions. This work aims to guide the development of reliable, robust medical AI systems capable of sustaining safe, long-term deployment in dynamic clinical settings.
SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting
Strobel, Svenja, Innmann, Matthias, Egger, Bernhard, Stamminger, Marc, Franke, Linus
LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.
Zero-Shot Instruction Following in RL via Structured LTL Representations
Giuri, Mattia, Jackermeier, Mathias, Abate, Alessandro
Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.
Spoken Conversational Agents with Large Language Models
Yang, Chao-Han Huck, Stolcke, Andreas, Heck, Larry
Building on this, we will examine joint text-speech pre-training (Chiu et al., 2022; Bar-rault et al., 2023; Chen et al., 2022) methods, This section will provide a comprehensive look at how state-of-the-art voice-interfaced LLMs (Reid et al., 2024; Chu et al., Current Trends The current work in AI virtual assistants builds upon the voice-only systems of the last decade by leveraging LLMs to significantly improve the coverage and robustness of the spoken language understanding and dialogue state tracking components, in addition to substantial advancements in spoken language generation. It highlights recent advancements in multi-turn dialogue systems, encompassing both LLM-based open-domain dialogue (ODD) and task-oriented dialogue (TOD) systems, as well as relevant datasets and evaluation metrics.
Sparse Computations in Deep Learning Inference
Tasou, Ioanna, Mpakos, Panagiotis, Vlachos, Angelos, Adamopoulos, Dionysios, Giannakopoulos, Georgios, Katsikopoulos, Konstantinos, Karaparisis, Ioannis, Lazou, Maria, Loukovitis, Spyridon, Mei, Areti, Poulopoulou, Anastasia, Dimitriou, Angeliki, Filandrianos, Giorgos, Galanopoulos, Dimitrios, Karampinis, Vasileios, Mitsouras, Ilias, Spanos, Nikolaos, Anastasiadis, Petros, Doudalis, Ioannis, Nikas, Konstantinos, Retsinas, George, Tzouveli, Paraskevi, Giannoula, Christina, Koziris, Nectarios, Papadopoulou, Nikela, Stamou, Giorgos, Voulodimos, Athanasios, Goumas, Georgios
The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands are also contributing in significant computational, energy and environmental footprints. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in productions.
nuScenes Revisited: Progress and Challenges in Autonomous Driving
Fong, Whye Kit, Liong, Venice Erin, Tan, Kok Seang, Caesar, Holger
Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization \& mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.
Opening the Black Box: Nowcasting Singapore's GDP Growth and its Explainability
Timely assessment of current conditions is essential especially for small, open economies such as Singapore, where external shocks transmit rapidly to domestic activity. We develop a real-time nowcasting framework for quarterly GDP growth using a high-dimensional panel of approximately 70 indicators, encompassing economic and financial indicators over 1990Q1-2023Q2. The analysis covers penalized regressions, dimensionality-reduction methods, ensemble learning algorithms, and neural architectures, benchmarked against a Random Walk, an AR(3), and a Dynamic Factor Model. The pipeline preserves temporal ordering through an expanding-window walk-forward design with Bayesian hyperparameter optimization, and uses moving block-bootstrap procedures both to construct prediction intervals and to obtain confidence bands for feature-importance measures. It adopts model-specific and XAI-based explainability tools. A Model Confidence Set procedure identifies statistically superior learners, which are then combined through simple, weighted, and exponentially weighted schemes; the resulting time-varying weights provide an interpretable representation of model contributions. Predictive ability is assessed via Giacomini-White tests. Empirical results show that penalized regressions, dimensionality-reduction models, and GRU networks consistently outperform all benchmarks, with RMSFE reductions of roughly 40-60%; aggregation delivers further gains. Feature-attribution methods highlight industrial production, external trade, and labor-market indicators as dominant drivers of Singapore's short-run growth dynamics.
Mirror, Mirror on the Wall -- Which is the Best Model of Them All?
Large Language Models (LLMs) have become one of the most transformative tools across many applications, as they have significantly boosted productivity and achieved impressive results in various domains such as finance, healthcare, education, telecommunications, and law, among others. Typically, state-of-the-art (SOTA) foundation models are developed by large corporations based on large data collections and substantial computational and financial resources required to pretrain such models from scratch. These foundation models then serve as the basis for further development and domain adaptation for specific use cases or tasks. However, given the dynamic and fast-paced nature of launching new foundation models, the process of selecting the most suitable model for a particular use case, application, or domain becomes increasingly complex. We argue that there are two main dimensions that need to be taken into consideration when selecting a model for further training: a qualitative dimension (which model is best suited for a task based on information, for instance, taken from model cards) and a quantitative dimension (which is the best performing model). The quantitative performance of models is assessed through leaderboards, which rank models based on standardized benchmarks and provide a consistent framework for comparing different LLMs. In this work, we address the analysis of the quantitative dimension by exploring the current leaderboards and benchmarks. To illustrate this analysis, we focus on the medical domain as a case study, demonstrating the evolution, current landscape, and practical significance of this quantitative evaluation dimension. Finally, we propose a Model Selection Methodology (MSM), a systematic approach designed to guide the navigation, prioritization, and selection of the model that best aligns with a given use case.
Deep Research: A Systematic Survey
Shi, Zhengliang, Chen, Yiqun, Li, Haitao, Sun, Weiwei, Ni, Shiyu, Lyu, Yougang, Fan, Run-Ze, Jin, Bowen, Weng, Yixuan, Zhu, Minjun, Xie, Qiujie, Guo, Xinyu, Yang, Qu, Wu, Jiayi, Zhao, Jujia, Tang, Xiaqiang, Ma, Xinbei, Wang, Cunxiang, Mao, Jiaxin, Ai, Qingyao, Huang, Jen-Tse, Wang, Wenxuan, Zhang, Yue, Yang, Yiming, Tu, Zhaopeng, Ren, Zhaochun
Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.
BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models
Fang, Yi, Xu, Haoran, Han, Jiaxin, Ding, Sirui, Wang, Yizhi, Wang, Yue, Wang, Xuan
Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying ``grammars'' inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.