Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

Chen, Zihan, Zhang, Yiming, Zhou, Hengguang, Ding, Zenghui, Sun, Yining, Hsieh, Cho-Jui

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) has emerged as a powerful paradigm for post-training Large Language Models (LLMs), significantly enhancing their capabilities on complex, multi-step reasoning tasks (Ouyang et al., 2022). Methods based on Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) (Rafailov et al., 2023) have become standard practice for aligning LLMs. These paradigms are often powered by foundational algorithms like Proximal Policy Optimization (PPO) (Schulman et al., 2017), with state-of-the-art variants such as Group Relative Policy Optimization (GRPO) (Shao et al., 2024) pushing models to achieve remarkable performance on benchmarks like GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). These successes, often marked by state-of-the-art results (Lewkowycz et al., 2022; Lightman et al., 2023), are widely interpreted as a significant leap forward, suggesting that RL-based alignment is a key pathway toward developing more general and robust machine reasoning systems. Despite impressive reported gains, a key question is whether current benchmarks still meaningfully assess generalization. Our findings suggest that the traditional assumption underlying benchmark design, that a model's ability to perform well on unseen test examples is sufficient to measure generalization, no longer holds for RL. We find that RL-based reasoning models trained on the training split achieve nearly the same performance as those trained directly on the test split, indicating that "unseen-ness" alone is no longer a challenging or discriminative criterion. This calls for a rethinking of evaluation: rather than relying solely on disjoint train/test splits, future benchmarks must incorporate settings that remain sensitive to deeper forms of generalization and can reveal weaknesses that simple data separation fails to expose.
To systematically investigate this phenomenon, we introduce a multi-faceted empirical framework designed not merely to measure performance, but to deconstruct it.


A Maslow-Inspired Hierarchy of Engagement with AI Model

Ogot, Madara

arXiv.org Artificial Intelligence

The rapid proliferation of artificial intelligence (AI) across industry, government, and education highlights the urgent need for robust frameworks to conceptualise and guide engagement. This paper introduces the Hierarchy of Engagement with AI model, a novel maturity framework inspired by Maslow's hierarchy of needs. The model conceptualises AI adoption as a progression through eight levels, beginning with initial exposure and basic understanding and culminating in ecosystem collaboration and societal impact. Each level integrates technical, organisational, and ethical dimensions, emphasising that AI maturity is not only a matter of infrastructure and capability but also of trust, governance, and responsibility. Initial validation of the model using four diverse case studies (General Motors, the Government of Estonia, the University of Texas System, and the African Union AI Strategy) demonstrates the model's contextual flexibility across various sectors. The model provides scholars with a framework for analysing AI maturity and offers practitioners and policymakers a diagnostic and strategic planning tool to guide responsible and sustainable AI engagement. The proposed model demonstrates that AI maturity progression is multi-dimensional, requiring technological capability, ethical integrity, organisational resilience, and ecosystem collaboration.


Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

Mustafa, Akram, Naseem, Usman, Azghadi, Mostafa Rahimi

arXiv.org Artificial Intelligence

Background: Clinical coding, particularly the classification of hierarchical ICD-10 codes from unstructured discharge summaries, is essential for healthcare operations, but remains a labor-intensive and error-prone task. Automated approaches using Large Language Models (LLMs) offer the potential to augment or replace human coders, yet their reliability and reasoning capabilities, which are needed to ensure accurate, explainable code assignments, are not well understood. Objective: This study aims to benchmark a diverse set of LLMs, both reasoning and non-reasoning models, on their ability to classify hierarchical ICD-10 codes from discharge summaries and to evaluate the effect of structured reasoning on model performance. Methods: Using the MIMIC-IV dataset, the study selected 1,500 discharge summaries labeled with the top 10 most frequent ICD-10 codes, balancing dataset size with the high computational and financial cost of using LLMs. We first preprocessed the data to extract clinically relevant tokens before feeding it to the LLMs. Specifically, we used cTAKES, a clinical NLP tool, to identify medical concepts. Each summary was encoded and submitted to 11 LLMs using a standardized, structured prompt simulating a clinical coder. Models were evaluated using the F1 score across three ICD-10 levels for both primary and all-diagnoses classification tasks. Results: Reasoning models on average outperformed non-reasoning models. The Gemini 2.5 Pro model demonstrated the highest performance across tasks.
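Evaluating F1 "across three ICD-10 levels" amounts to truncating predicted and gold codes to a given hierarchy depth before scoring. A minimal sketch of that idea follows; the specific level definitions and the micro-averaged F1 are illustrative assumptions, not the paper's exact protocol.

```python
def truncate(code: str, level: int) -> str:
    """Truncate an ICD-10 code to an assumed hierarchy level:
    level 1 keeps the 3-character category (e.g. 'E11'),
    level 2 adds the first decimal digit, level 3 keeps the full code."""
    head, _, tail = code.partition(".")
    if level == 1:
        return head
    if level == 2:
        return f"{head}.{tail[:1]}" if tail else head
    return code

def micro_f1(gold: list[set[str]], pred: list[set[str]], level: int) -> float:
    """Micro-averaged F1 over multi-label code sets at a given level."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g = {truncate(c, level) for c in g}
        p = {truncate(c, level) for c in p}
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A prediction of `E11.2` against a gold `E11.9` scores a perfect match at level 1 (both truncate to `E11`) but a miss at level 3, which is exactly the discrimination a per-level evaluation provides.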


Learning to Stop Overthinking at Test Time

Bao, Hieu Tran, Dat, Nguyen Cong, Anh, Nguyen Duc, Thanh-Tung, Hoang

arXiv.org Artificial Intelligence

Test-time scaling is currently one of the most active research areas, showing promise now that training-time scaling has reached its limits. Deep-thinking (DT) models are a class of recurrent models that can perform easy-to-hard generalization by assigning more compute to harder test samples. However, because they cannot determine the complexity of a test sample, DT models have to use a large amount of computation for both easy and hard test samples. Excessive test-time computation is wasteful and can cause the ``overthinking'' problem, where more test-time computation leads to worse results. In this paper, we introduce a test-time training method for determining the optimal amount of computation needed for each sample. We also propose Conv-LiGRU, a novel recurrent architecture for efficient and robust visual reasoning. Extensive experiments demonstrate that Conv-LiGRU is more stable than DT, effectively mitigates the ``overthinking'' phenomenon, and achieves superior accuracy.
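The core idea of per-sample adaptive compute can be sketched as iterating a recurrent step only until the state stops changing. The generic `step` function and fixed convergence tolerance below are hypothetical stand-ins for the paper's learned stopping rule, shown only to illustrate why easy inputs should halt early.

```python
import numpy as np

def run_with_halting(step, x, max_iters=50, tol=1e-4):
    """Iterate a recurrent update until the state stabilizes.
    `step` is any state-update function; iteration stops early when
    successive states differ by less than `tol`, so easy inputs use
    few steps and hard inputs use more (avoiding 'overthinking')."""
    state = x
    for i in range(1, max_iters + 1):
        new_state = step(state)
        if np.max(np.abs(new_state - state)) < tol:
            return new_state, i
        state = new_state
    return state, max_iters
```

With a fast-converging (contractive) step, the loop halts well before `max_iters`; a fixed-depth model would spend the full budget on the same input.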


OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations

Kang, Caixin, Chen, Yubo, Ruan, Shouwei, Zhao, Shiji, Zhang, Ruochen, Wang, Jiayi, Fu, Shan, Wei, Xingxing

arXiv.org Artificial Intelligence

With the rise of deep learning, facial recognition technology has seen extensive research and rapid development. Although facial recognition is considered a mature technology, we find that existing open-source models and commercial algorithms lack robustness in certain real-world Out-of-Distribution (OOD) scenarios, raising concerns about the reliability of these systems. In this paper, we introduce OODFace, which explores the OOD challenges faced by facial recognition models from two perspectives: common corruptions and appearance variations. We systematically design 30 OOD scenarios across 9 major categories tailored for facial recognition. By simulating these challenges on public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V, and YTF-C/V. We then conduct extensive experiments on 19 different facial recognition models and 3 commercial APIs, along with extended experiments on face masks, Vision-Language Models (VLMs), and defense strategies to assess their robustness. Based on the results, we draw several key insights, highlighting the vulnerability of facial recognition systems to OOD data and suggesting possible solutions. Additionally, we offer a unified toolkit that includes all corruption and variation types, easily extendable to other datasets. We hope that our benchmarks and findings can provide guidance for future improvements in facial recognition model robustness.
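Building a corrupted "-C" benchmark variant means applying each corruption to clean test images at graded severities. Here is a minimal sketch for one common corruption, Gaussian noise; the sigma schedule and fixed seed are assumed examples, not OODFace's exact recipe.

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: int = 1) -> np.ndarray:
    """Corrupt a uint8 image with additive Gaussian noise.
    Severity 1-5 selects an escalating (illustrative) sigma schedule,
    mirroring how common-corruption benchmarks grade difficulty."""
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1] * 255
    rng = np.random.default_rng(0)  # fixed seed for a reproducible benchmark
    noisy = image.astype(np.float64) + rng.normal(0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Sweeping every image in LFW, CFP-FP, or YTF through each corruption at each severity, and re-running verification, yields the robustness curves such a benchmark reports.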


Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Zhao, Juntu, Deng, Junyu, Ye, Yixin, Li, Chongxuan, Deng, Zhijie, Wang, Dequan

arXiv.org Artificial Intelligence

Advancements in text-to-image diffusion models have enabled a broad range of downstream practical applications, but such models often encounter misalignment issues between text and image. Taking the generation of a combination of two disentangled concepts as an example, given the prompt "a tea cup of iced coke", existing models usually generate a glass cup of iced coke, because iced coke usually co-occurs with a glass cup rather than a tea cup during model training. The root of such misalignment lies in the confusion in the latent semantic space of text-to-image diffusion models, and hence we refer to the "a tea cup of iced coke" phenomenon as Latent Concept Misalignment (LC-Mis). We leverage large language models (LLMs) to thoroughly investigate the scope of LC-Mis, and develop an automated pipeline for aligning the latent semantics of diffusion models to text prompts. Empirical assessments confirm the effectiveness of our approach, substantially reducing LC-Mis errors and enhancing the robustness and versatility of text-to-image diffusion models. Our code and dataset are available online.


Text classification in shipping industry using unsupervised models and Transformer based supervised models

Xie, Ying, Song, Dongping

arXiv.org Artificial Intelligence

Obtaining labelled data in a particular context can be expensive and time-consuming. Although different algorithms, including unsupervised learning, semi-supervised learning, and self-learning, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we propose a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry using the Standard International Trade Classification (SITC) codes. Our method represents words using pretrained GloVe word embeddings and finds the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Due to the lack of training data, the SITC numerical codes and the corresponding textual descriptions were used as training data. A small number of manually labelled cargo content records was used to evaluate the classification performance of the unsupervised model and the Transformer-based supervised models. The comparison reveals that unsupervised classification significantly outperforms Transformer-based supervised classification, even after increasing the size of the training dataset by 30%. A lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from succeeding in practical applications. Unsupervised classification provides an efficient and effective alternative for classifying text when training data is scarce.
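The embed-and-match scheme described above (average pretrained word vectors, pick the label whose description is most cosine-similar) can be sketched in a few lines. The toy vectors and label names below are made-up illustrations, not real GloVe embeddings or actual SITC descriptions.

```python
import numpy as np

def embed(text: str, glove: dict[str, np.ndarray]) -> np.ndarray:
    """Average the word vectors of in-vocabulary tokens; OOV tokens are skipped."""
    vecs = [glove[t] for t in text.lower().split() if t in glove]
    dim = len(next(iter(glove.values())))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classify(text: str, label_texts: dict[str, str],
             glove: dict[str, np.ndarray]) -> str:
    """Zero-shot classification: return the label whose textual
    description is most cosine-similar to the input text."""
    v = embed(text, glove)

    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    return max(label_texts, key=lambda lbl: cos(v, embed(label_texts[lbl], glove)))
```

Because the "training data" is just each code's textual description, no labelled examples are needed at all, which is exactly what makes the approach attractive when annotation is scarce.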


The different levels of autonomous vehicles - TechHQ

#artificialintelligence

We live in a fast-moving world that not long ago would have been considered science fiction. One aspect of technology that has driven us from fantasy into reality is the emergence of autonomous vehicles. Also known as self-driving cars, autonomous vehicles still confuse many people. How does the technology work? And what do we actually mean by work in the context of self-driving cars?


Motion Style Transfer: Modular Low-Rank Adaptation for Deep Motion Forecasting

Kothari, Parth, Li, Danya, Liu, Yuejiang, Alahi, Alexandre

arXiv.org Artificial Intelligence

Motion forecasting is an essential pillar for the successful deployment of autonomous systems in environments comprising various heterogeneous agents. It presents the challenges of modeling (i) universal etiquette (e.g., goal-directed behaviors, avoiding collisions) that governs the general motion dynamics of all agents; and (ii) social norms (e.g., minimum separation distance, preferred speed) that influence the navigation styles of different agents across different locations. Owing to the success of deep neural networks on large-scale datasets, learning prediction models in a data-driven manner has become a de facto approach for motion forecasting and has shown impressive results [1, 2, 3, 4]. However, existing deep forecasting models suffer from inferior performance when they encounter novel scenarios [5, 6, 7, 8]. For instance, a network trained with large-scale data for pedestrian forecasting struggles to generalize directly to cyclists. Some recent methods propose to incorporate strong priors robust to the underlying distribution shifts [9, 10, 11]. Yet, these priors often make strong assumptions about the distribution shifts, which may not hold in practice.
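The low-rank adaptation named in the title can be sketched as a frozen backbone weight plus a small trainable rank-r update per style. This generic LoRA-style layer is an illustration of the general technique, not the authors' specific modular design.

```python
import numpy as np

class LowRankAdapter:
    """Wrap a frozen weight matrix W with a trainable low-rank update:
    y = (W + B @ A) x. Only A and B (rank r << min(d_out, d_in)) are
    trained, so a tiny per-style module can specialize a shared backbone
    (e.g., pedestrian backbone adapted to cyclists)."""

    def __init__(self, W: np.ndarray, rank: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # frozen backbone weight
        self.A = rng.normal(0, 0.02, (rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))             # trainable up-projection,
                                                     # zero-init so the adapter
                                                     # starts as a no-op

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return self.W @ x + self.B @ (self.A @ x)
```

Zero-initializing `B` means the adapted model reproduces the backbone exactly before any style-specific training, a standard LoRA design choice that makes per-location or per-agent modules cheap to swap.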


Will cars ever be fully autonomous?

#artificialintelligence

Self-driving cars or autonomous vehicles are classified into various levels based on the level of automation built into them. Instead of a self-driving car, why not take the bus, you might ask. As you likely know, automated connected systems are no longer restricted to factories. They continue to percolate and expand in the daily thoroughfare of our lives. Gone are the days when owning and driving a car was a matter of privilege afforded by a select few.