AITopics

Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.

large language model, machine learning, natural language, (17 more...)

2512.0435

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Zhao, Zixiao, Fard, Fatemeh H., Wu, Jie JW

MANTRA: a Framework for Multi-stage Adaptive Noise TReAtment During Training

The reliable application of deep learning models to software engineering tasks hinges on high-quality training data. Yet, large-scale repositories inevitably introduce noisy or mislabeled examples that degrade both accuracy and robustness. While Noise Label Learning (NLL) has been extensively studied in other fields, there are a few works that investigate NLL in Software Engineering (SE) and Large Language Models (LLMs) for SE tasks. In this work, we propose MANTRA, a Multi-stage Adaptive Noise TReAtment framework that embeds noise diagnosis and mitigation directly into the fine-tuning process of code-Pretrained Language Models (PTM) and code-LLMs. We first investigate the effect of noise at varying levels on convergence and loss trajectories of the models. Then we apply an adaptive dropout strategy guided by per-sample loss dynamics and Gaussian Mixture Model clustering to exclude persistently noisy points while preserving clean data. Applying to code summarization and commit intent classification, our experiments reveal that some LLMs are more sensitive to noise than others. However, with MANTRA, the performance of all models in both tasks is improved. MANTRA enables researchers and practitioners to reduce the impact of errors introduced by the dataset in training, saves time in data cleaning and processing, while maximizing the effect of fine-tuning.

large language model, machine learning, natural language, (17 more...)

2512.04319

Country:

North America > United States (1.00)
Asia (0.67)
North America > Canada > British Columbia (0.28)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Fonseca, Rui, Martins, Bruno, Rocha, Gil

Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any humanly-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. W e specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. W e also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.

caption, large language model, machine learning, (19 more...)

2512.04309

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

ResponsibleRobotBench: Benchmarking Responsible Robot Manipulation using Multi-modal Large Language Models

Zhang, Lei, Dong, Ju, Bai, Kaixin, Ni, Minheng, Marton, Zoltan-Csaba, Chen, Zhaopeng, Zhang, Jianwei

Recent advances in large multimodal models have enabled new opportunities in embodied AI, particularly in robotic manipulation. These models have shown strong potential in generalization and reasoning, but achieving reliable and responsible robotic behavior in real-world settings remains an open challenge. In high-stakes environments, robotic agents must go beyond basic task execution to perform risk-aware reasoning, moral decision-making, and physically grounded planning. We introduce ResponsibleRobotBench, a systematic benchmark designed to evaluate and accelerate progress in responsible robotic manipulation from simulation to real world. This benchmark consists of 23 multi-stage tasks spanning diverse risk types, including electrical, chemical, and human-related hazards, and varying levels of physical and planning complexity. These tasks require agents to detect and mitigate risks, reason about safety, plan sequences of actions, and engage human assistance when necessary. Our benchmark includes a general-purpose evaluation framework that supports multimodal model-based agents with various action representation modalities. The framework integrates visual perception, context learning, prompt construction, hazard detection, reasoning and planning, and physical execution. It also provides a rich multimodal dataset, supports reproducible experiments, and includes standardized metrics such as success rate, safety rate, and safe success rate. Through extensive experimental setups, ResponsibleRobotBench enables analysis across risk categories, task types, and agent configurations. By emphasizing physical reliability, generalization, and safety in decision-making, this benchmark provides a foundation for advancing the development of trustworthy, real-world responsible dexterous robotic systems. https://sites.google.com/view/responsible-robotbench

artificial intelligence, large language model, natural language, (18 more...)

2512.04308

Genre: Research Report (0.64)

Industry:

Information Technology (0.46)
Materials (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Evaluating Long-Context Reasoning in LLM-Based WebAgents

Chung, Andy, Zhang, Yichi, Lin, Kaixiang, Rawal, Aditya, Gao, Qiaozi, Chai, Joyce

As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50\% in baseline conditions to less than 10\% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.

information, large language model, machine learning, (19 more...)

2512.04307

Country: North America > United States > California (0.46)

Genre:

Workflow (0.93)
Research Report > New Finding (0.93)

Industry:

Media (1.00)
Consumer Products & Services (1.00)
Transportation (0.93)
Leisure & Entertainment > Sports > Basketball (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Gondhalekar, Chinmay, Patel, Urjitkumar, Yeh, Fang-Chun

SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Abstract--Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding. I. Introduction Spreadsheets constitute the predominant medium for quantitative analysis across numerous disciplines, particularly in the field of finance.

large language model, machine learning, natural language, (19 more...)

2512.04292

Genre: Research Report (0.65)

Industry: Banking & Finance (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Artificial Intelligence Applications in Horizon Scanning for Infectious Diseases

Miles, Ian, Wakimoto, Mayumi, Meira, Wagner Jr., Paula, Daniela, Ticiane, Daylene, Rosa, Bruno, Biddulph, Jane, Georgiou, Stelios, Ermida, Valdir

This review explores the integration of Artificial Intelligence into Horizon Scanning, focusing on identifying and responding to emerging threats and opportunities linked to Infectious Diseases. We examine how AI tools can enhance signal detection, data monitoring, scenario analysis, and decision support. We also address the risks associated with AI adoption and propose strategies for effective implementation and governance. The findings contribute to the growing body of Foresight literature by demonstrating the potential and limitations of AI in Public Health preparedness.

information, large language model, machine learning, (17 more...)

2512.04287

Country: North America > United States (0.93)

Genre:

Overview (1.00)
Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Public Health (1.00)
(3 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.93)
Information Technology > Artificial Intelligence > Applied AI (0.93)
(3 more...)

Quantitative Analysis of Technical Debt and Pattern Violation in Large Language Model Architectures

Slater, Tyler

As Large Language Models (LLMs) transition from code completion tools to autonomous system architects, their impact on long-term software maintainability remains unquantified. While existing research benchmarks functional correctness (pass@k), this study presents the first empirical framework to measure "Architectural Erosion" and the accumulation of Technical Debt in AI-synthesized microservices. We conducted a comparative pilot study of three state-of-the-art models (GPT-5.1, Claude 4.5 Sonnet, and Llama 3 8B) by prompting them to implement a standardized Book Lending Microservice under strict Hexagonal Architecture constraints. Utilizing Abstract Syntax Tree (AST) parsing, we find that while proprietary models achieve high architectural conformance (0% violation rate for GPT-5.1), open-weights models exhibit critical divergence. Specifically, Llama 3 demonstrated an 80% Architectural Violation Rate, frequently bypassing interface adapters to create illegal circular dependencies between Domain and Infrastructure layers. Furthermore, we identified a phenomenon of "Implementation Laziness," where open-weights models generated 60% fewer Logical Lines of Code (LLOC) than their proprietary counterparts, effectively omitting complex business logic to satisfy token constraints. These findings suggest that without automated architectural linting, utilizing smaller open-weights models for system scaffolding accelerates the accumulation of structural technical debt.

large language model, llama 3, machine learning, (15 more...)

2512.04273

Country: North America > United States (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Platt, Nolan, Luchs, Ethan, Nizamani, Sehrish

Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage

Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.

large language model, machine learning, natural language, (20 more...)

doi: 10.1109/VL-HCC65237.2025.00024

2512.04262

Country: North America > United States > Virginia (0.16)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Michelet, Gaëtan, Schneider, Janine, Withanage, Aruna, Breitinger, Frank

Hey GPT-OSS, Looks Like You Got It -- Now Walk Me Through It! An Assessment of the Reasoning Language Models Chain of Thought Mechanism for Digital Forensics

The use of large language models in digital forensics has been widely explored. Beyond identifying potential applications, research has also focused on optimizing model performance for forensic tasks through fine-tuning. However, limited result explainability reduces their operational and legal usability. Recently, a new class of reasoning language models has emerged, designed to handle logic-based tasks through an `internal reasoning' mechanism. Yet, users typically see only the final answer, not the underlying reasoning. One of these reasoning models is gpt-oss, which can be deployed locally, providing full access to its underlying reasoning process. This article presents the first investigation into the potential of reasoning language models for digital forensics. Four test use cases are examined to assess the usability of the reasoning component in supporting result explainability. The evaluation combines a new quantitative metric with qualitative analysis. Findings show that the reasoning component aids in explaining and validating language model outputs in digital forensics at medium reasoning levels, but this support is often limited, and higher reasoning levels do not enhance response quality.

large language model, machine learning, natural language, (19 more...)

2512.04254

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.90)