Large Language Model
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Wang, Piaohong, Tian, Motong, Li, Jiaxian, Liang, Yuan, Wang, Yuqing, Chen, Qianben, Wang, Tiannan, Lu, Zhicong, Ma, Jiawei, Jiang, Yuchen Eleanor, Zhou, Wangchunshu
Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.67% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem,the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem,the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning
Chowdhury, MD Thamed Bin Zaman, Hossain, Moazzem
ABSTRACT Reliable geospatial information on road accidents is vital for safety analysis and infrastructure planning, yet most low-and middle-income countries continue to face a critical shortage of accurate, location-specific crash data. Existing text-based geocoding tools perform poorly in multilingual and unstructured news environments, where incomplete place descriptions and mixed language (e.g. To address these limitations, this study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning) -- a vision-language framework that emulates human spatial reasoning to infer accident location coordinates directly from available textual and map-based cues. ALIGN integrates large language and vision-language model mechanisms within a multi-stage pipeline that performs optical character recognition, linguistic reasoning, and map-level verification through grid-based spatial scanning. The framework systematically evaluates each predicted location against contextual and visual evidence, ensuring interpretable, fine-grained geolocation outcomes without requiring model retraining. Applied to Bangla-language news data source, ALIGN demonstrates consistent improvements over traditional geoparsing methods, accurately identifying district-and sub-district-level crash sites. Beyond its technical contribution, the framework establishes a high accuracy foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the broader integration of multimodal artificial intelligence in transportation analytics. Hossain) 1. Introduction Accurate, fine-grained geospatial data is the bedrock of effective public safety policy, urban planning, and strategic response. For road safety, knowing the precise location of traffic crashes is essential for diagnosing high-risk black spots, deploying emergency services, and evaluating the impact of engineering interventions. While high-income nations increasingly rely on robust, integrated crash databases and vehicle telematics (Guo, Qian, & Shi, 2022; Szpytko & Nasan Agha, 2020), utilizing advanced methods such as deep learning on multi-vehicle trajectories (Yang et al., 2021), ensemble models integrating connected vehicle data (Yang et al., 2026), and 2 probe vehicle speed contour analysis (Wang et al., 2021), a significant'geospatial data desert' persists in most Low-and Middle-Income Countries (LMICs) (Mitra & Bhalla, 2023; Chang et al., 2020). This gap is particularly tragic given that these regions bear the overwhelming brunt of global road traffic fatalities. This research focuses on a low-resource country-Bangladesh, a nation that exemplifies this critical data-sparse challenge. The World Bank has estimated that the costs associated with traffic crashes can amount to as much as 5.1% of the country's Gross Domestic Product (World Bank, 2022).
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
Li, Jonathan, Farahini, Nasim, Iuliugin, Evgenii, Vesterlund, Magnus, Hรคggstrรถm, Christian, Wang, Guangtao, Upasani, Shubhangi, Sachdeva, Ayush, Li, Rui, Fu, Faline, Wu, Chen, Siddiqua, Ayesha, Long, John, Zhao, Tuowen, Musaddiq, Matheen, Zeffer, Hรฅkan, Du, Yun, Wang, Mingran, Li, Qinghua, Li, Bo, Thakker, Urmish, Prabhakar, Raghu
The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support have resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
Groeneveld, Jan Niklas, Qin, Xi, Schaefer, Alexander, Oren, Yaad
Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models blending the consideration of process rewards and outcome rewards. Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
Attention Sinks in Diffusion Language Models
Rulli, Maximo Eduardo, Petruzzi, Simone, Michielon, Edoardo, Silvestri, Fabrizio, Scardapane, Simone, Devoto, Alessio
Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
Chen, Hongzheng, Fan, Bin, Collins, Alexander, Hagedorn, Bastian, Gaburov, Evghenii, Masuda, Masahiro, Brookhart, Matthew, Sullivan, Chris, Knight, Jason, Zhang, Zhiru, Grover, Vinod
Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines--a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.
GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences
Dey, Priyanka, Rosa, Daniele, Zheng, Wenqing, Barcklow, Daniel, Zhao, Jieyu, Ferrara, Emilio
Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users' interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks -- including Hofstede's cultural dimensions, Schwartz's basic values, the World Values Survey, and Big Five OCEAN traits -- GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.
The Adoption Paradox for Veterinary Professionals in China: High Use of Artificial Intelligence Despite Low Familiarity
While the global integration of artificial intelligence (AI) into veterinary medicine is accelerating, its adoption dynamics in major markets such as China remain uncharacterized. This paper presents the first exploratory analysis of AI perception and adoption among veterinary professionals in China, based on a cross-sectional survey of 455 practitioners conducted in mid-2025. We identify a distinct "adoption paradox": although 71.0% of respondents have incorporated AI into their workflows, 44.6% of these active users report low familiarity with the technology. In contrast to the administrative-focused patterns observed in North America, adoption in China is practitioner-driven and centers on core clinical tasks, such as disease diagnosis (50.1%) and prescription calculation (44.8%). However, concerns regarding reliability and accuracy remain the primary barrier (54.3%), coexisting with a strong consensus (93.8%) for regulatory oversight. These findings suggest a unique "inside-out" integration model in China, characterized by high clinical utility but restricted by an "interpretability gap," underscoring the need for specialized tools and robust regulatory frameworks to safely harness AI's potential in this expanding market.
Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Theodoridis, Nikos, Brophy, Tim, Mohandas, Reenu, Sistu, Ganesh, Collins, Fiachra, Scanlan, Anthony, Eising, Ciaran
Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics
Song, Maojia, Liu, Renhang, Wang, Xinyu, Jiang, Yong, Xie, Pengjun, Huang, Fei, Zhou, Jingren, Herremans, Dorien, Poria, Soujanya
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.