Goto

Collaborating Authors

 placeholder



Mercury: ACodeEfficiencyBenchmarkforCode LargeLanguageModels

Neural Information Processing Systems

Amidst therecent strides inevaluating LargeLanguage Models forCode (Code LLMs), existing benchmarks havemainly focused onthefunctional correctness of generated code, neglecting the importance of their computational efficiency.


Supplementary Contents

Neural Information Processing Systems

A.1 Motivation For what purpose was the dataset created? As an affiliated dataset, we created MIMIC-CXR-VQA to provide a benchmark for medical visual question answering systems. Who created the dataset (e.g., which team, research group) and on behalf of which Who funded the creation of the dataset? This work was (partially) supported by Microsoft Research Asia, Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No.2019-0-00075, RS-2022-00155958), National Research Foundation of Korea (NRF) grant (NRF-2020H1D3A2A03100945), and the Korea Health Industry Development Institute (KHIDI) What do the instances that comprise the dataset represent (e.g., documents, photos, EHRXQA contains natural questions and corresponding SQL/NeuralSQL queries (text). How many instances are there in total (of each type, if appropriate)? In EHRXQA, there are about 46.2K instances (16,366 image-related samples, 16,529 table-related samples, and 13,257 image+table-related samples).


AquaFusionNet: Lightweight VisionSensor Fusion Framework for Real-Time Pathogen Detection and Water Quality Anomaly Prediction on Edge Devices

Kristanto, Sepyan Purnama, Hakim, Lutfi, Hermansyah, null

arXiv.org Artificial Intelligence

Abstract--Evidence from many low-and middle-income regions shows that microbial contamination in small-scale drinking-water systems often fluctuates rapidly, yet existing monitoring tools capture only fragments of this behaviour . Microscopic imaging provides organism-level visibility, whereas physicochemical sensors reveal short-term changes in water chemistry; in practice, operators must interpret these streams separately, making real-time decision-making unreliable. This study introduces AquaFusionNet, a lightweight cross-modal framework that unifies both information sources inside a single edge-deployable model. Unlike prior work that treats microscopic detection and water-quality prediction as independent tasks, AquaFusionNet learns the statistical dependencies between microbial appearance and concurrent sensor dynamics through a gated cross-attention mechanism designed specifically for low-power hardware. The framework is trained on AquaMicro12K, a new dataset comprising 12,846 annotated 1000 micrographs curated for drinking-water contexts, an area where publicly accessible microscopic datasets are scarce. Deployed for six months across seven facilities in East Java, Indonesia, the system processed 1.84 million frames and consistently detected contamination events with 94.8% mAP@0.5 and 96.3% anomaly-prediction accuracy, while operating at 4.8 W on a Jetson Nano. Comparative experiments against representative lightweight detectors show that AquaFusionNet provides higher accuracy at comparable or lower power, and field results indicate that cross-modal coupling reduces common failure modes of unimodal detectors, particularly under fouling, turbidity spikes, and inconsistent illumination. All models, data, and hardware designs are released openly to facilitate replication and adaptation in decentralized water-safety infrastructures. Safe drinking water is a prerequisite for public health, yet it remains out of reach for a substantial fraction of the global population. Recent estimates from the WHO/UNICEF Joint Monitoring Programme indicate that 2.2 billion people still lack safely managed drinking-water services and that unsafe water, sanitation, and hygiene (W ASH) contribute to approximately 1.4 million deaths per year [1], [2].


An Efficient and Almost Optimal Solver for the Joint Routing-Assignment Problem via Partial JRA and Large-α Optimization

Yuan, Qilong

arXiv.org Artificial Intelligence

The Joint Routing-Assignment (JRA) optimization problem simultaneously determines the assignment of items to placeholders and a Hamiltonian cycle that visits each node pair exactly once, with the objective of minimizing total travel cost. Previous studies introduced an exact mixed-integer programming (MIP) solver, along with datasets and a Gurobi implementation, showing that while the exact approach guarantees optimality, it becomes computationally inefficient for large-scale instances. To overcome this limitation, heuristic methods based on merging algorithms and shaking procedures were proposed, achieving solutions within approximately 1% deviation from the optimum. This work presents a novel and more efficient approach that attains high-accuracy, near-optimal solutions for large-scale JRA problems. The proposed method introduces a Partial Path Reconstructon (PPR) solver that first identifies key item-placeholder pairs to form a reduced subproblem, which is solved efficiently to refine the global solution. Using this PJAR framework, the initial heuristic merging solutions can be further improved, reducing the deviation by half. Moreover, the solution can be iteratively polished with PPR based solver along the optimization path to yield highly accurate tours. Additionally, a global Large-α constraint is incorporated into the JRA model to further enhance solution optimality. Experimental evaluations on benchmark datasets with n = 300, 500, and 1000 demonstrate that the proposed method consistently delivers almost optimal solutions, achieving an average deviation of 0.00% from the ground truth while maintaining high computational efficiency. Beyond the JRA problem, the proposed framework and methodologies exhibit strong potential for broader applications. The Framework can be applied to TSP and related optimization problems.


DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation

Kumar, Vishal, Mishra, Shubhra, Hao, Rebecca, Malik, Rizwaan, Broman, David, Demszky, Dorottya

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.


KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

Ye, Hancheng, Gao, Zhengqi, Ma, Mingyuan, Wang, Qinsi, Fu, Yuzhe, Chung, Ming-Yu, Lin, Yueqian, Liu, Zhijian, Zhang, Jianyi, Zhuo, Danyang, Chen, Yiran

arXiv.org Machine Learning

Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context-including prior turns-must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples-termed anchors-that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.


LLM Based Long Code Translation using Identifier Replacement

Chakraborty, Manojit, Ghosh, Madhusudan, Gupta, Rishabh

arXiv.org Artificial Intelligence

In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.


LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?

He, Ziyuan, Wang, Yuxuan, Li, Jiaqi, Liang, Kexin, Zhang, Muhan

arXiv.org Artificial Intelligence

Large language models (LLMs) are equipped with increasingly extended context windows recently, yet their long context understanding capabilities over long dependency tasks remain fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that were rarely benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark designed to evaluate LLMs' long context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, encompassing domains in law, finance, game and code. Accordingly, we delicately design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances with various diversity and complexity in a scalable data curation pipeline for further practical needs. We conduct a comprehensive assessment of 6 locally deployed and 4 API-based LLMs. The evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark. Despite the extensive context windows, popular LLMs are only capable of understanding a much shorter length of context than they claim to be, revealing significant limitations in their ability to handle real-world tasks with long dependencies and highlighting substantial room for model improvement in practical long-context understanding.


ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks

He, Liyang, Zhang, Yuren, Zhu, Ziwei, Li, Zhenghui, Tong, Shiwei

arXiv.org Artificial Intelligence

Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions. Code is avaliable at: https://github.com/hly1998/ChronoPlay.