

CMOMgen: Complex Multi-Ontology Alignment via Pattern-Guided In-Context Learning

Silva, Marta Contreiras, Faria, Daniel, Pesquita, Catia

arXiv.org Artificial Intelligence

Constructing comprehensive knowledge graphs requires the use of multiple ontologies to fully contextualize data within a domain. Ontology matching finds equivalences between concepts, interconnecting ontologies and creating a cohesive semantic layer. While the pairwise state of the art is well established, simple equivalence mappings cannot provide full semantic integration of related but disjoint ontologies. Complex multi-ontology matching (CMOM) aligns one source entity to composite logical expressions of multiple target entities, establishing more nuanced equivalences and provenance along the ontological hierarchy. We present CMOMgen, the first end-to-end CMOM strategy that generates complete and semantically sound mappings without restricting the number of target ontologies or entities. Retrieval-Augmented Generation selects relevant classes to compose the mapping and filters matching reference mappings to serve as examples, enhancing In-Context Learning. The strategy was evaluated on three biomedical tasks with partial reference alignments. CMOMgen outperforms baselines in class selection, demonstrating the impact of a dedicated strategy. It also achieves a minimum F1-score of 63%, outperforming all baselines and ablated versions in two out of three tasks and placing second in the third. Furthermore, a manual evaluation of non-reference mappings showed that 46% of the mappings achieve the maximum score, further substantiating the strategy's ability to construct semantically sound mappings.
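A minimal sketch of the retrieve-and-prompt pattern the abstract describes: candidate target classes and matching reference mappings are selected by embedding similarity and assembled into an in-context-learning prompt. The encoder, pools, and prompt wording below are placeholders, not CMOMgen's actual components.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder encoder: a real system would call a sentence-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def top_k(query: str, pool: list[str], k: int) -> list[str]:
    # Rank pool items by cosine similarity to the query (embeddings are unit-norm).
    q = embed(query)
    return sorted(pool, key=lambda p: -float(q @ embed(p)))[:k]

def build_prompt(source: str, target_classes: list[str], references: list[str]) -> str:
    candidates = top_k(source, target_classes, k=10)  # classes that may compose the mapping
    examples = top_k(source, references, k=3)         # matching reference mappings as ICL demos
    return (
        "Compose a logical expression over the candidate classes that is "
        "equivalent to the source entity.\n\n"
        "Examples:\n" + "\n".join(examples) + "\n\n"
        f"Source: {source}\nCandidates: {', '.join(candidates)}\nMapping:"
    )
```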


Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents

Zhang, Boxuan, Yu, Yi, Guo, Jiaxuan, Shao, Jing

arXiv.org Artificial Intelligence

The widespread deployment of Large Language Model (LLM) agents across real-world applications has unlocked tremendous potential while raising safety concerns. Among these concerns, the self-replication risk of LLM agents driven by objective misalignment (much like Agent Smith in the movie The Matrix) has drawn growing attention. Previous studies mainly examine whether LLM agents can self-replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real-world settings (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self-replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario-driven assessment of agent behaviors. Designing tasks that can induce misalignment between users' and agents' objectives makes it possible to decouple replication success from risk and to capture self-replication risks arising from these misalignment settings. We further introduce the Overuse Rate (OR) and Aggregate Overuse Count (AOC) metrics, which precisely capture the frequency and severity of uncontrolled replication. Our results underscore the urgent need for scenario-driven risk assessment and robust safeguards in the practical deployment of LLM agents. The rapid advancement of large language models (LLMs) has propelled LLM agents into widespread deployment across various domains, including code generation and web-based applications (Maslej et al., 2025; He et al., 2025a;c). As LLM agents take on critical tasks and interact with complex environments, they are often granted extensive operational permissions. While this combination of increased capability and operational permissions offers transformative potential, it also raises safety concerns (OpenAI, 2024b; Anthropic, 2023; Betley et al., 2025). Researchers are worried about the emerging safety risks of LLM agents' self-replication (OpenAI, 2024a; 2025; Black et al., 2025). Prior studies on LLM self-replication risks have mainly focused on measuring the capability (verbalized success rate) of self-replication, either through direct instructions or within synthetic capability benchmarks (Pan et al., 2024; 2025; Kran et al., 2025; Black et al., 2025).
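The abstract does not spell out how OR and AOC are computed; one plausible reading, sketched below, treats OR as the fraction of episodes in which the agent spawns more replicas than permitted and AOC as the total excess across episodes. The paper's exact definitions may differ.

```python
def overuse_rate(replicas_per_episode: list[int], allowed: int) -> float:
    # OR: how often the agent exceeded the permitted replica count.
    over = sum(1 for n in replicas_per_episode if n > allowed)
    return over / len(replicas_per_episode)

def aggregate_overuse_count(replicas_per_episode: list[int], allowed: int) -> int:
    # AOC: how severely it exceeded the count, summed over all episodes.
    return sum(max(0, n - allowed) for n in replicas_per_episode)

episodes = [1, 3, 5, 2]                              # replicas spawned per episode
print(overuse_rate(episodes, allowed=2))             # 0.5
print(aggregate_overuse_count(episodes, allowed=2))  # 4
```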


DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation

Esakkiraja, Esakkivel, Akhiyarov, Denis, Shanmugham, Aditya, Ganapathy, Chitra

arXiv.org Artificial Intelligence

Current search techniques are limited to standard RAG query-document applications. In this paper, we propose a novel technique to expand the code and the index for predicting the required APIs, directly enabling high-quality, end-to-end code generation for auto-completion and agentic AI applications. We address the problem of API leaks in current code-to-code benchmark datasets by introducing a new dataset built from real-world ServiceNow Script Includes that captures the challenge of unclear API usage intent in code. Our evaluation shows that this method achieves 87.86% top-40 retrieval accuracy, providing the critical API context needed for successful downstream code generation. To enable real-time predictions, we develop a comprehensive post-training pipeline that optimizes a compact 0.6B reranker through synthetic dataset generation, supervised fine-tuning, and reinforcement learning. This approach enables our compact reranker to outperform a much larger 8B model at 2.5x lower latency, effectively addressing the nuances of enterprise-specific code without the computational overhead of larger models.
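A schematic view of the retrieve-then-rerank flow, assuming "top-40 retrieval accuracy" means at least one gold API appears among the top 40 reranked candidates; the retriever and scoring model are stand-ins, not the paper's components.

```python
def rerank(query: str, candidates: list[str], score) -> list[str]:
    # Stage 2: a compact reranker scores each (query, API) pair; stage 1 is
    # assumed to have produced `candidates` via dense retrieval over the index.
    return sorted(candidates, key=lambda api: -score(query, api))

def top_k_accuracy(ranked: list[list[str]], gold: list[set[str]], k: int = 40) -> float:
    # Fraction of queries whose top-k list contains at least one gold API.
    hits = sum(any(api in r[:k] for api in g) for r, g in zip(ranked, gold))
    return hits / len(ranked)
```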


Automated Creation and Enrichment Framework for Improved Invocation of Enterprise APIs as Tools

Agarwal, Prerna, Gupta, Himanshu, Soni, Soujanya, Vallam, Rohith, Sindhgatta, Renuka, Mehta, Sameep

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have led to the development of agents capable of complex reasoning and interaction with external tools. In enterprise contexts, the effective use of such tools, which are often exposed through application programming interfaces (APIs), is hindered by poor documentation, complex input or output schemas, and a large number of operations. These challenges make tool selection difficult and reduce the accuracy of payload formation by up to 25%. We propose ACE, an automated tool creation and enrichment framework that transforms enterprise APIs into LLM-compatible tools. ACE (i) generates enriched tool specifications with parameter descriptions and examples to improve selection and invocation accuracy, and (ii) incorporates a dynamic shortlisting mechanism that filters relevant tools at runtime, reducing prompt complexity while maintaining scalability. We validate our framework on both proprietary and open-source APIs and demonstrate its integration with agentic frameworks. To the best of our knowledge, ACE is the first end-to-end framework that automates the creation, enrichment, and dynamic selection of enterprise API tools for LLM agents.
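A minimal sketch of what a runtime shortlisting step could look like, assuming each tool spec is paired with an embedding of its enriched description; the names and thresholds are illustrative, not ACE's API.

```python
import numpy as np

def shortlist_tools(query_emb: np.ndarray,
                    tools: list[tuple[dict, np.ndarray]],
                    k: int = 8, threshold: float = 0.3) -> list[dict]:
    # Keep only tools whose enriched descriptions are similar enough to the
    # query, so the agent's prompt carries a handful of specs instead of hundreds.
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(((cos(query_emb, emb), spec) for spec, emb in tools),
                    key=lambda pair: -pair[0])
    return [spec for score, spec in scored if score >= threshold][:k]
```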


DoTA-RAG: Dynamic of Thought Aggregation RAG

Ruangtanusak, Saksorn, Rungseesiripak, Natthapath, Rojratchadakorn, Peerawat, Charattrakool, Monthol, Nitarach, Natapong

arXiv.org Artificial Intelligence

In this paper, we introduce DoTA-RAG (Dynamic-of-Thought Aggregation RAG), a retrieval-augmented generation system optimized for high-throughput, large-scale web knowledge indexes. Traditional RAG pipelines often suffer from high latency and limited accuracy over massive, diverse datasets. DoTA-RAG addresses these challenges with a three-stage pipeline: query rewriting, dynamic routing to specialized sub-indexes, and multi-stage retrieval and ranking. We further enhance retrieval by evaluating and selecting a superior embedding model and re-embedding the large FineWeb-10BT corpus. Moreover, we create a diverse Q&A dataset of 500 questions generated via the DataMorgana setup across a broad range of WebOrganizer topics and formats. DoTA-RAG improves the answer correctness score from 0.752 (baseline, using the LiveRAG pre-built vector store) to 1.478 while maintaining low latency, and it achieves a 0.929 correctness score on the Live Challenge Day. These results highlight DoTA-RAG's potential for practical deployment in domains requiring fast, reliable access to large and evolving knowledge sources.
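The three-stage pipeline reads naturally as a small function; everything below (the rewriting prompt, routing callable, sub-index interface, and reranking scorer) is a hypothetical stand-in for the components the abstract names, not DoTA-RAG's code.

```python
def dota_rag_answer(query, llm, route, sub_indexes, rerank_score):
    # Stage 1: query rewriting.
    rewritten = llm(f"Rewrite this question as a search query: {query}")
    # Stage 2: dynamic routing to a specialized sub-index.
    index = sub_indexes[route(rewritten)]
    # Stage 3: multi-stage retrieval and ranking (broad recall, then precise rerank).
    candidates = index.search(rewritten, k=100)
    passages = sorted(candidates, key=lambda p: -rerank_score(rewritten, p))[:5]
    context = "\n\n".join(passages)
    return llm(f"Answer using only the context below.\n{context}\n\nQuestion: {query}")
```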


Enabling Novel Mission Operations and Interactions with ROSA: The Robot Operating System Agent

Royce, Rob, Kaufmann, Marcel, Becktor, Jonathan, Moon, Sangwoo, Carpenter, Kalind, Pak, Kai, Towler, Amanda, Thakker, Rohan, Khattak, Shehryar

arXiv.org Artificial Intelligence

The advancement of robotic systems has revolutionized numerous industries, yet their operation often demands specialized technical knowledge, limiting accessibility for non-expert users. This paper introduces ROSA (Robot Operating System Agent), an AI-powered agent that bridges the gap between the Robot Operating System (ROS) and natural language interfaces. By leveraging state-of-the-art language models and integrating open-source frameworks, ROSA enables operators to interact with robots using natural language, translating commands into actions and interfacing with ROS through well-defined tools. ROSA's design is modular and extensible, offering seamless integration with both ROS1 and ROS2, along with safety mechanisms like parameter validation and constraint enforcement to ensure secure, reliable operations. While ROSA was originally designed for ROS, it can be extended to work with other robotics middleware to maximize compatibility across missions. ROSA enhances human-robot interaction by democratizing access to complex robotic systems, empowering users of all expertise levels with multi-modal capabilities such as speech integration and visual perception. Ethical considerations are thoroughly addressed, guided by foundational principles like Asimov's Three Laws of Robotics, ensuring that AI integration promotes safety, transparency, privacy, and accountability. By making robotic technology more user-friendly and accessible, ROSA not only improves operational efficiency but also sets a new standard for responsible AI use in robotics and potentially future mission operations. This paper introduces ROSA's architecture and showcases initial mock-up operations in JPL's Mars Yard, a laboratory, and a simulation using three different robots. The core ROSA library is available as open-source.
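In the spirit of the parameter-validation safety mechanism described above, a tool exposed to the language model can enforce constraints before any command reaches the robot; the sketch below is illustrative, not ROSA's actual API.

```python
MAX_LINEAR_VEL = 0.5  # m/s; an assumed mission constraint, not a ROSA default

def set_velocity(linear: float, angular: float) -> str:
    """Tool the agent calls to drive the robot; invalid requests never reach ROS."""
    if abs(linear) > MAX_LINEAR_VEL:
        return f"Rejected: |linear| must be <= {MAX_LINEAR_VEL} m/s."
    # A real implementation would publish a geometry_msgs/Twist here
    # (e.g., on /cmd_vel via rospy for ROS1 or rclpy for ROS2).
    return f"OK: commanded linear={linear} m/s, angular={angular} rad/s."
```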


Text2VP: Generative AI for Visual Programming and Parametric Modeling

Feng, Guangxi, Yan, Wei

arXiv.org Artificial Intelligence

The integration of generative artificial intelligence (AI) into architectural design has witnessed a significant evolution, marked by recent advancements in AI for generating text, images, and 3D models. However, no generative models exist for the parametric models used in architectural design to generate and optimize a variety of design options, including free-form designs. This study creates and investigates an innovative application of generative AI in parametric modeling by leveraging a customized Text-to-Visual Programming (Text2VP) GPT derived from GPT-4. The primary focus is on automating the generation of graph-based visual programming workflows, including parameters and the links among the parameters, through AI-generated scripts, accurately reflecting users' design intentions and allowing users to change parameter values interactively. The Text2VP GPT customization process utilizes detailed and complete documentation of the visual programming language components, example-driven few-shot learning, and specific instructional guides. Our testing demonstrates Text2VP's capability to generate working parametric models. The paper also discusses the limitations of Text2VP; for example, more complex parametric model generation introduces higher error rates. This research highlights the potential of generative AI in visual programming and parametric modeling and sets a foundation for future enhancements to handle more sophisticated and intricate modeling tasks effectively. The study aims to allow designers to create and change design models without significant effort in learning a specific programming language such as Grasshopper.
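To make "graph-based visual programming workflows, including parameters and the links among the parameters" concrete, here is a hypothetical stand-in for such a graph (a slider driving a circle's radius); it is not Text2VP's actual output schema.

```python
# A Grasshopper-style parametric graph: one interactive parameter, one geometry
# node, and a link that propagates changes from the slider to the circle.
model = {
    "nodes": [
        {"id": "radius_slider", "type": "NumberSlider",
         "params": {"min": 1.0, "max": 10.0, "value": 5.0}},
        {"id": "circle_1", "type": "Circle", "params": {"plane": "XY"}},
    ],
    "links": [
        {"from": "radius_slider.value", "to": "circle_1.radius"},
    ],
}
```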


DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

Li, Jia, Li, Ge, Zhao, Yunfei, Li, Yongmin, Liu, Huanyu, Zhu, Hao, Wang, Lecheng, Liu, Kaibo, Fang, Zheng, Wang, Lanshen, Ding, Jiazheng, Zhang, Xuanming, Zhu, Yuqi, Dong, Yihong, Jin, Zhi, Li, Binhua, Huang, Fei, Li, Yongbin

arXiv.org Artificial Intelligence

How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To address this gap, we propose a new benchmark named DevEval, which offers three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose the task of repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs' predictions have been released.
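For reference, Pass@k numbers like those above are conventionally computed with the unbiased estimator of Chen et al. (2021); assuming DevEval follows that convention, the estimator looks like this:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: with n samples per problem, of which c pass the
    # tests, pass@k = 1 - C(n-c, k) / C(n, k), computed in a stable product form.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=10, c=3, k=1))  # 0.3: with k=1 this reduces to c/n
```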


DevEval: Evaluating Code Generation in Practical Software Projects

Li, Jia, Li, Ge, Zhao, Yunfei, Li, Yongmin, Jin, Zhi, Zhu, Hao, Liu, Huanyu, Liu, Kaibo, Wang, Lecheng, Fang, Zheng, Wang, Lanshen, Ding, Jiazheng, Zhang, Xuanming, Dong, Yihong, Zhu, Yuqi, Gu, Bin, Yang, Mengfei

arXiv.org Artificial Intelligence

How to evaluate Large Language Models (LLMs) in code generation is an open question. Many benchmarks have been proposed but are inconsistent with practical software projects, e.g., unrealistic program distributions, insufficient dependencies, and small-scale project contexts. Thus, the capabilities of LLMs in practical projects are still unclear. In this paper, we propose a new benchmark named DevEval, aligned with developers' experiences in practical projects. DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects and covering 10 domains. Compared to previous benchmarks, DevEval aligns with practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and sufficiently large project contexts. We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo is only 42 in our experiments. We also discuss the challenges and future directions of code generation in practical projects. We open-source DevEval and hope it can facilitate the development of code generation in practical projects.


DL Infra Series: C++ Concepts -- 3

#artificialintelligence

The DL Infra series aims to bridge the gap between engineering and research in Deep Learning. Since the field of DL, and AI in general, is moving fast, it is easy to get lost in the ocean of theory and forget the fundamentals. This series aims to bring fundamental infrastructure details to the audience in concise and digestible chunks. This blog deals with C++ concepts that will help you understand the C++ backend layer of PyTorch and other such low-level libraries. I hope the next time you dive deep into the PyTorch codebase, you will be in much better shape.