
Collaborating Authors

 Pan, Zhenyu


MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

arXiv.org Artificial Intelligence

We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground-truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered-image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process in which the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, progressively improving scene coherence. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of models at various scales. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital-twin, and game-development applications.
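
To make the multi-turn refinement idea concrete, here is a minimal, hypothetical sketch: `propose_layout` stands in for the VLM call, and the physics-aware reward is reduced to collision and room-boundary penalties over axis-aligned bounding boxes. None of the names, weights, or the random proposal step come from the paper.

```python
# Minimal sketch (not the authors' code): a multi-turn refinement loop where a
# physics-aware score over a proposed 3D layout is fed back to the generator.
import random
from dataclasses import dataclass

@dataclass
class Box:                        # axis-aligned footprint of one placed object
    x: float; y: float; w: float; d: float

def overlaps(a: Box, b: Box) -> bool:
    return abs(a.x - b.x) < (a.w + b.w) / 2 and abs(a.y - b.y) < (a.d + b.d) / 2

def physics_reward(layout: list[Box], room: float = 10.0) -> float:
    """Penalize object-object collisions and out-of-room placements."""
    collisions = sum(overlaps(a, b) for i, a in enumerate(layout) for b in layout[i + 1:])
    outside = sum(not (0.0 <= b.x <= room and 0.0 <= b.y <= room) for b in layout)
    return -1.0 * collisions - 1.0 * outside

def propose_layout(n_objects: int, feedback: float | None = None) -> list[Box]:
    # Placeholder for the VLM: random placements instead of model output.
    return [Box(random.uniform(0, 10), random.uniform(0, 10), 1.0, 1.0)
            for _ in range(n_objects)]

best_layout, best_reward = None, float("-inf")
for turn in range(5):                          # multi-turn refinement loop
    layout = propose_layout(6, feedback=best_reward)
    reward = physics_reward(layout)
    if reward > best_reward:
        best_layout, best_reward = layout, reward
print("best physics-aware reward:", best_reward)
```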


Retrieval-Augmented Generation with Hierarchical Knowledge

arXiv.org Artificial Intelligence

Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately exploit the hierarchical knowledge that is naturally inherent in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure-capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over state-of-the-art baseline methods. The code of our proposed method is available at https://github.com/hhy-huang/HiRAG.
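
As a rough illustration of the hierarchical indexing-and-retrieval idea (not HiRAG's actual pipeline), the sketch below scores coarse summary nodes first and only searches the chunks grouped under the best-matching summaries; the word-overlap similarity and the two-level structure are toy assumptions.

```python
# Minimal sketch: two-level hierarchical retrieval with a toy similarity.
def sim(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

hierarchy = {                                  # summary node -> member chunks (assumed)
    "graph neural networks and message passing": [
        "GCN layers aggregate neighbor features",
        "GAT uses attention over edges",
    ],
    "retrieval augmented generation pipelines": [
        "RAG retrieves documents before generation",
        "indexing builds a searchable corpus",
    ],
}

def hierarchical_retrieve(query: str, top_summaries: int = 1, top_chunks: int = 2) -> list[str]:
    # Level 1: rank summary nodes; Level 2: rank chunks under the winners.
    best = sorted(hierarchy, key=lambda s: sim(query, s), reverse=True)[:top_summaries]
    candidates = [c for s in best for c in hierarchy[s]]
    return sorted(candidates, key=lambda c: sim(query, c), reverse=True)[:top_chunks]

print(hierarchical_retrieve("how does retrieval work in RAG indexing"))
```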


Do Code LLMs Understand Design Patterns?

arXiv.org Artificial Intelligence

Code Large Language Models (LLMs) demonstrate great versatility in adapting to various downstream tasks, including code generation and completion, as well as bug detection and fixing. However, Code LLMs often fail to capture existing coding standards, leading to generated code that conflicts with the design patterns required by a given project. As a result, developers must post-process the generated code to adapt it to the project's design norms. In this work, we empirically investigate the biases of Code LLMs in software development. Through carefully designed experiments, we assess the models' understanding of design patterns across recognition, comprehension, and generation. Our findings reveal that biases in Code LLMs significantly affect the reliability of downstream tasks.
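
A minimal, hypothetical sketch of a recognition-style probe in the spirit described above: labeled snippets are shown to a model and the predicted pattern name is checked against the label. `ask_model` is a placeholder for any completion API, and the snippets and metric are assumptions rather than the paper's benchmark.

```python
# Toy probe for design-pattern recognition accuracy (illustrative only).
SNIPPETS = {
    "singleton": "class Config:\n    _inst = None\n    def __new__(cls):\n"
                 "        if cls._inst is None:\n            cls._inst = super().__new__(cls)\n"
                 "        return cls._inst",
    "observer": "class Subject:\n    def __init__(self): self.subs = []\n"
                "    def notify(self, msg): [s(msg) for s in self.subs]",
}

def ask_model(prompt: str) -> str:
    return "singleton"            # placeholder answer standing in for an LLM call

def recognition_accuracy() -> float:
    correct = 0
    for label, code in SNIPPETS.items():
        pred = ask_model(f"Which design pattern does this code implement?\n{code}")
        correct += int(label in pred.lower())
    return correct / len(SNIPPETS)

print(f"recognition accuracy: {recognition_accuracy():.2f}")
```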


Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?

arXiv.org Artificial Intelligence

Code completion, a key downstream task in code generation, is one of the most frequent and impactful ways to enhance developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation benchmark that enables meaningful comparisons between products and guides future advancements. However, existing benchmarks focus on coarse-grained tasks that resemble general code generation rather than the real-world scenarios developers encounter, and they lack analysis grounded in industrial usage. Moreover, these benchmarks often rely on costly and time-consuming human annotation, and their standalone test cases fail to leverage minimal tests for maximal repository-level understanding and code coverage. To address these limitations, we first analyze business data from an industrial code completion tool and redefine the evaluation criteria to better align with the developer's intent and desired completion behavior throughout the coding process. Based on these insights, we introduce Codev-Agent, an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage, ensuring fair and effective comparisons. Using Codev-Agent, we present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Bench assesses whether a code completion tool can capture a developer's immediate intent and suggest appropriate code across diverse contexts, providing a more realistic benchmark for code completion in modern software development.
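
The Codev-Agent pipeline is only described at a high level above; as a small sketch under heavy assumptions (it is not the authors' tooling), the snippet below uses Python's standard `ast` module to recover which repository functions a unit test calls, a crude "calling chain", and turns one of them into a fill-in completion sample by cutting its body.

```python
# Toy calling-chain extraction and completion-sample construction.
import ast

test_src = '''
def test_total():
    assert total([1, 2, 3]) == 6
'''
repo_src = '''
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
'''

# Functions directly called from the unit test (a one-hop "calling chain").
called = {n.func.id for n in ast.walk(ast.parse(test_src))
          if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}

samples = []
for node in ast.walk(ast.parse(repo_src)):
    if isinstance(node, ast.FunctionDef) and node.name in called:
        lines = repo_src.splitlines()
        cut = node.body[len(node.body) // 2].lineno - 1   # keep the first half of the body
        samples.append({"prefix": "\n".join(lines[:cut]),
                        "ground_truth": "\n".join(lines[cut:]),
                        "oracle_test": test_src})

print(samples[0]["prefix"])
```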


Conv-CoA: Improving Open-domain Question Answering in Large Language Models via Conversational Chain-of-Action

arXiv.org Artificial Intelligence

We present a Conversational Chain-of-Action (Conv-CoA) framework for Open-domain Conversational Question Answering (OCQA). Compared with the literature, Conv-CoA addresses three major challenges: (i) unfaithful hallucination that is inconsistent with real-time or domain facts, (ii) weak reasoning performance in conversational scenarios, and (iii) unsatisfactory performance in conversational information retrieval. Our key contribution is a dynamic reasoning-retrieval mechanism that extracts the intent of the question and decomposes it into a reasoning chain to be solved via systematic prompting, pre-designed actions, updates to the Contextual Knowledge Set (CKS), and a novel Hopfield-based retriever. Methodologically, we propose a resource-efficient Hopfield retriever to enhance the efficiency and accuracy of conversational information retrieval within our actions. Additionally, we propose a conversational multi-reference faith score (Conv-MRFS) to verify and resolve conflicts between retrieved knowledge and answers in conversations. Empirically, we compare our framework with 23 state-of-the-art methods across five different research directions and two public benchmarks. These comparisons demonstrate that Conv-CoA outperforms the other methods in both accuracy and efficiency.
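
For intuition only (the paper's retriever is more elaborate), the sketch below shows a standard modern-Hopfield-style readout sometimes used for dense retrieval: a softmax over query-memory similarities followed by a weighted readout of stored passage embeddings. The embeddings here are random placeholders.

```python
# One modern-Hopfield-style retrieval step over stored passage embeddings.
import numpy as np

def hopfield_retrieve(query: np.ndarray, memory: np.ndarray, beta: float = 4.0):
    """memory: (num_passages, dim) stored patterns; query: (dim,)."""
    scores = beta * memory @ query                    # similarity to each stored pattern
    p = np.exp(scores - scores.max())
    p /= p.sum()                                      # softmax attention over memory
    return p, p @ memory                              # retrieval weights, retrieved pattern

rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 8))
query = memory[2] + 0.1 * rng.normal(size=8)          # noisy version of passage 2
weights, _ = hopfield_retrieve(query, memory)
print("most relevant passage index:", int(weights.argmax()))
```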


HeteGraph-Mamba: Heterogeneous Graph Learning via Selective State Space Model

arXiv.org Artificial Intelligence

We propose a heterogeneous graph mamba network (HGMN) as the first exploration of leveraging selective state space models (SSSMs) for heterogeneous graph learning. Compared with the literature, HGMN overcomes two major challenges: (i) capturing long-range dependencies among heterogeneous nodes and (ii) adapting SSSMs to heterogeneous graph data. Our key contribution is a general graph architecture that can handle heterogeneous nodes in real-world scenarios, followed by an efficient processing flow. Methodologically, we introduce a two-level, efficient tokenization approach that first captures long-range dependencies within identical node types and subsequently across all node types. Empirically, we compare our framework with 19 state-of-the-art methods on heterogeneous graph benchmarks. The extensive comparisons demonstrate that our framework outperforms the other methods in both accuracy and efficiency.
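
A toy sketch of the two-level tokenization idea (assumptions only; HGMN's actual ordering, features, and state space model differ): node tokens are first ordered within each node type, the per-type blocks are then concatenated into one sequence, and a trivial recurrence stands in for the selective state space model.

```python
# Two-level tokenization of a heterogeneous graph into one token sequence.
from collections import defaultdict

nodes = {                        # node -> (node type, scalar feature)
    "p1": ("paper", 0.9), "p2": ("paper", 0.4),
    "a1": ("author", 0.7), "a2": ("author", 0.2), "v1": ("venue", 0.5),
}
edges = [("p1", "a1"), ("p1", "v1"), ("p2", "a1"), ("p2", "a2")]

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

by_type = defaultdict(list)
for n, (t, _) in nodes.items():
    by_type[t].append(n)

token_order = []
for t in sorted(by_type):                                        # level 2: across node types
    token_order += sorted(by_type[t], key=lambda n: -degree[n])  # level 1: within one type

h = 0.0                                   # toy recurrence standing in for the SSM scan
for n in token_order:
    h = 0.9 * h + nodes[n][1]
print(token_order, round(h, 3))
```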


Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models

arXiv.org Artificial Intelligence

We present a Chain-of-Action (CoA) framework for multimodal and retrieval-augmented Question-Answering (QA). Compared to the literature, CoA overcomes two major challenges of current QA applications: (i) unfaithful hallucination that is inconsistent with real-time or domain facts and (ii) weak reasoning performance over compositional information. Our key contribution is a novel reasoning-retrieval mechanism that decomposes a complex question into a reasoning chain via systematic prompting and pre-designed actions. Methodologically, we propose three types of domain-adaptable "Plug-and-Play" actions for retrieving real-time information from heterogeneous sources. We also propose a multi-reference faith score (MRFS) to verify and resolve conflicts in the answers. Empirically, we exploit both public benchmarks and a Web3 case study to demonstrate the capability of CoA over other methods.
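
The multi-reference faith score (MRFS) is defined in the paper; below is only a toy stand-in for the general idea of checking a candidate answer against several retrieved references and flagging it when no reference supports it strongly. The word-overlap score and the threshold are purely illustrative.

```python
# Toy faithfulness check against multiple retrieved references.
def support(answer: str, reference: str) -> float:
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / (len(a) or 1)

def faith_check(answer: str, references: list[str], threshold: float = 0.5):
    scores = [support(answer, ref) for ref in references]
    return max(scores), max(scores) >= threshold      # best support, pass/fail flag

answer = "the token launched in 2021"
refs = ["The token was launched in 2021 on mainnet.", "Staking opened in 2022."]
score, faithful = faith_check(answer, refs)
print(f"max support={score:.2f}, faithful={faithful}")
```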


CoRMF: Criticality-Ordered Recurrent Mean Field Ising Solver

arXiv.org Machine Learning

We propose an RNN-based efficient Ising model solver, the Criticality-ordered Recurrent Mean Field (CoRMF), for forward Ising problems. At its core, a criticality-ordered spin sequence of an N-spin Ising model is introduced by sorting mission-critical edges with a greedy algorithm, such that an autoregressive mean-field factorization can be utilized and optimized with Recurrent Neural Networks (RNNs). Our method has two notable characteristics: (i) by leveraging the approximated tree structure of the underlying Ising graph, the newly obtained criticality order enables the unification between variational mean-field and RNNs, allowing the generally intractable Ising model to be efficiently probed with probabilistic inference; (ii) it is well-modularized and model-independent while at the same time expressive enough, and hence fully applicable to any forward Ising inference problem with minimal effort. Computationally, by using a variance-reduced Monte Carlo gradient estimator, CoRMF solves Ising problems in a self-training fashion without data/evidence, and inference tasks can be executed by directly sampling from the RNN. Theoretically, we establish a provably tighter error bound than naive mean-field.

On one hand, the connection between NP problems and Ising models has resulted in strong physics intuitions [Kirkpatrick et al., 1983] that the hardness of these problems emerges through the lens of complex energy landscapes over discrete random variables with multiple local minima [Barahona, 1982, Chowdhury, 2014]. On the other hand, the computational difficulty on the Ising side resonates with the difficulties of numerous significant scientific problems, including numerous other combinatorial decision-making and optimization problems [Benati and Rizzi, 2007, Ngo et al., 1994, Garey and Johnson, 1979]. As the opposite of conventional inverse Ising problems [Nguyen et al., 2017, Reneau et al., 2023] that reconstruct graphical structure from data, we refer to these problems, which have pre-specified graphical structures, as forward Ising problems (combinatorial inference and optimization problems in Ising formulations [De las Cuevas and Cubitt, 2016, Lucas, 2014, Pan et al., 2023]), and any efficient computational method or hardware solver [Mohseni et al., 2022] for Ising models can potentially benefit them. To describe the Ising model, we first introduce some notation here. We consider an Ising model of N spins as an exponential family model for binary N-spin data with up to quadratic sufficient statistics, taking the Boltzmann form.
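
For readers unfamiliar with the setup, the Boltzmann form and the autoregressive factorization alluded to above can be written in standard Ising-model notation as follows; the exact sign, temperature, and field conventions used in the paper may differ, and the permutation denotes the criticality-based spin ordering.

```latex
% Standard notation (assumed, not copied from the paper): the target Boltzmann
% distribution of an N-spin Ising model and the autoregressive mean-field
% surrogate ordered by a criticality-based permutation \sigma.
\begin{aligned}
  p(\mathbf{x}) &= \frac{e^{-\beta E(\mathbf{x})}}{Z}, \qquad
  E(\mathbf{x}) = -\sum_{(i,j)\in\mathcal{E}} J_{ij}\, x_i x_j - \sum_{i=1}^{N} h_i x_i,
  \qquad x_i \in \{-1, +1\},\\[4pt]
  q_\theta(\mathbf{x}) &= \prod_{i=1}^{N}
    q_\theta\!\left(x_{\sigma(i)} \,\middle|\, x_{\sigma(1)}, \dots, x_{\sigma(i-1)}\right).
\end{aligned}
```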