Benchmarking and Analyzing 3D-aware Image Synthesis with a Modularized Codebase

Neural Information Processing Systems

Despite the rapid advance of 3D-aware image synthesis, existing studies usually adopt a mixture of techniques and tricks, leaving it unclear how much each part contributes to the final performance and how well it generalizes. Following the most popular and effective paradigm in this field, which incorporates a neural radiance field (NeRF) into the generator of a generative adversarial network (GAN), we build a well-structured codebase by modularizing the generation process. Such a design allows researchers to develop and replace each module independently, and hence offers an opportunity to fairly compare various approaches and recognize their contributions from the module perspective. The reproduction of a range of cutting-edge algorithms demonstrates the usability of our modularized codebase. We also perform a variety of in-depth analyses, such as the comparison across different types of point features, the necessity of the upsampler at the tail of the generator, and the reliance on the camera pose prior, which deepen our understanding of existing methods and point out promising directions for future research. Code and models will be made publicly available to facilitate the development and evaluation of this field.
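
The paradigm is easiest to see in code. Below is a minimal, hypothetical PyTorch sketch of such a modularized NeRF-in-GAN generation pipeline: each stage is an independent module, so alternatives (e.g. different point-feature types) can be developed and swapped without touching the rest. All class and function names are illustrative assumptions, not the codebase's actual API, and the tail upsampler is omitted for brevity.

    import torch
    import torch.nn as nn

    class PointEmbedder(nn.Module):
        # Maps sampled 3D points to per-point features; one swappable choice
        # among the "types of point features" the abstract compares.
        def __init__(self, dim=32):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, pts):               # pts: (B, R, S, 3)
            return self.mlp(pts)              # -> (B, R, S, dim)

    class RadianceHead(nn.Module):
        # Predicts per-point density and color from point features.
        def __init__(self, dim=32):
            super().__init__()
            self.sigma = nn.Linear(dim, 1)
            self.rgb = nn.Linear(dim, 3)

        def forward(self, feats):
            return torch.relu(self.sigma(feats)), torch.sigmoid(self.rgb(feats))

    def volume_render(sigma, rgb, deltas):
        # Standard NeRF alpha compositing along each ray.
        alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)          # (B, R, S)
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], -1),
            dim=-1)[..., :-1]
        weights = (alpha * trans).unsqueeze(-1)                       # (B, R, S, 1)
        return (weights * rgb).sum(dim=-2)                            # (B, R, 3)

    class Generator(nn.Module):
        # Composes the independent modules; a tail upsampler would refine
        # the low-resolution rendering into the final image.
        def __init__(self, dim=32):
            super().__init__()
            self.embedder = PointEmbedder(dim)
            self.head = RadianceHead(dim)

        def forward(self, pts, deltas):
            sigma, rgb = self.head(self.embedder(pts))
            return volume_render(sigma, rgb, deltas)

    # Usage with randomly sampled ray points (2 images, 64 rays, 16 samples/ray):
    pts = torch.randn(2, 64, 16, 3)
    deltas = torch.full((2, 64, 16), 0.1)
    print(Generator()(pts, deltas).shape)     # torch.Size([2, 64, 3])

Because each module exposes a narrow interface (points in, features out; features in, radiance out), replacing one implementation leaves the rest of the pipeline, and hence the comparison, unchanged.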


DeepCode: Open Agentic Coding

Li, Zongwei, Li, Zhonghang, Guo, Zirui, Ren, Xubin, Huang, Chao

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis, such as turning scientific papers into code, primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode orchestrates four information operations to maximize task-relevant signal under a finite context budget: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, outperforming leading commercial agents such as Cursor and Claude Code and, crucially, surpassing PhD-level human experts from top institutes on key reproduction metrics. By systematically transforming paper specifications into production-grade implementations comparable to human expert quality, this work establishes new foundations for autonomous scientific reproduction that can accelerate research evaluation and discovery.
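
To make the four information operations concrete, here is a heavily hedged Python sketch that wires them into one context-budgeted loop. Every name (distill_blueprint, CodeMemory, retrieve, and the stub LLM/test calls) is an assumption invented for illustration, not DeepCode's actual interface; the stubs stand in for real model calls.

    def distill_blueprint(paper: str, budget: int) -> str:
        # 1) Source compression: keep only as much of the source document
        # as fits the context budget (a crude stand-in for distillation).
        out, used = [], 0
        for line in (l for l in paper.splitlines() if l.strip()):
            if used + len(line) > budget:
                break
            out.append(line); used += len(line)
        return "\n".join(out)

    class CodeMemory:
        # 2) Structured indexing: a stateful index over files written so
        # far, summarized instead of re-read in full on every step.
        def __init__(self):
            self.files = {}
        def write(self, path, code):
            self.files[path] = code
        def summary(self):
            return "\n".join(f"{p}: {len(c)} chars" for p, c in self.files.items())

    def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
        # 3) Conditional knowledge injection: naive overlap-based retrieval.
        score = lambda doc: len(set(query.lower().split()) & set(doc.lower().split()))
        return sorted(corpus, key=score, reverse=True)[:k]

    def generate_file(context: str) -> str:
        # Stand-in for the LLM call that emits code from the assembled context.
        return f"# generated from {len(context)} chars of context\n"

    def run_checks(memory: CodeMemory) -> list[str]:
        # 4) Closed-loop error correction: run tests/linters, return errors.
        return []  # stub: pretend everything passes

    def synthesize(paper, corpus, files, budget=2000):
        memory, blueprint = CodeMemory(), distill_blueprint(paper, budget)
        for path in files:
            context = "\n".join([blueprint, memory.summary(), *retrieve(path, corpus)])
            memory.write(path, generate_file(context))
            if errs := run_checks(memory):  # regenerate with error feedback
                memory.write(path, generate_file(context + "\n".join(errs)))
        return memory.files

    print(synthesize("Method: train a model.\nEval: accuracy.",
                     ["docs about train"], ["train.py"]))

The design point is that the raw paper never enters the context whole: each generation step sees only the compressed blueprint, a summary of prior files, and retrieved snippets, keeping the total within budget.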


SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Deng, Xiang, Da, Jeff, Pan, Edwin, He, Yannis Yiming, Ide, Charles, Garg, Kanak, Lauffer, Niklas, Park, Andrew, Pasari, Nitin, Rane, Chetan, Sampath, Karmini, Krishnan, Maya, Kundurthy, Srivatsa, Hendryx, Sean, Wang, Zifan, Bharadwaj, Vijay, Holm, Jeff, Aluri, Raja, Zhang, Chen Bo Calvin, Jacobson, Noah, Liu, Bing, Kenstler, Brad

arXiv.org Artificial Intelligence

We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench [25] but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. SWE-Bench Pro contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories, and a commercial set of 18 proprietary repositories covered by formal partnership agreements with early-stage startups. Problems in the held-out and commercial sets are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. To better understand the limitations of current agents, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.
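
For readers modeling the three-way partition in their own tooling, the following small sketch captures the structure the abstract describes. The field names and record shape are illustrative guesses, not the benchmark's released schema.

    from dataclasses import dataclass
    from enum import Enum

    class Partition(Enum):
        PUBLIC = "public"          # 11 repositories, problems openly released
        HELD_OUT = "held_out"      # 12 repositories, not publicly accessible
        COMMERCIAL = "commercial"  # 18 proprietary repos (only results released)

    @dataclass(frozen=True)
    class Task:
        repo: str
        task_id: str
        partition: Partition
        files_touched: int         # long-horizon tasks often span multiple files

    def by_partition(tasks, part):
        return [t for t in tasks if t.partition is part]

    tasks = [
        Task("acme/webapp", "T-001", Partition.PUBLIC, 7),
        Task("b2b/billing", "T-002", Partition.COMMERCIAL, 12),
    ]
    print(len(by_partition(tasks, Partition.PUBLIC)))  # 1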


Unsupervised Learning of Compositional Energy Concepts: Appendix

Neural Information Processing Systems

In this supplement, we provide additional empirical visualizations of our approach in Section A.1 and details on the experimental setup in Section A.2; metric scores are reported as the mean and standard deviation (s.d.) across 3 random seeds. In COMET, we utilize a residual network to parameterize the underlying energy function, whose architecture is illustrated in Figure 2; the energy function takes an image as input, and we remove normalization layers from the residual network. To infer global factors from an input image, we utilize the convolutional encoder shown in Figure 3. The overall architecture is illustrated in Figure 4, and training details follow.
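
As a rough guide to the pieces named above, here is a hedged PyTorch sketch of a residual energy function with normalization layers removed and a convolutional encoder producing global latent factors. Layer counts, widths, and the conditioning scheme are illustrative assumptions, not COMET's actual architecture.

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            # No BatchNorm/GroupNorm here, matching the stated removal of
            # normalization layers from the residual network.
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        def forward(self, x):
            return x + self.conv2(torch.relu(self.conv1(torch.relu(x))))

    class EnergyFunction(nn.Module):
        # E(x, z): scalar energy of an image x under a latent concept code z.
        def __init__(self, ch=32, z_dim=16):
            super().__init__()
            self.stem = nn.Conv2d(3, ch, 3, padding=1)
            self.blocks = nn.Sequential(ResBlock(ch), ResBlock(ch))
            self.cond = nn.Linear(z_dim, ch)   # illustrative conditioning on z
            self.head = nn.Linear(ch, 1)
        def forward(self, x, z):
            h = self.blocks(self.stem(x)).mean(dim=(2, 3))  # global pooling
            return self.head(h * (1 + self.cond(z)))

    class GlobalEncoder(nn.Module):
        # Convolutional encoder inferring global factors from an input image.
        def __init__(self, z_dim=16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, z_dim))
        def forward(self, x):
            return self.net(x)

    x = torch.randn(2, 3, 32, 32)
    z = GlobalEncoder()(x)
    print(EnergyFunction()(x, z).shape)  # torch.Size([2, 1])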



Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Miyai, Atsuyuki, Toyooka, Mashiro, Otonari, Takashi, Zhao, Zaiying, Aizawa, Kiyoharu

arXiv.org Artificial Intelligence

Understanding the current capabilities and risks of AI Scientist systems is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: given a baseline paper from a human mentor, it analyzes the paper's limitations, formulates novel hypotheses for improvement, iteratively conducts experiments until improvements are realized, and writes a paper reporting the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. Through our experiments, Jr. AI Scientist successfully generated new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel methods. For evaluation, we conducted automated assessments using AI reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven scientific contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores than existing fully automated systems. Nevertheless, we identify important limitations from both the author evaluations and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report the risks identified during development. We believe this study clarifies the current role and limitations of AI Scientist systems, offering insight into the areas that still require human expertise and the risks that may emerge as these systems evolve.
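
The workflow sketched in the abstract (analyze limitations, hypothesize, implement, experiment, write up) can be summarized as a simple loop. The following Python sketch is purely illustrative: every function is a hypothetical stand-in for an LLM or coding-agent call, not the system's actual interface.

    import random

    # Hypothetical stand-ins for LLM / coding-agent calls:
    def analyze_limitations(paper):       return ["no ablation over depth"]
    def formulate_hypothesis(paper, lim): return f"fixing '{lim[0]}' improves accuracy"
    def implement(hypothesis):            return f"patch: {hypothesis}"
    def run_experiments(patch):           return {"accuracy": random.uniform(0.4, 0.8)}
    def refine(hypothesis, results):      return hypothesis + " (refined)"
    def write_paper(h, results):          return f"paper reporting {results} for: {h}"

    def jr_ai_scientist_loop(paper="baseline paper", baseline_acc=0.6, max_iters=5):
        # Analyze -> hypothesize -> implement -> experiment, iterating until
        # the experiments beat the baseline, then write up the results.
        hypothesis = formulate_hypothesis(paper, analyze_limitations(paper))
        for _ in range(max_iters):
            results = run_experiments(implement(hypothesis))
            if results["accuracy"] > baseline_acc:
                return write_paper(hypothesis, results)
            hypothesis = refine(hypothesis, results)
        return None  # no improvement realized within the iteration budget

    random.seed(0)
    print(jr_ai_scientist_loop())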