Goto

Collaborating Authors

 Tang, Shixiang


ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.


FuncGenFoil: Airfoil Generation and Editing Model in Function Space

arXiv.org Artificial Intelligence

Aircraft manufacturing is the jewel in the crown of industry, among which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. While existing deep-learning-based methods rely on predefined parametric function families, e.g., B\'ezier curves and discrete point-based representations, they suffer from inherent trade-offs between expressiveness and resolution flexibility. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly learns functional airfoil geometries. Our method inherits both the advantages of arbitrary resolution sampling and the smoothness of parametric functions, as well as the strong expressiveness of discrete point-based functions. Empirical evaluations on the AFBench dataset demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation by achieving a relative -74.4 label error reduction and +23.2 diversity increase on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design. Our code will be released.


Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

arXiv.org Artificial Intelligence

In this survey, we present community appeals for a unified framework [Ci et al., 2023; a comprehensive overview of HcFMs by proposing Wang et al., 2023; Chen et al., 2024; Huang et al., 2024a] to a taxonomy that categorizes current approaches unlock systematic understanding and a wide range of humancentric into four groups: (1) Human-centric Perception applications for everybody. Foundation Models that capture fine-grained features Inspired by rapid advancements of general foundation models, for multi-modal 2D and 3D understanding; (2) e.g., large language models (LLMs), large vision models Human-centric AIGC Foundation Models that generate (LVMs) and text-to-image generative models, and their high-fidelity, diverse human-related content; presents of a paradigm shift from end-to-end learning of (3) Unified Perception and Generation Models that task-specific models to generalist models, a recent trend is integrate these capabilities to enhance both human to develop Human-centric Foundation Models (HcFM) that understanding and synthesis; and (4) Human-centric satisfy three criteria, namely generalization, broad applicability, Agentic Foundation Models that extend beyond perception and high fidelity. Generalization ensures robustness and generation to learn human-like intelligence to unseen conditions, enabling the model to perform consistently and interactive behaviors for humanoid embodied across varied environments.


Acquisition through My Eyes and Steps: A Joint Predictive Agent Model in Egocentric Worlds

arXiv.org Artificial Intelligence

This paper addresses the task of learning an agent model behaving like humans, which can jointly perceive, predict, and act in egocentric worlds. Previous methods usually train separate models for these three abilities, leading to information silos among them, which prevents these abilities from learning from each other and collaborating effectively. In this paper, we propose a joint predictive agent model, named EgoAgent, that simultaneously learns to represent the world, predict future states, and take reasonable actions with a single transformer. EgoAgent unifies the representational spaces of the three abilities by mapping them all into a sequence of continuous tokens. Learnable query tokens are appended to obtain current states, future states, and next actions. With joint supervision, our agent model establishes the internal relationship among these three abilities and effectively mimics the human inference and learning processes. Comprehensive evaluations of EgoAgent covering image classification, egocentric future state prediction, and 3D human motion prediction tasks demonstrate the superiority of our method. The code and trained model will be released for reproducibility.


Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation

arXiv.org Artificial Intelligence

The rapid advancement of scientific progress requires innovative tools that can accelerate discovery. While recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short in replicating the collaborative nature of real-world scientific practices, where diverse teams of experts work together to tackle complex problems. To address the limitation, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VirSci), designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel and impactful scientific ideas, showing potential in aligning with key insights in the Science of Science field. Our findings suggest that integrating collaborative agents can lead to more innovative scientific outputs, offering a robust system for autonomous scientific discovery.


DiffFluid: Plain Diffusion Models are Effective Predictors of Flow Dynamics

arXiv.org Artificial Intelligence

We showcase the plain diffusion models with Transformers are effective predictors of fluid dynamics under various working conditions, e.g., Darcy flow and high Reynolds number. Unlike traditional fluid dynamical solvers that depend on complex architectures to extract intricate correlations and learn underlying physical states, our approach formulates the prediction of flow dynamics as the image translation problem and accordingly leverage the plain diffusion model to tackle the problem. This reduction in model design complexity does not compromise its ability to capture complex physical states and geometric features of fluid dynamical equations, leading to high-precision solutions. In preliminary tests on various fluid-related benchmarks, our DiffFluid achieves consistent state-of-the-art performance, particularly in solving the Navier-Stokes equations in fluid dynamics, with a relative precision improvement of +44.8%. In addition, we achieved relative improvements of +14.0% and +11.3% in the Darcy flow equation and the airfoil problem with Euler's equation, respectively. Code will be released at https://github.com/DongyuLUO/DiffFluid upon acceptance.


MLCM: Multistep Consistency Distillation of Latent Diffusion Model

arXiv.org Artificial Intelligence

Distilling large latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face a dilemma where they either (i) depend on multiple individual distilled models for different sampling budgets, or (ii) sacrifice generation quality with limited (e.g., 2-4) and/or moderate (e.g., 5-8) sampling steps. To address these, we extend the recent multistep consistency distillation (MCD) strategy to representative LDMs, establishing the Multistep Latent Consistency Models (MLCMs) approach for low-cost high-quality image synthesis. MLCM serves as a unified model for various sampling steps due to the promise of MCD. We further augment MCD with a progressive training strategy to strengthen inter-segment consistency to boost the quality of few-step generations. We take the states from the sampling trajectories of the teacher model as training data for MLCMs to lift the requirements for high-quality training datasets and to bridge the gap between the training and inference of the distilled model. MLCM is compatible with preference learning strategies for further improvement of visual quality and aesthetic appeal. Empirically, MLCM can generate high-quality, delightful images with only 2-8 sampling steps. On the MSCOCO-2017 5K benchmark, MLCM distilled from SDXL gets a CLIP Score of 33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 with only 4 steps, substantially surpassing 4-step LCM [23], 8-step SDXL-Lightning [17], and 8-step HyperSD [33]. We also demonstrate the versatility of MLCMs in applications including controllable generation, image style transfer, and Chinese-to-image generation.


DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

arXiv.org Artificial Intelligence

We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measure standards (e.g., overlaying rulers and compasses), and infer from the contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan for progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on MS COCO Novel class set for open-vocabulary detection, +24.23% Acc on RefCOCO val set for zero-shot referring expression comprehension, +14.5% AP on D-cube describe object detection FULL setting.


A Comprehensive Survey on 3D Content Generation

arXiv.org Artificial Intelligence

Recent years have witnessed remarkable advances in artificial intelligence generated content(AIGC), with diverse input modalities, e.g., text, image, video, audio and 3D. The 3D is the most close visual modality to real-world 3D environment and carries enormous knowledge. The 3D content generation shows both academic and practical values while also presenting formidable technical challenges. This review aims to consolidate developments within the burgeoning domain of 3D content generation. Specifically, a new taxonomy is proposed that categorizes existing approaches into three types: 3D native generative methods, 2D prior-based 3D generative methods, and hybrid 3D generative methods. The survey covers approximately 60 papers spanning the major techniques. Besides, we discuss limitations of current 3D content generation techniques, and point out open challenges as well as promising directions for future work. Accompanied with this survey, we have established a project website where the resources on 3D content generation research are provided. The project page is available at https://github.com/hitcslj/Awesome-AIGC-3D.


Hulk: A Universal Knowledge Translator for Human-Centric Tasks

arXiv.org Artificial Intelligence

Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks. The code will be available on https://github.com/OpenGVLab/HumanBench.