Xie, Yu
UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion
Zhang, Gongbo, Li, Yanting, Luo, Renqian, Hu, Pipi, Zhao, Zeru, Li, Lingbo, Liu, Guoqing, Wang, Zun, Bi, Ran, Gao, Kaiyuan, Guo, Liya, Xie, Yu, Liu, Chang, Zhang, Jia, Xie, Tian, Pinsler, Robert, Zeni, Claudio, Lu, Ziheng, Xia, Yingce, Segler, Marwin, Riechert, Maik, Yuan, Li, Chen, Lei, Liu, Haiguang, Qin, Tao
Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation.
Uncovering inequalities in new knowledge learning by large language models across different languages
Wang, Chenglong, Tang, Haoyu, Yang, Xiyuan, Xie, Yueqi, Suh, Jina, Sitaram, Sunayana, Huang, Junming, Xie, Yu, Gong, Zhaoya, Xie, Xing, Wu, Fangzhao
Existing research has primarily focused on static analyses that assess the disparities in the existing knowledge and capabilities of LLMs across languages. However, LLMs are continuously evolving, acquiring new knowledge to generate up-to-date, domain-specific responses. Investigating linguistic inequalities within this dynamic process is, therefore, also essential. In this paper, we explore inequalities in new knowledge learning by LLMs across different languages and four key dimensions: effectiveness, transferability, prioritization, and robustness. Through extensive experiments under two settings (in-context learning and fine-tuning) using both proprietary and open-source models, we demonstrate that low-resource languages consistently face disadvantages across all four dimensions. By shedding light on these disparities, we aim to raise awareness of linguistic inequities in LLMs' new knowledge learning, fostering the development of more inclusive and equitable future LLMs. This transformation is both inevitable and global in scale. One notable example is ChatGPT, which, as of December 2024, serves 300 million weekly active users worldwide (6, 7). Given such widespread adoption, it is crucial to study fairness in multilingual environments to ensure that users of different languages can benefit equally from these systems (9). Existing research on multilingual equality in LLMs primarily focuses on static analyses that evaluate disparities in the knowledge and capabilities of LLMs across different languages (10, 11, 12, 13, 14, 15, 16, 17). Some studies, for example, have examined the amount of factual knowledge encoded in different languages and revealed significant variations. In particular, they reveal that knowledge available in low-resource languages remains limited due to the lack of pre-training data in these languages (18, 19, 20). 
These studies have significantly advanced our understanding of the extent and nature of multilingual inequalities in LLMs' existing knowledge and capabilities. However, we still lack an understanding of inequalities in the process of acquiring new knowledge, an evolving perspective in research on LLMs. Learning new knowledge is crucial for LLMs, as illustrated in Figure 1a. On the one hand, general-purpose LLMs are pre-trained on static datasets that were collected prior to training and may not include real-time or recent information. As a result, these models do not possess new knowledge, and their knowledge base can quickly become outdated.
Variance reduction in output from generative AI
Xie, Yu, Xie, Yueqi
Generative AI models, such as ChatGPT, will increasingly replace humans in producing output for a variety of important tasks. While much prior work has mostly focused on the improvement in the average performance of generative AI models relative to humans' performance, much less attention has been paid to the significant reduction of variance in output produced by generative AI models. In this Perspective, we demonstrate that generative AI models are inherently prone to the phenomenon of "regression toward the mean" whereby variance in output tends to shrink relative to that in real-world distributions. We discuss potential social implications of this phenomenon across three levels (societal, group, and individual) and two dimensions (material and non-material). Finally, we discuss interventions to mitigate negative effects, considering the roles of both service providers and users. Overall, this Perspective aims to raise awareness of the importance of output variance in generative AI and to foster collaborative efforts to meet the challenges posed by the reduction of variance in output generated by AI models.
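The "regression toward the mean" mechanism described in this abstract can be illustrated with a minimal simulation (our own sketch, not taken from the Perspective): if each human output equals a task-specific mean plus idiosyncratic noise, a model trained to predict the expected output effectively returns the conditional mean and so reproduces only the between-task variance.

```python
import random
import statistics

random.seed(42)

# Each task i has a "true" expected response mu_i; individual human
# outputs add idiosyncratic variation eps around that mean.
mus = [random.gauss(0.0, 1.0) for _ in range(10_000)]
human_outputs = [mu + random.gauss(0.0, 1.0) for mu in mus]

# A model trained to minimize expected error approximates E[Y | task] = mu_i,
# discarding the noise term, so its outputs carry only between-task variance.
model_outputs = mus

print(statistics.stdev(human_outputs))  # ≈ sqrt(2): between-task + noise
print(statistics.stdev(model_outputs))  # ≈ 1: between-task only
```

The shrinkage factor here (sqrt(2) down to 1) is an artifact of the chosen noise scale; the qualitative point is only that the conditional-mean output always has weakly smaller variance than the full outcome distribution.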
InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models
Ding, Jing, Feng, Kai, Lin, Binbin, Cai, Jiarui, Wang, Qiushi, Xie, Yu, Zhang, Xiaojin, Wei, Zhongyu, Chen, Wei
The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering tasks. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at InsQABench.
Harnessing the Power of Vibration Motors to Develop Miniature Untethered Robotic Fishes
Jiang, Chongjie, Dai, Yingying, Le, Jinyang, Chen, Xiaomeng, Xie, Yu, Zhou, Wei, Niu, Fuzhou, Li, Ying, Luo, Tao
Miniature underwater robots play a crucial role in the exploration and development of marine resources, particularly in confined spaces and high-pressure deep-sea environments. This study presents the design, optimization, and performance of a miniature robotic fish, powered by the oscillation of bio-inspired fins. These fins feature a rigid-flexible hybrid structure and use an eccentric rotating mass (ERM) vibration motor as the excitation source to generate high-frequency unidirectional oscillations that induce acoustic streaming for propulsion. The drive mechanism, powered by miniature ERM vibration motors, eliminates the need for complex mechanical drive systems, enabling complete isolation of the entire drive system from the external environment and facilitating the miniaturization of the robotic fish. A compact, untethered robotic fish, measuring 85 × 60 × 45 mm^3, is equipped with three bio-inspired fins located at the pectoral and caudal positions. Experimental results demonstrate that the robotic fish achieves a maximum forward swimming speed of 1.36 body lengths (BL) per second when powered by all fins and a minimum turning radius of 0.6 BL when powered by a single fin. These results underscore the significance of employing the ERM vibration motor in advancing the development of highly maneuverable, miniature untethered underwater robots for various marine exploration tasks.
The Social Impact of Generative LLM-Based AI
Xie, Yu, Avila, Sofia
The research was partially supported by the Paul and Marcia Wythes Center on Contemporary China and Office of Population Research at Princeton University. We are grateful to Wen Liu, Gou Wu, and Dean Minello for their excellent research assistance. The ideas expressed herein are those of the authors. Abstract: Like it or not, ready or not, we are likely to enter a new phase of human history in which Artificial Intelligence (AI) will dominate economic production and social life: the AI Revolution. Before the actual arrival of the AI Revolution, it is time for us to speculate on how AI will impact the social world. In this article, we focus on the social impact of generative LLM-based AI (GELLMAI), discussing societal factors that contribute to its technological development and its potential roles in enhancing both between-country and within-country social inequality. There are good indications that the US and China will lead the field and will be the main competitors for domination of AI in the world. We conjecture the AI Revolution will likely give rise to a post-knowledge society in which knowledge per se will become less important than in today's world. Instead, individual relationships and social identity will become more important. With the advent of Generative Large Language Model (LLM)-based Artificial Intelligence (AI) tools such as ChatGPT from OpenAI and Bard from Google, it is natural to wonder about the social impact of this technology. In the remainder of this paper, we will refer to generative LLM-based AI simply as GELLMAI. The main objective of this paper is to explore, tentatively, the social impact of GELLMAI. While the question about the social impact of GELLMAI is undoubtedly important, any answers must be tentative and speculative at this point. We are still in the early stages of GELLMAI and may need to wait years, perhaps even decades, to fully understand its social implications.
However, drawing from our experiences with past technologies in history, our current understanding of GELLMAI, empirical knowledge about the social world, and sociological reasoning, we can engage in preliminary and speculative discussions. We offer our account below. We believe that the social impact of GELLMAI is enormous, with the potential not only to revolutionize the production of goods and services but also to fundamentally alter the organization of human societies and the nature of daily life.
Digital Twin Vehicular Edge Computing Network: Task Offloading and Resource Allocation
Xie, Yu, Wu, Qiong, Fan, Pingyi
With the increasing demand for multiple applications on the Internet of Vehicles, vehicles are required to carry out multiple computing tasks in real time. However, due to the insufficient computing capability of vehicles themselves, offloading tasks to vehicular edge computing (VEC) servers and allocating computing resources to tasks becomes a challenge. In this paper, a multi-task digital twin (DT) VEC network is established. By using DT to develop offloading and resource allocation strategies for the multiple tasks of each vehicle in a single slot, an optimization problem is constructed. To solve it, we propose a multi-agent reinforcement learning method for task offloading and resource allocation. Extensive experiments demonstrate that our method is effective compared to other benchmark algorithms.
Resource Allocation for Twin Maintenance and Computing Task Processing in Digital Twin Vehicular Edge Computing Network
Xie, Yu, Wu, Qiong, Fan, Pingyi, Cheng, Nan, Chen, Wen, Wang, Jiangzhou, Letaief, Khaled B.
As a promising technology, vehicular edge computing (VEC) can provide computing and caching services by deploying VEC servers near vehicles. However, VEC networks still face challenges such as high vehicle mobility. Digital twin (DT), an emerging technology, can predict, estimate, and analyze real-time states by digitally modeling objects in the physical world. By integrating DT with VEC, a virtual vehicle DT can be created in the VEC server to monitor the real-time operating status of vehicles. However, maintaining the vehicle DT model requires ongoing attention from the VEC server, which also needs to offer computing services for the vehicles. Therefore, effective allocation and scheduling of VEC server resources are crucial. This study focuses on a general VEC network with a single VEC server and multiple vehicles, examining the two types of delays caused by twin maintenance and computational processing within the network. By transforming the problem using satisfaction functions, we propose an optimization problem aimed at maximizing each vehicle's resource utility to determine the optimal resource allocation strategy. Given the non-convex nature of the problem, we employ multi-agent Markov decision processes to reformulate it. Subsequently, we propose the twin maintenance and computing task processing resource collaborative scheduling (MADRL-CSTC) algorithm, which leverages multi-agent deep reinforcement learning. Experimental comparisons with alternative algorithms demonstrate that our proposed approach is effective in terms of resource allocation.
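The abstract does not specify its satisfaction functions or the MADRL-CSTC algorithm in detail, but the general idea of utility-based resource allocation it builds on can be sketched with a toy example (our own illustration, not the paper's method): splitting a server's compute budget across vehicles to maximize a sum of weighted logarithmic utilities, which has a simple proportional closed-form optimum.

```python
# Toy utility-based allocation (hypothetical illustration, not MADRL-CSTC):
# maximize sum_i w_i * log(f_i) subject to sum_i f_i = total.
# Lagrangian stationarity gives w_i / f_i = lambda, so f_i = total * w_i / sum(w).
def allocate(weights, total):
    """Proportionally allocate `total` compute across vehicles by weight."""
    s = sum(weights)
    return [total * w / s for w in weights]

# Three vehicles sharing 8 compute units; the middle one is weighted double.
alloc = allocate([1.0, 2.0, 1.0], 8.0)
print(alloc)  # [2.0, 4.0, 2.0]
```

Logarithmic utilities are a common stand-in for "satisfaction" because of diminishing returns; the paper instead reformulates its (non-convex) version as a multi-agent Markov decision process and solves it with deep reinforcement learning.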
VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models
Liu, Yu, Gao, Lang, Yang, Mingxin, Xie, Yu, Chen, Ping, Zhang, Xiaojin, Chen, Wei
Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the model's code comprehension and generation capabilities. However, sound comprehensive research on detecting program vulnerabilities, a more specific task related to code, and evaluating the performance of LLMs in this more specialized scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLM's ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on specific, more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.
Physics-Informed Statistical Modeling for Wildfire Aerosols Process Using Multi-Source Geostationary Satellite Remote-Sensing Data Streams
Wei, Guanzhou, Krishnan, Venkat, Xie, Yu, Sengupta, Manajit, Zhang, Yingchen, Liao, Haitao, Liu, Xiao
Increasingly frequent wildfires significantly affect solar energy production as the atmospheric aerosols generated by wildfires diminish the incoming solar radiation to the earth. Atmospheric aerosols are measured by Aerosol Optical Depth (AOD), and AOD data streams can be retrieved and monitored by geostationary satellites. However, multi-source remote-sensing data streams often present heterogeneous characteristics, including different data missing rates, measurement errors, systematic biases, and so on. To accurately estimate and predict the underlying AOD propagation process, there are both practical needs and theoretical interests in a physics-informed statistical approach that models wildfire AOD propagation by simultaneously utilizing, or fusing, multi-source heterogeneous satellite remote-sensing data streams. Leveraging a spectral approach, the proposed approach integrates multi-source satellite data streams with a fundamental advection-diffusion equation that governs the AOD propagation process. A bias correction process is included in the statistical model to account for the bias of the physics model and the truncation error of the Fourier series. The proposed approach is applied to California wildfire AOD data streams obtained from the National Oceanic and Atmospheric Administration. Comprehensive numerical examples are provided to demonstrate the predictive capabilities and model interpretability of the proposed approach. Computer code has been made available on GitHub.
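The abstract references a fundamental advection-diffusion equation governing AOD propagation but does not print it; a standard form for a scalar concentration field $c(\mathbf{s}, t)$, assuming a wind velocity field $\mathbf{v}$, diffusivity $D$, and source term $S$ (the specific notation is ours, not the paper's), is:

```latex
\frac{\partial c}{\partial t} + \mathbf{v} \cdot \nabla c
  = \nabla \cdot \left( D \, \nabla c \right) + S(\mathbf{s}, t)
```

A spectral approach of the kind the abstract mentions typically expands $c$ in a truncated Fourier basis, converting this PDE into a system of ordinary differential equations for the basis coefficients; the truncation error of that Fourier series is one of the terms the paper's bias correction is described as absorbing.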