

Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective

Zhang, Yang, Li, Xinran, Ye, Jianing, Qiu, Shuang, Qu, Delin, Li, Xiu, Zhang, Chongjie, Bai, Chenjia

arXiv.org Artificial Intelligence

World models have recently attracted growing interest in Multi-Agent Reinforcement Learning (MARL) due to their ability to improve sample efficiency for policy learning. However, accurately modeling environments in MARL is challenging because of the exponentially large joint action space and the highly uncertain dynamics inherent in multi-agent systems. To address this, we reduce modeling complexity by shifting from jointly modeling the entire state-action transition dynamics to focusing on the state space alone at each timestep through sequential agent modeling. Specifically, our approach enables the model to progressively resolve uncertainty while capturing the structured dependencies among agents, providing a more accurate representation of how agents influence the state. Interestingly, this sequential revelation of agents' actions in a multi-agent system aligns with the reverse process in diffusion models, a class of powerful generative models known for their expressiveness and training stability compared to autoregressive or latent variable models. Leveraging this insight, we develop a flexible and robust world model for MARL using diffusion models. Our method, the Diffusion-Inspired Multi-Agent world model (DIMA), achieves state-of-the-art performance across multiple multi-agent control benchmarks, including MAMuJoCo and Bi-DexHands, significantly outperforming prior world models in final return and sample efficiency. DIMA establishes a new paradigm for constructing multi-agent world models, advancing the frontier of MARL research. Code is open-sourced at https://github.com/breez3young/DIMA.
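
To make the core idea concrete, the following is a minimal sketch, under our own assumptions rather than the authors' released code, of a diffusion-style reverse process in which the next state is denoised step by step while one agent's action is revealed at each step. All class and function names here (AgentConditionedDenoiser, rollout_next_state) are hypothetical placeholders.

import torch
import torch.nn as nn

class AgentConditionedDenoiser(nn.Module):
    # Predicts the noise in a noisy next-state estimate, conditioned on the
    # current state, the actions revealed so far, and the denoising step index.
    def __init__(self, state_dim, action_dim, n_agents, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * 2 + action_dim * n_agents + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, noisy_next_state, state, revealed_actions, step):
        step_feat = torch.full((state.shape[0], 1), float(step))
        x = torch.cat([noisy_next_state, state, revealed_actions, step_feat], dim=-1)
        return self.net(x)

@torch.no_grad()
def rollout_next_state(model, state, actions, n_agents, action_dim):
    # Reverse process: start from Gaussian noise and apply one denoising update
    # per agent, revealing that agent's action at its step (a crude DDPM-like loop).
    x = torch.randn_like(state)
    revealed = torch.zeros(state.shape[0], n_agents * action_dim)
    for i in range(n_agents):
        revealed[:, i * action_dim:(i + 1) * action_dim] = actions[:, i]
        eps = model(x, state, revealed, step=n_agents - i)
        x = x - eps / n_agents
    return x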


DiMA: An LLM-Powered Ride-Hailing Assistant at DiDi

Ning, Yansong, Cai, Shuowei, Li, Wei, Fang, Jun, Tan, Naiqiang, Chai, Hua, Liu, Hao

arXiv.org Artificial Intelligence

On-demand ride-hailing services like DiDi, Uber, and Lyft have transformed urban transportation, offering unmatched convenience and flexibility. In this paper, we introduce DiMA, an LLM-powered ride-hailing assistant deployed at DiDi Chuxing. Its goal is to provide seamless ride-hailing services and beyond through a natural and efficient conversational interface under dynamic and complex spatiotemporal urban contexts. To achieve this, we propose a spatiotemporal-aware order planning module that leverages external tools for precise spatiotemporal reasoning and progressive order planning. Additionally, we develop a cost-effective dialogue system that integrates multi-type dialog repliers with cost-aware LLM configurations to handle diverse conversation goals and trade off response quality against latency. Furthermore, we introduce a continual fine-tuning scheme that utilizes real-world interactions and simulated dialogues to align the assistant's behavior with human-preferred decision-making processes. Since its deployment in the DiDi application, DiMA has demonstrated exceptional performance, achieving 93% accuracy in order planning and 92% in response generation during real-world interactions. Offline experiments further validate DiMA's capabilities, showing improvements of up to 70.23% in order planning and 321.27% in response generation compared to three state-of-the-art agent frameworks, while reducing latency by $0.72\times$ to $5.47\times$. These results establish DiMA as an effective, efficient, and intelligent mobile assistant for ride-hailing services.
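
The cost-aware replier design can be illustrated with a small sketch: route each conversation turn to a replier with a different quality/latency/cost profile depending on the conversation goal. This is only an illustrative example under assumed names and thresholds, not DiDi's deployed configuration.

from dataclasses import dataclass

@dataclass
class Replier:
    name: str
    est_latency_ms: int
    est_cost_usd: float

REPLIERS = {
    "template": Replier("template", est_latency_ms=20, est_cost_usd=0.0),
    "small_llm": Replier("small_llm", est_latency_ms=300, est_cost_usd=0.001),
    "large_llm": Replier("large_llm", est_latency_ms=1500, est_cost_usd=0.01),
}

def route(goal: str) -> Replier:
    # Cheap templates for routine confirmations, a small model for slot filling,
    # and a large model only for open-ended planning or clarification turns.
    if goal in {"confirm_order", "cancel_order"}:
        return REPLIERS["template"]
    if goal in {"fill_pickup", "fill_destination"}:
        return REPLIERS["small_llm"]
    return REPLIERS["large_llm"]

print(route("confirm_order").name)  # -> template

In practice the routing signal would come from an intent classifier or the order-planning module, and the latency and cost figures would be measured rather than hard-coded.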


Distilling Multi-modal Large Language Models for Autonomous Driving

Hegde, Deepti, Yasarla, Rajeev, Cai, Hong, Han, Shizhong, Bhattacharyya, Apratim, Mahajan, Shweta, Liu, Litian, Garrepalli, Risheek, Patel, Vishal M., Porikli, Fatih

arXiv.org Artificial Intelligence

Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
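
The distillation recipe can be sketched as a joint training step that combines the planner's trajectory loss with a feature-alignment loss toward the multi-modal LLM branch; at inference only the vision branch is kept. The module names and the simple MSE alignment below are assumptions for illustration, not the paper's exact surrogate tasks.

import torch
import torch.nn.functional as F

def joint_training_step(scene_encoder, planner, llm_branch, batch, optimizer, alpha=0.5):
    # Shared scene encoder feeds the vision-based planner.
    feats = scene_encoder(batch["camera_images"])
    pred_traj = planner(feats)
    plan_loss = F.mse_loss(pred_traj, batch["expert_traj"])      # L2 trajectory loss

    # Distillation target from the multi-modal LLM branch (training-time only).
    with torch.no_grad():
        llm_feats = llm_branch(batch["camera_images"], batch["text_prompt"])
    distill_loss = F.mse_loss(feats, llm_feats)                  # align scene features

    loss = plan_loss + alpha * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Because the LLM branch appears only inside the training step, the deployed planner runs with the scene encoder and planner heads alone, which is what keeps inference cost at the level of an LLM-free system.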


When Generative AI Meets Workplace Learning: Creating A Realistic & Motivating Learning Experience With A Generative PCA

Bucher, Andreas, Schenk, Birgit, Dolata, Mateusz, Schwabe, Gerhard

arXiv.org Artificial Intelligence

Workplace learning is used to train employees systematically, e.g., via e-learning or in 1:1 training. However, this is often deemed ineffective and costly. Whereas pure e-learning lacks the possibility of conversational exercise and personal contact, 1:1 training with human instructors involves high personnel and organizational costs. Hence, pedagogical conversational agents (PCAs) based on generative AI seem able to compensate for the disadvantages of both forms. Following Action Design Research, this paper describes an organizational communication training with a Generative PCA (GenPCA). The evaluation shows promising results: the agent was perceived positively by employees and contributed to an improvement in self-determined learning. However, the integration of such an agent is not without limitations. We conclude with suggestions concerning the didactic methods that a GenPCA can support, and possible improvements of such an agent for workplace learning.


Diffusion on language model embeddings for protein sequence generation

Meshchaninov, Viacheslav, Strashnov, Pavel, Shevtsov, Andrey, Nikolaev, Fedor, Ivanisenko, Nikita, Kardymon, Olga, Vetrov, Dmitry

arXiv.org Artificial Intelligence

Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived from the protein language model, ESM-2, to generate amino acid sequences. DiMA surpasses leading solutions, including autoregressive transformer-based and discrete diffusion models, and we quantitatively illustrate the impact of the design choices that lead to its superior performance. We extensively evaluate the quality, diversity, distribution similarity, and biological relevance of the generated sequences using multiple metrics across various modalities. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. This work advances the field of protein design and sets the stage for conditional models by providing a robust framework for scalable and high-quality protein sequence generation.
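
A rough sketch of the general recipe, under assumed shapes and a simplified noise schedule rather than the authors' released code: train a denoiser on Gaussian-noised per-residue embeddings (e.g., produced by ESM-2), with embedding extraction and decoding back to amino acids left as placeholders.

import torch
import torch.nn as nn

class EmbeddingDenoiser(nn.Module):
    def __init__(self, emb_dim=320, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + 1, hidden), nn.GELU(), nn.Linear(hidden, emb_dim)
        )

    def forward(self, noisy_emb, t):
        # t in [0, 1], one noise level per sequence, appended as an extra feature.
        t_feat = t.view(-1, 1, 1).expand(noisy_emb.shape[0], noisy_emb.shape[1], 1)
        return self.net(torch.cat([noisy_emb, t_feat], dim=-1))

def diffusion_training_step(denoiser, clean_emb, optimizer):
    # clean_emb: (batch, seq_len, emb_dim) protein-LM embeddings of real sequences.
    t = torch.rand(clean_emb.shape[0])
    alpha = (1.0 - t).view(-1, 1, 1)                   # toy linear noise schedule
    noise = torch.randn_like(clean_emb)
    noisy = alpha.sqrt() * clean_emb + (1 - alpha).sqrt() * noise
    pred = denoiser(noisy, t)                          # x0-prediction objective
    loss = nn.functional.mse_loss(pred, clean_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Generation would then run the reverse process from pure noise to a clean embedding and map it back to an amino acid sequence with a learned decoder.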


How AI can make the metaverse a more interactive space

#artificialintelligence

The potential behind the metaverse is becoming greater as virtual and physical worlds converge. Market intelligence firm Contrive Datum Insights recently found that the global metaverse market is estimated to surpass $1.3 trillion by 2030. According to the study, this growth will be driven by newly adopted virtual economy trends, combined with the rise of both crypto and online games. Additionally, a recent survey conducted by CoinWire highlighted that the metaverse would likely reshape social lifestyles. CoinWire found that 69% of respondents believe that the metaverse will eventually modify social lifestyles due to new approaches taken for entertainment and activities. Hackl elaborated that technologies such as volumetric video -- a technique that offers a more immersive experience by capturing three-dimensional spaces -- will likely change how individuals communicate.


He Said, She Said: Style Transfer for Shifting the Perspective of Dialogues

Bertsch, Amanda, Neubig, Graham, Gormley, Matthew R.

arXiv.org Artificial Intelligence

In this work, we define a new style transfer task: perspective shift, which rewrites an informal first-person dialogue as a formal third-person account of the conversation. This task requires challenging coreference resolution, emotion attribution, and interpretation of informal text. We explore several baseline approaches and discuss further directions for this task when applied to short dialogues. As a sample application, we demonstrate that applying perspective shifting to a dialogue summarization dataset (SAMSum) substantially improves the zero-shot performance of extractive news summarization models on this data. Additionally, supervised extractive models perform better when trained on perspective-shifted data than on the original dialogues. We release our code publicly.


Can Smart Earbuds Instantly Translate Foreign Speech?

WSJ.com: WSJD - Technology

STEPPING OFF THE PLANE in Russia for the first time in 2013, I collided with a wall of blunt language and was intrigued beyond repair. Five years, countless classes and ten visits to Moscow later, I still claim a distinctly below-average capacity for the Russian tongue and its dense, foreboding components. To fill these gaps ahead of my next adventure abroad, I turned to technology. Late last year, Brooklyn's Waverly Labs released the Pilot ($299, waverlylabs.com). These eavesdropping devices use cloud-based machine-learning technology to pipe dozens of different languages into your brain in your mother tongue.