
Collaborating Authors

 Xu, Gang


LongCat-Flash-Omni Technical Report

Meituan LongCat Team, Wang, Bairui, Bayan, Xiao, Bin, Zhang, Bo, Rong, Bolin, Chen, Borun, Wan, Chang, Zhang, Chao, Huang, Chen, Chen, Chen, Chen, Chen, Yang, Chengxu, Yang, Chengzuo, Han, Cong, Peng, Dandan, Ruan, Delian, Xin, Detai, Wang, Disong, Yang, Dongchao, Liu, Fanfan, Chen, Fengjiao, Yang, Fengyu, Dong, Gan, Huang, Gang, Xu, Gang, Wan, Guanglu, Tan, Guoqiang, Yu, Guoqiao, Qiu, Haibo, Lu, Hao, Liu, Hongbo, Xiang, Hongyu, Wu, Jiaheng, Yang, Jian, Liu, Jiaxing, Huang, Jing, Wang, Jingang, Ding, Jinrui, Jiang, Juchao, Kuang, Jun, Wang, Jun, Mei, Junhui, Ding, Ke, Zhang, Kefeng, Chen, Lei, Shi, Liang, Qiao, Limeng, Zheng, Liming, Ma, Lin, Guo, Liuyang, Ma, Liya, Sun, Luying, Gao, Man, Zhu, Mengshen, Cao, Miao, Lin, Minliang, Xu, Nuo, Shi, Peng, Zhang, Qi, Fang, Qian, Wang, Qian, Yang, Qian, Wang, Quanxiu, Weng, Rongxiang, Guo, Rongxin, Liang, Ruoxuan, Yang, Senbin, Xu, Shanbo, Lei, Shanglin, Ye, Shengze, Chen, Shimin, Chen, Shuaiqi, Hu, Shujie, Li, Shuo, Yang, Siqi, Xu, Siyu, Ren, Siyu, Li, Song, Liu, Songxiang, Bai, Tianhao, Dai, Tianye, Hong, Wei, Wang, Wei, Zhao, Weixiao, Cao, Wengang, Zhu, Wenlong, He, Wenlong, Su, Xi, Nan, Xi, Zhao, Xiaohan, Wang, Xiaohao, Zhao, Xiaoyu, Wang, Xiaoyu, Li, Xiaoyu, Pan, Xin, Chen, Xin, Sun, Xiusong, Xiang, Xu, Xing, Xudong, Cao, Xuezhi, Cai, Xunliang, Yang, Yang, Tan, Yanli, Yao, Yao, Sun, Yerui, Chen, Yi, Lu, Yifan, Gong, Yin, Zhang, Yining, Chen, Yitian, Gan, Yiyang, Tang, Yuchen, Xie, Yuchen, Wang, Yueqian, Zheng, Yuewen, Zhang, Yufei, Zhong, Yufeng, Qian, Yulei, Peng, Yuqi, Li, Yuqian, Jiang, Yuwei, Hu, Zeyang, Zhang, Zheng, Tian, Zhengkun, Hong, Zhiqing, Zeng, Zhixiong, Mi, Zhuqi, Li, Ziran, Wang, Ziwen, Zhao, Ziyi, Zhuang, Ziyuan, Zhao, Zizhe

arXiv.org Artificial Intelligence

We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
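
To make the curriculum-inspired progressive strategy concrete, here is a minimal sketch of staged modality mixing, assuming hypothetical stage names and mixing ratios (the report does not publish these); it only illustrates how a data sampler could move from text-only training to the full omni-modal mixture.

```python
# Illustrative sketch (not the authors' code): a curriculum-style schedule that
# progressively adds modalities to the training mixture, as the abstract describes
# ("simpler to increasingly complex modality sequence modeling tasks").
# Stage names and mixing ratios below are hypothetical placeholders.
import random

STAGES = [
    {"name": "text_only",        "mix": {"text": 1.0}},
    {"name": "text_audio",       "mix": {"text": 0.6, "audio": 0.4}},
    {"name": "text_audio_image", "mix": {"text": 0.4, "audio": 0.3, "image": 0.3}},
    {"name": "omni",             "mix": {"text": 0.3, "audio": 0.2, "image": 0.2, "video": 0.3}},
]

def sample_modality(stage_index: int, rng: random.Random) -> str:
    """Pick the modality of the next training example under the current stage's mixture."""
    mix = STAGES[stage_index]["mix"]
    modalities, weights = zip(*mix.items())
    return rng.choices(modalities, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {}
    for _ in range(1000):
        m = sample_modality(3, rng)
        counts[m] = counts.get(m, 0) + 1
    print(counts)  # roughly follows the final-stage mixture
```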


Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Lu, Keyang, Zhou, Sifan, Xu, Hongbin, Xu, Gang, Yang, Zhifei, Wang, Yikai, Xiao, Zhen, Long, Jieyi, Li, Ming

arXiv.org Artificial Intelligence

Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.
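
The hierarchy and the grid-level loop can be pictured with a short sketch; all class and function names (generate, refine, evaluate, the acceptance threshold) are hypothetical stand-ins for the off-the-shelf models the abstract mentions, not the authors' implementation.

```python
# Minimal sketch of the hierarchical "City-District-Grid" structure and the
# "produce-refine-evaluate" isometric image synthesis loop described above.
from dataclasses import dataclass, field

@dataclass
class Grid:
    description: str           # grid-level text from the Local Designer
    asset: object = None       # eventually a 3D asset produced from the accepted image

@dataclass
class District:
    function: str              # e.g. "residential", "commercial" (hypothetical labels)
    grids: list[Grid] = field(default_factory=list)

@dataclass
class City:
    layout: str                # overall layout from the Global Planner
    districts: list[District] = field(default_factory=list)

def produce_refine_evaluate(grid: Grid, generate, refine, evaluate,
                            max_rounds: int = 3, threshold: float = 0.8):
    """Iteratively synthesize an isometric image for one grid until the critic accepts it."""
    image = generate(grid.description)
    for _ in range(max_rounds):
        score, feedback = evaluate(image, grid.description)
        if score >= threshold:
            break
        image = refine(image, feedback)
    return image  # then handed to an image-to-3D generator
```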


Introducing LongCat-Flash-Thinking: A Technical Report

Meituan LongCat Team, Gui, Anchun, Li, Bei, Tao, Bingyang, Zhou, Bole, Chen, Borun, Zhang, Chao, Zhang, Chao, Han, Chengcheng, Yang, Chenhui, Zhang, Chi, Peng, Chong, Zhang, Chuyu, Chen, Cong, Li, Fengcun, Xu, Gang, Lin, Guoyuan, Jiang, Hao, Liang, Hao, Fu, Haomin, Ma, Haoxiang, Liu, Hong, Hao, Hongyan, Tang, Hongyin, Zang, Hongyu, Ni, Hongzhi, Su, Hui, Liu, Jiahao, Li, Jiahuan, Liu, Jialin, Zhang, Jianfei, Xu, Jianhao, Wang, Jianing, Sun, Jiaqi, Zhang, Jiaqi, Shi, Jiarong, Yang, Jiawei, Wang, Jingang, Ding, Jinrui, Kuang, Jun, Xu, Jun, He, Ke, Zhang, Kefeng, Wang, Keheng, He, Keqing, Wei, Li, Shi, Liang, Qiu, Lin, Kong, Lingbin, Liu, Lingchuan, Guo, Linsen, An, Longfei, Xia, Mai, Zhou, Meng, Zhu, Mengshen, Pei, Peng, Jia, Pengcheng, Gu, Qi, Guo, Qi, Huang, Qiong, Chen, Quan, Weng, Quanchi, Weng, Rongxiang, Shao, Ruichen, Li, Rumei, Lei, Shanglin, Du, Shuai, Liu, Shuaikang, Zhou, Shuang, Hu, Shuhao, Xu, Siyu, Gong, Songshan, Liang, Tao, Hu, Tianhao, He, Wei, Shi, Wei, Wang, Wei, Wu, Wei, Zhuo, Wei, Tang, Weifeng, Shi, Wenjie, Zhu, Wenlong, Su, Xi, Liu, Xiangcheng, Xi, Xiangyu, Huang, Xiangzhou, Liu, Xiao, Jiang, Xiaochen, Shi, Xiaowei, Shi, Xiaowen, Li, Xiaoyu, Chen, Xin, Zhao, Xinyue, Huang, Xuan, Zhang, Xuemiao, Cao, Xuezhi, Cai, Xunliang, Zhang, Yajie, Chen, Yang, Liu, Yang, Liu, Yang, Zheng, Yang, Wang, Yaoming, Huo, Yaqi, Sun, Yerui, Lu, Yifan, Li, Yiyang, Xiao, Youshao, Lei, Yuanzhe, Xie, Yuchen, Sun, Yueqing, Zhang, Yufei, Wei, Yuhuai, Qian, Yulei, Zhao, Yunke, Ding, Yuqing, Jiang, Yuwei, Yang, Zhaohua, Chen, Zhengyu, Liu, Zhijian, Xia, Zhikang, Su, Zhongda, Li, Ziran, Wang, Ziwen, Zhuang, Ziyuan, Wang, Zongyu, Yang, Zunyuan

arXiv.org Artificial Intelligence

We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19,653 to 6,965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
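
As a quick sanity check, the quoted token counts do correspond to roughly the reported reduction:

```python
# Verify the AIME-25 token-reduction figure quoted in the abstract.
before, after = 19_653, 6_965
reduction = (before - after) / before
print(f"reduction = {reduction:.2%}")  # 64.56%, i.e. the ~64.5% reported
```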


LongCat-Flash Technical Report

Meituan LongCat Team, Bayan, Li, Bei, Lei, Bingye, Wang, Bo, Rong, Bolin, Wang, Chao, Zhang, Chao, Gao, Chen, Zhang, Chen, Sun, Cheng, Han, Chengcheng, Xi, Chenguang, Zhang, Chi, Peng, Chong, Qin, Chuan, Zhang, Chuyu, Chen, Cong, Wang, Congkui, Ma, Dan, Pan, Daoru, Bu, Defei, Zhao, Dengchang, Kong, Deyang, Liu, Dishan, Huo, Feiye, Li, Fengcun, Zhang, Fubao, Dong, Gan, Liu, Gang, Xu, Gang, Li, Ge, Tan, Guoqiang, Lin, Guoyuan, Jing, Haihang, Fu, Haomin, Yan, Haonan, Wen, Haoxing, Zhao, Haozhe, Liu, Hong, Shi, Hongmei, Hao, Hongyan, Tang, Hongyin, Lv, Huantian, Su, Hui, Li, Jiacheng, Liu, Jiahao, Li, Jiahuan, Yang, Jiajun, Wang, Jiaming, Yang, Jian, Tan, Jianchao, Sun, Jiaqi, Zhang, Jiaqi, Fu, Jiawei, Yang, Jiawei, Hu, Jiaxi, Qin, Jiayu, Wang, Jingang, He, Jiyuan, Kuang, Jun, Mei, Junhui, Liang, Kai, He, Ke, Zhang, Kefeng, Wang, Keheng, He, Keqing, Gao, Liang, Shi, Liang, Ma, Lianhui, Qiu, Lin, Kong, Lingbin, Si, Lingtong, Lyu, Linkun, Guo, Linsen, Yang, Liqi, Yan, Lizhi, Xia, Mai, Gao, Man, Zhang, Manyuan, Zhou, Meng, Shen, Mengxia, Tuo, Mingxiang, Zhu, Mingyang, Li, Peiguang, Pei, Peng, Zhao, Peng, Jia, Pengcheng, Sun, Pingwei, Gu, Qi, Li, Qianyun, Li, Qingyuan, Huang, Qiong, Duan, Qiyuan, Meng, Ran, Weng, Rongxiang, Shao, Ruichen, Li, Rumei, Wu, Shizhe, Liang, Shuai, Wang, Shuo, Dang, Suogui, Fang, Tao, Li, Tao, Chen, Tefeng, Bai, Tianhao, Zhou, Tianhao, Xie, Tingwen, He, Wei, Huang, Wei, Liu, Wei, Shi, Wei, Wang, Wei, Wu, Wei, Zhao, Weikang, Zan, Wen, Shi, Wenjie, Nan, Xi, Su, Xi, Li, Xiang, Mei, Xiang, Ji, Xiangyang, Xi, Xiangyu, Huang, Xiangzhou, Li, Xianpeng, Fu, Xiao, Liu, Xiao, Wei, Xiao, Cai, Xiaodong, Chen, Xiaolong, Liu, Xiaoqing, Li, Xiaotong, Shi, Xiaowei, Li, Xiaoyu, Wang, Xili, Chen, Xin, Hu, Xing, Miao, Xingyu, He, Xinyan, Zhang, Xuemiao, Hao, Xueyuan, Cao, Xuezhi, Cai, Xunliang, Yang, Xurui, Feng, Yan, Bai, Yang, Chen, Yang, Yang, Yang, Huo, Yaqi, Sun, Yerui, Lu, Yifan, Zhang, Yifan, Zang, Yipeng, Zhai, Yitao, Li, Yiyang, Yin, Yongjing, Lv, Yongkang, Zhou, Yongwei, Yang, Yu, Xie, Yuchen, Sun, Yueqing, Zheng, Yuewen, Wei, Yuhuai, Qian, Yulei, Liang, Yunfan, Tai, Yunfang, Zhao, Yunke, Yu, Zeyang, Zhang, Zhao, Yang, Zhaohua, Zhang, Zhenchao, Xia, Zhikang, Zou, Zhiye, Zeng, Zhizhao, Su, Zhongda, Chen, Zhuofan, Zhang, Zijian, Wang, Ziwen, Jiang, Zixu, Zhao, Zizhe, Wang, Zongyu, Su, Zunhai

arXiv.org Artificial Intelligence

We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) parameters per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy between scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat
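
A minimal sketch of the zero-computation-expert idea, under stated assumptions (hypothetical expert counts and a plain top-k router; not the released implementation): identity experts carry no parameters, so the number of parameterized experts, and hence the activated parameter count, varies per token.

```python
# Illustrative sketch of how "zero-computation experts" let activated compute vary per token:
# the router picks top-k experts as usual, but some expert slots are identity functions that
# add no FLOPs, so tokens routed there consume fewer activated parameters.
import numpy as np

rng = np.random.default_rng(0)
NUM_FFN_EXPERTS = 8       # experts with real parameters (hypothetical count)
NUM_ZERO_EXPERTS = 4      # zero-computation (identity) experts (hypothetical count)
TOP_K = 2

def route(token_logits: np.ndarray) -> list[int]:
    """Return the indices of the top-k experts for one token."""
    return list(np.argsort(token_logits)[-TOP_K:])

def parameterized_experts_used(expert_ids: list[int]) -> int:
    """Only indices below NUM_FFN_EXPERTS carry parameters; the rest are identity experts."""
    return sum(1 for e in expert_ids if e < NUM_FFN_EXPERTS)

tokens = rng.normal(size=(5, NUM_FFN_EXPERTS + NUM_ZERO_EXPERTS))  # fake router logits
for t, logits in enumerate(tokens):
    chosen = route(logits)
    print(f"token {t}: experts {chosen}, parameterized experts used = {parameterized_experts_used(chosen)}")
```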


TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

Wu, Zhicong, Xu, Hongbin, Xu, Gang, Nie, Ping, Yan, Zhixin, Zheng, Jinkai, Qu, Liangqiong, Li, Ming, Nie, Liqiang

arXiv.org Artificial Intelligence

Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat--the first text-driven Generalizable Gaussian Splatting framework. By employing a text-guided fusion of diverse semantic cues, our framework learns robust cross-modal feature representations that improve the alignment of geometric and semantic information, producing high-fidelity 3D reconstructions. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.
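
One plausible reading of the text-guided, attention-based aggregation is a text embedding attending over the three cue representations; the sketch below assumes single-head dot-product attention and illustrative shapes, which may differ from the paper's actual module.

```python
# Minimal sketch, under stated assumptions, of text-guided attention fusion: a text embedding
# scores the three per-point feature maps (depth prior, segmentation, multi-view interaction)
# and their softmax weights determine the fused representation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_fusion(text_emb, depth_feat, sem_feat, view_feat):
    """text_emb: (C,); each feature map: (N, C) for N pixels/Gaussians."""
    cues = np.stack([depth_feat, sem_feat, view_feat], axis=1)   # (N, 3, C)
    scores = cues @ text_emb / np.sqrt(text_emb.shape[0])        # (N, 3) relevance to the text
    weights = softmax(scores, axis=1)                            # attention over the 3 cues
    return (weights[..., None] * cues).sum(axis=1)               # (N, C) fused features

rng = np.random.default_rng(1)
N, C = 4, 16
fused = text_guided_fusion(rng.normal(size=C), *(rng.normal(size=(N, C)) for _ in range(3)))
print(fused.shape)  # (4, 16)
```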


StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback

Ma, Hongbo, Shen, Fei, Xu, Hongbin, Wang, Xiaoce, Xu, Gang, Zheng, Jinkai, Qu, Liangqiong, Li, Ming

arXiv.org Artificial Intelligence

The advancement of intelligent agents has revolutionized problem-solving across diverse domains, yet solutions for personalized fashion styling, an area that holds immense promise for enhancing the shopping experience, remain underexplored. In this work, we present StyleTailor, the first collaborative agent framework that seamlessly unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation into a cohesive workflow. To this end, StyleTailor pioneers an iterative visual refinement paradigm driven by multi-level negative feedback, enabling adaptive and precise user alignment. Specifically, our framework features two core agents, i.e., Designer for personalized garment selection and Consultant for virtual try-on, whose outputs are progressively refined via hierarchical vision-language model feedback spanning individual items, complete outfits, and try-on efficacy. Counterexamples are aggregated into negative prompts, forming a closed-loop mechanism that enhances recommendation quality. To assess performance, we introduce a comprehensive evaluation suite encompassing style consistency, visual quality, face similarity, and artistic appraisal. Extensive experiments demonstrate StyleTailor's superior performance in delivering personalized designs and recommendations, outperforming strong baselines without negative feedback and establishing a new benchmark for intelligent fashion systems.
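
The closed-loop negative-feedback mechanism can be sketched as follows; the function names, the critic interface, and the string-based negative prompt are assumptions for illustration, not StyleTailor's actual API.

```python
# Sketch (hypothetical names, not the authors' code) of the closed loop described above:
# vision-language feedback at the item, outfit, and try-on levels is aggregated into a
# negative prompt that conditions the next round of design and recommendation.
def style_loop(user_profile, designer, consultant, critics, rounds: int = 3):
    """critics: dict of callables keyed e.g. by "item", "outfit", "tryon", each returning complaints."""
    negative_prompt = ""
    outfit, tryon = None, None
    for _ in range(rounds):
        outfit = designer(user_profile, negative_prompt)   # personalized garment selection
        tryon = consultant(user_profile, outfit)           # virtual try-on rendering
        complaints = []
        for critic in critics.values():                    # hierarchical multi-level feedback
            complaints += critic(outfit, tryon)
        if not complaints:
            break
        # Counterexamples are folded into the negative prompt for the next iteration.
        negative_prompt = ", ".join(sorted(set(complaints)))
    return outfit, tryon
```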


FNBench: Benchmarking Robust Federated Learning against Noisy Labels

Jiang, Xuefeng, Li, Jia, Wu, Nannan, Wu, Zhiyuan, Li, Xujing, Sun, Sheng, Xu, Gang, Wang, Yuwei, Li, Qi, Liu, Min

arXiv.org Artificial Intelligence

Robustness to label noise within data is a significant challenge in federated learning (FL). From the data-centric perspective, the data quality of distributed datasets cannot be guaranteed, since annotations from different clients contain complicated label noise of varying degrees, which causes performance degradation. There have been some early attempts to tackle noisy labels in FL. However, benchmark studies that comprehensively evaluate their practical performance under unified settings are lacking. To this end, we propose FNBench, the first such benchmark study, providing an experimental investigation that considers three diverse label noise patterns: synthetic label noise, imperfect human-annotation errors, and systematic errors. Our evaluation incorporates eighteen state-of-the-art methods over five image recognition datasets and one text classification dataset. We also provide observations on why noisy labels impair FL and, based on these observations, exploit a representation-aware regularization method to enhance the robustness of existing methods against noisy labels. Finally, we discuss the limitations of this work and outline three directions for future work. To support related research communities, our source code is open-sourced at https://github.com/Sprinter1999/FNBench.
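
For the synthetic-noise pattern, a typical injection step looks like the sketch below (an assumption about the benchmark setup, not FNBench's exact code): symmetric noise flips each label to a uniformly random different class with a given probability, applied per client.

```python
# Illustrative symmetric label-noise injection for one client's dataset.
import random

def add_symmetric_noise(labels, num_classes: int, noise_rate: float, seed: int = 0):
    """Flip each label to a uniformly random *different* class with probability noise_rate."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy

# Example: corrupt 40% of one client's labels on a 10-class task.
_rng = random.Random(1)
clean = [_rng.randrange(10) for _ in range(20)]
print(add_symmetric_noise(clean, num_classes=10, noise_rate=0.4))
```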


PVChat: Personalized Video Chat with One-Shot Learning

Shi, Yufei, Yan, Weilong, Xu, Gang, Li, Yumeng, Li, Yuchen, Li, Zhenxi, Yu, Fei Richard, Li, Ming, Yeo, Si Yong

arXiv.org Artificial Intelligence

Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs.
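
A hedged sketch of one possible form of ReLU routing over attention heads: routing scores pass through a ReLU so that heads with non-positive scores receive zero weight for a given token. The router, dimensions, and normalization below are illustrative assumptions, not the paper's definition.

```python
# Toy ReLU routing over a mixture of attention heads (illustrative only).
import numpy as np

def relu_route_heads(token_feat: np.ndarray, router_w: np.ndarray) -> np.ndarray:
    """token_feat: (d,), router_w: (num_heads, d) -> per-head weights summing to 1 (or all zeros)."""
    scores = np.maximum(router_w @ token_feat, 0.0)   # ReLU gating: non-positive scores -> 0
    total = scores.sum()
    return scores / total if total > 0 else scores    # heads with zero score are skipped

rng = np.random.default_rng(2)
weights = relu_route_heads(rng.normal(size=8), rng.normal(size=(4, 8)))
print(weights)  # sparse, non-negative mixture weights over 4 heads
```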


Inter3D: A Benchmark and Strong Baseline for Human-Interactive 3D Object Reconstruction

Chen, Gan, He, Ying, Yu, Mulin, Yu, F. Richard, Xu, Gang, Ma, Fei, Li, Ming, Zhou, Guang

arXiv.org Artificial Intelligence

Recent advancements in implicit 3D reconstruction methods, e.g., neural radiance fields and Gaussian splatting, have primarily focused on novel view synthesis of static or dynamic objects with continuous motion states. However, these approaches struggle to efficiently model a human-interactive object with n movable parts, requiring 2^n separate models to represent all discrete states. To overcome this limitation, we propose Inter3D, a new benchmark and approach for novel state synthesis of human-interactive objects. We introduce a self-collected dataset featuring commonly encountered interactive objects and a new evaluation pipeline, where only individual part states are observed during training, while part combination states remain unseen. We also propose a strong baseline approach that leverages Space Discrepancy Tensors to efficiently model all states of an object. To alleviate the impractical constraints on camera trajectories across training states, we propose a Mutual State Regularization mechanism to enhance the spatial density consistency of movable parts. In addition, we explore two occupancy grid sampling strategies to improve training efficiency. We conduct extensive experiments on the proposed benchmark, showcasing the challenges of the task and the superiority of our approach.
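
The 2^n blow-up the abstract refers to is easy to see by enumerating part configurations:

```python
# An object with n independently movable parts has 2**n discrete combination states,
# so modeling each state with a separate reconstruction quickly becomes impractical.
from itertools import product

def combination_states(n_parts: int):
    """Return every open/closed configuration of n movable parts."""
    return list(product((0, 1), repeat=n_parts))

for n in (2, 3, 5):
    print(n, "parts ->", len(combination_states(n)), "states")  # 4, 8, 32
```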


Prediction-Enhanced Monte Carlo: A Machine Learning View on Control Variate

Li, Fengpei, Chen, Haoxian, Lin, Jiahe, Gupta, Arkin, Tan, Xiaowei, Xu, Gang, Nevmyvaka, Yuriy, Capponi, Agostino, Lam, Henry

arXiv.org Machine Learning

Despite being an essential tool across engineering and finance, Monte Carlo simulation can be computationally intensive, especially in large-scale, path-dependent problems that hinder straightforward parallelization. A natural alternative is to replace simulation with machine learning or surrogate prediction, though this introduces challenges in understanding the resulting errors. We introduce a Prediction-Enhanced Monte Carlo (PEMC) framework in which machine learning predictions are used as control variates, maintaining unbiased evaluations rather than relying on ML predictors directly. Traditional control variate methods require knowledge of means and focus on per-sample variance reduction. In contrast, PEMC aims at overall cost-aware variance reduction, eliminating the need to know these means. PEMC leverages pre-trained neural architectures to construct effective control variates and replaces computationally expensive sample-path generation with efficient neural network evaluations. This allows PEMC to address scenarios where no good control variates are known.
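
A toy numerical sketch of the idea, assuming a stand-in simulator, a stand-in surrogate, and a standard estimated control-variate coefficient (the paper's construction and cost model are richer): the prediction's mean is estimated from many cheap extra evaluations, so no closed-form control-variate mean is required.

```python
# Illustrative use of an ML-style prediction as a control variate for Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)

def simulate(x):          # stand-in for an expensive, noisy path-dependent simulation
    return np.sin(x) + rng.normal(scale=0.3, size=x.shape)

def predict(x):           # stand-in for a pre-trained surrogate of the simulator
    return np.sin(x) * 0.9

n, m = 2_000, 200_000
x = rng.uniform(0, np.pi, size=n)
y = simulate(x)                                            # expensive samples
g = predict(x)                                             # cheap predictions on the same inputs
g_mean = predict(rng.uniform(0, np.pi, size=m)).mean()     # cheap, independent estimate of E[g(X)]

beta = np.cov(y, g)[0, 1] / np.var(g)                      # estimated variance-minimizing coefficient
plain = y.mean()
cv_estimate = y.mean() - beta * (g.mean() - g_mean)
print(f"plain MC: {plain:.4f}   control-variate estimate: {cv_estimate:.4f}")
```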