Wang, Hao
An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy
Yang, Haotian, Wang, Zhuoran, Chou, Benson, Xu, Sophie, Wang, Hao, Wang, Jingxian, Zhang, Qizhen
Federated Learning (FL) enables distributed ML model training on private user data at global scale. Despite the potential FL has demonstrated in many domains, an in-depth view of its impact on model accuracy remains unclear. In this paper, we systematically investigate how this learning paradigm affects the accuracy of state-of-the-art ML models across a variety of ML tasks. We present an empirical study that covers several data types (text, image, audio, and video) and FL configuration knobs (data distribution, FL scale, client sampling, and local and global computation). Our experiments are conducted in a unified FL framework to achieve high fidelity, with substantial human effort and resource investment. Based on the results, we perform a quantitative analysis of the impact of FL, highlighting challenging scenarios where applying FL drastically degrades model accuracy and identifying cases where the impact is negligible. These detailed and extensive findings can benefit practical deployments and the future development of FL.
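As a concrete reference point, here is a minimal sketch of the FedAvg-style training loop such configuration knobs plug into, on a toy linear-regression task. The Client class, knob defaults, and data are hypothetical illustrations, not the paper's framework:

```python
import numpy as np

class Client:
    """Toy client holding a private linear-regression shard."""
    def __init__(self, X, y):
        self.X, self.y, self.n = X, y, len(y)
    def grad(self, w):
        # Gradient of the local mean-squared error.
        return 2 * self.X.T @ (self.X @ w - self.y) / self.n

def fedavg(global_w, clients, rounds=20, sample_frac=0.5, local_epochs=1, lr=0.1):
    rng = np.random.default_rng(0)
    for _ in range(rounds):                                       # global computation
        k = max(1, int(sample_frac * len(clients)))
        picked = rng.choice(len(clients), size=k, replace=False)  # client sampling
        updates, sizes = [], []
        for i in picked:
            w = global_w.copy()
            for _ in range(local_epochs):                         # local computation
                w -= lr * clients[i].grad(w)
            updates.append(w)
            sizes.append(clients[i].n)
        # Size-weighted average of client models: the FedAvg aggregation rule.
        global_w = np.average(updates, axis=0, weights=sizes)
    return global_w

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
clients = [Client(X := rng.normal(size=(50, 2)), X @ w_true) for _ in range(10)]
print(fedavg(np.zeros(2), clients))  # approaches w_true
```

Each knob in the study maps onto a parameter here: len(clients) (FL scale), sample_frac (client sampling), local_epochs (local computation), and rounds (global computation).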
GAA-TSO: Geometry-Aware Assisted Depth Completion for Transparent and Specular Objects
Liu, Yizhe, Jia, Tong, Cai, Da, Wang, Hao, Chen, Dongyue
Transparent and specular objects are frequently encountered in daily life, factories, and laboratories. However, due to their unique optical properties, the depth information of these objects is usually incomplete and inaccurate, which poses significant challenges for downstream robotics tasks. It is therefore crucial to accurately restore the depth of transparent and specular objects. Previous depth completion methods for these objects usually use RGB information as an additional channel of the depth image to perform depth prediction. Because transparent and specular objects are poorly textured, these methods, which rely heavily on color information, tend to generate structureless depth predictions. Moreover, these 2D methods cannot effectively exploit the 3D structure hidden in the depth channel, resulting in depth ambiguity. To this end, we propose a geometry-aware assisted depth completion method for transparent and specular objects that focuses on exploring the 3D structural cues of the scene. Specifically, besides extracting 2D features from the RGB-D input, we back-project the input depth to a point cloud and build a 3D branch to extract hierarchical scene-level 3D structural features. To exploit this 3D geometric information, we design several gated cross-modal fusion modules that effectively propagate multi-level 3D geometric features to the image branch. In addition, we propose an adaptive correlation aggregation strategy to appropriately assign 3D features to the corresponding 2D features. Extensive experiments on the ClearGrasp, OOD, TransCG, and STD datasets show that our method outperforms other state-of-the-art methods. We further demonstrate that our method significantly enhances the performance of downstream robotic grasping tasks.
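The back-projection step that feeds such a 3D branch is standard pinhole geometry; a minimal sketch, with hypothetical camera intrinsics (this is not the paper's code):

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to an (N, 3) point cloud with the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop pixels with missing/invalid depth

# Hypothetical intrinsics for a 640x480 sensor.
cloud = depth_to_pointcloud(np.random.rand(480, 640), fx=600.0, fy=600.0,
                            cx=320.0, cy=240.0)
print(cloud.shape)  # (N, 3), one 3D point per valid depth pixel
```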
Training Video Foundation Models with NVIDIA NeMo
Patel, Zeeshan, He, Ethan, Mannan, Parth, Ren, Xiaowei, Wolf, Ryan, Agarwal, Niket, Huffman, Jacob, Wang, Zhuoyao, Wang, Carl, Chang, Jack, Bai, Yan, Huang, Tommy, Wang, Linnan, Jain, Sahil, Ramasamy, Shanmugam, Jennings, Joseph, Sirazitdinova, Ekaterina, Sudakov, Oleg, Ma, Mingyuan, Chen, Bobby, Lin, Forrest, Wang, Hao, Sabavat, Vasanth Rao Naik, Niverty, Sriharsha, Ou, Rong, Bhattacharya, Pallab, Page, David, Tajbakhsh, Nima, Aithal, Ashwath
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
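The abstract does not expose NeMo's training APIs, so below is a framework-agnostic sketch of the core computation being parallelized: one epsilon-prediction training step of a diffusion model on video-shaped tensors. The model, shapes, and noise schedule are hypothetical stand-ins, not NeMo's interface:

```python
import torch
import torch.nn.functional as F

def diffusion_step(model, x0, alphas_cumprod, optimizer):
    """One DDPM-style step: noise clean samples at a random timestep and train
    the model to predict the added noise (epsilon-prediction objective)."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward (noising) process
    loss = F.mse_loss(model(x_t, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class TinyDenoiser(torch.nn.Module):
    """Placeholder denoiser; a real video model would also condition on t."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)
    def forward(self, x, t):
        return self.conv(x)

net = TinyDenoiser()
opt = torch.optim.AdamW(net.parameters(), lr=1e-4)
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
x0 = torch.randn(2, 3, 8, 32, 32)  # (batch, channels, frames, height, width)
print(diffusion_step(net, x0, alphas_cumprod, opt))
```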
Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation
Chen, Xiwen, Zhu, Wenhui, Qiu, Peijie, Wang, Hao, Li, Huayu, Wu, Haiyu, Sotiras, Aristeidis, Wang, Yalin, Razi, Abolfazl
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still suffer from overfitting and degraded zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method outperforms existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at https://github.com/ChongQingNoSubway/Prompt-OT
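For illustration, here is a minimal sketch of an entropic (Sinkhorn) OT term between frozen pre-trained features and features under prompt tuning, the kind of distributional regularizer described above. The weight, epsilon, and iteration count are hypothetical, and this is not necessarily the paper's exact formulation:

```python
import torch

def sinkhorn_ot(x, y, eps=0.1, iters=50):
    """Entropic-regularized OT cost between feature sets x (n, d) and y (m, d):
    Sinkhorn scaling on a squared-Euclidean cost matrix with uniform marginals."""
    cost = torch.cdist(x, y) ** 2
    cost = cost / cost.max()                      # normalize for numerical stability
    k = torch.exp(-cost / eps)                    # Gibbs kernel
    a = torch.full((x.shape[0],), 1.0 / x.shape[0], device=x.device)
    b = torch.full((y.shape[0],), 1.0 / y.shape[0], device=y.device)
    u = torch.ones_like(a)
    for _ in range(iters):                        # alternating marginal scaling
        v = b / (k.t() @ u)
        u = a / (k @ v)
    plan = u[:, None] * k * v[None, :]            # approximate transport plan
    return (plan * cost).sum()

pre = torch.randn(64, 512)                        # frozen pre-trained features
fine = torch.randn(64, 512, requires_grad=True)   # features under prompt tuning
loss = 0.1 * sinkhorn_ot(fine, pre)               # added to the usual task loss
loss.backward()
```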
RASD: Retrieval-Augmented Speculative Decoding
Quan, Guofeng, Feng, Wenfeng, Hao, Chuzhan, Jiang, Guochao, Zhang, Yuewei, Wang, Hao
Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches obtain draft tokens either from lightweight draft models or additional model structures, or by retrieving context from databases. Due to the draft model's small size and limited training data, model-based speculative decoding frequently becomes less effective in out-of-domain scenarios. Additionally, the time cost of the drafting phase imposes a low upper limit on acceptance length during the verification step, limiting overall efficiency. This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding via two techniques, tree pruning and tree fusion. First, we develop a pruning method based on the draft model's probability distribution to construct the optimal retrieval tree. Second, we employ the longest-prefix-matching algorithm to merge the tree generated by the draft model with the retrieval tree, producing a unified tree for verification. Experimental results demonstrate that RASD achieves state-of-the-art inference acceleration across tasks such as DocQA, Summary, Code, and In-Domain QA. Moreover, RASD exhibits strong scalability, seamlessly integrating with various speculative decoding approaches, including both generation-based and retrieval-based methods.
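A simplified, sequence-level sketch of the longest-prefix-matching idea behind tree fusion (real RASD operates on token trees; the token IDs below are arbitrary):

```python
def longest_prefix_match(draft, retrieved):
    """Return the retrieved continuation sharing the longest common token
    prefix with the draft sequence."""
    def common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    return max(retrieved, key=lambda r: common_prefix_len(draft, r))

draft = [5, 17, 9, 3]
candidates = [[5, 17, 2], [5, 17, 9, 8, 4], [7, 1]]
print(longest_prefix_match(draft, candidates))  # -> [5, 17, 9, 8, 4]
```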
Effectively Steer LLM To Follow Preference via Building Confident Directions
Song, Bingqing, Han, Boran, Zhang, Shuai, Wang, Hao, Fang, Haoyang, Min, Bonan, Wang, Yuyang, Hong, Mingyi
Having an LLM that aligns with human preferences is essential for accommodating individual needs, such as maintaining a writing style or generating specific topics of interest. The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which modify the model output by constructing specific steering directions, are typically easy to implement and optimization-free. However, their capabilities are typically limited to steering the model into one of two directions (i.e., bidirectional steering), and there has been no theoretical understanding guaranteeing their performance. In this work, we propose a theoretical framework to understand and quantify model steering methods. Inspired by this framework, we propose a confident direction steering method (CONFST) that steers LLMs by modifying their activations at inference time. More specifically, CONFST builds a confident direction that is closely aligned with users' preferences; this direction is then added to the activations of the LLM to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) It is more powerful, since multiple (i.e., more than two) users' preferences can be aligned simultaneously; 2) It is simple to implement, since there is no need to determine which layer to add the steering vector to; 3) No explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B), and Gemma-it (9B) models for tasks that require shifting the output of LLMs across various topics and styles, achieving superior performance over competing methods.
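Activation steering of this general kind is commonly implemented with a forward hook that shifts a layer's output along a fixed direction; a minimal sketch with a random direction and a stand-in linear layer (CONFST instead builds the direction from user-preference data):

```python
import torch

def add_steering_hook(layer, direction, alpha=4.0):
    """Register a forward hook that shifts the layer's output along a fixed
    direction: h <- h + alpha * direction (inference-time, no fine-tuning)."""
    def hook(module, inputs, output):
        return output + alpha * direction
    return layer.register_forward_hook(hook)

layer = torch.nn.Linear(16, 16)              # stand-in for a transformer block
direction = torch.nn.functional.normalize(torch.randn(16), dim=0)
handle = add_steering_hook(layer, direction)
out = layer(torch.randn(2, 16))              # steered activations
handle.remove()                              # restore original behavior
```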
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Team, M-A-P, Du, Xinrun, Yao, Yifan, Ma, Kaijing, Wang, Bingli, Zheng, Tianyu, Zhu, Kang, Liu, Minghao, Liang, Yiming, Jin, Xiaolong, Wei, Zhenlin, Zheng, Chujie, Deng, Kaixin, Jia, Shian, Jiang, Sichao, Liao, Yiyan, Li, Rui, Li, Qinrui, Li, Sirun, Li, Yizhi, Li, Yunwen, Ma, Dehua, Ni, Yuansheng, Que, Haoran, Wang, Qiyao, Wen, Zhoufutu, Wu, Siwei, Xing, Tianshun, Xu, Ming, Yang, Zhenzhu, Wang, Zekun Moore, Zhou, Junting, Bai, Yuelin, Bu, Xingyuan, Cai, Chenglin, Chen, Liang, Chen, Yifan, Cheng, Chengtuo, Cheng, Tianhao, Ding, Keyi, Huang, Siming, Huang, Yun, Li, Yaoru, Li, Yizhe, Li, Zhaoqun, Liang, Tianhao, Lin, Chengdong, Lin, Hongquan, Ma, Yinghao, Pang, Tianyang, Peng, Zhongyuan, Peng, Zifan, Qi, Qige, Qiu, Shi, Qu, Xingwei, Quan, Shanghaoran, Tan, Yizhou, Wang, Zili, Wang, Chenqing, Wang, Hao, Wang, Yiya, Wang, Yubo, Xu, Jiajun, Yang, Kexin, Yuan, Ruibin, Yue, Yuanhao, Zhan, Tianyang, Zhang, Chun, Zhang, Jinyang, Zhang, Xiyue, Zhang, Xingjian, Zhang, Yue, Zhao, Yongchi, Zheng, Xiangyu, Zhong, Chenghua, Gao, Yang, Li, Zhoujun, Liu, Dayiheng, Liu, Qian, Liu, Tianyu, Ni, Shiwen, Peng, Junran, Qin, Yujia, Su, Wenbo, Wang, Guoyin, Wang, Shi, Yang, Jian, Yang, Min, Cao, Meng, Yue, Xiang, Zhang, Zhaoxiang, Zhou, Wangchunshu, Liu, Jiaheng, Lin, Qunshu, Huang, Wenhao, Zhang, Ge
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
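A toy sketch of the intuition behind LLM-assisted question triage: questions every reference model answers correctly are likely trivial, and questions none answers correctly are routed to expert review. The criteria and data layout here are hypothetical; the paper's actual mechanism is iterative and expert-in-the-loop:

```python
def triage_questions(questions, model_answers):
    """Split questions by how many reference LLMs answer them correctly:
    all correct -> likely trivial; none correct -> possibly ambiguous,
    route to expert review; otherwise keep."""
    keep, trivial, review = [], [], []
    for q in questions:
        correct = sum(ans[q["id"]] == q["gold"] for ans in model_answers)
        if correct == len(model_answers):
            trivial.append(q)
        elif correct == 0:
            review.append(q)          # experts decide: hard or ill-posed?
        else:
            keep.append(q)
    return keep, trivial, review

qs = [{"id": 0, "gold": "B"}, {"id": 1, "gold": "D"}]
answers = [{0: "B", 1: "A"}, {0: "B", 1: "C"}]   # two reference models
print(triage_questions(qs, answers))
```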
Towards Widening The Distillation Bottleneck for Reasoning Models
Yin, Huifeng, Zhao, Yu, Wu, Minghao, Ni, Xuanfan, Zeng, Bo, Wang, Hao, Shi, Tianqi, Shao, Liangying, Lyu, Chenyang, Wang, Longyue, Luo, Weihua, Zhang, Kaifu
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chains-of-Thought (CoT). Distillation, i.e., post-training on LRM-generated data, is a straightforward yet effective way to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we find that distilled long CoT data is difficult for small models to learn from and leads to the inheritance of biases (i.e., over-thinking) under both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and a Joint Post-training Objective, to enhance SFT and RL on the constructed data.
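A generic MCTS skeleton (UCB1 selection, expansion, value backup) of the kind used to grow tree-structured CoT data; the expand and reward functions below are hypothetical placeholders for a reasoning-step generator and a chain scorer:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """UCB1: exploit high-value children, explore rarely visited ones."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, expand, reward, n_sims=100):
    for _ in range(n_sims):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=ucb)
        for s in expand(node.state):               # expansion
            node.children.append(Node(s, parent=node))
        if node.children:
            node = random.choice(node.children)
        r = reward(node.state)                     # evaluation
        while node:                                # backpropagation
            node.visits += 1
            node.value += r
            node = node.parent
    return root

root = Node(state=["question"])
mcts(root,
     expand=lambda s: [s + [f"step{len(s)}-{i}"] for i in range(2)],
     reward=lambda s: random.random())
print(max(root.children, key=lambda n: n.visits).state)  # most-visited branch
```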
RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery
Gu, Hongchao, Li, Dexun, Dong, Kuicai, Zhang, Hao, Lv, Hang, Wang, Hao, Lian, Defu, Liu, Yong, Chen, Enhong
Generating knowledge-intensive and comprehensive long texts, such as encyclopedia articles, remains a significant challenge for Large Language Models: it requires not only the precise integration of facts but also the maintenance of thematic coherence throughout the article. Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. To address these challenges, we propose RAPID, an efficient retrieval-augmented long text generation framework. RAPID consists of three main modules: (1) retrieval-augmented preliminary outline generation to reduce hallucinations, (2) attribute-constrained search for efficient information discovery, and (3) plan-guided article generation for enhanced coherence. Extensive experiments on our newly compiled benchmark dataset, FreshWiki-2024, demonstrate that RAPID significantly outperforms state-of-the-art methods across a wide range of evaluation metrics (e.g., long-text generation quality, outline quality, and latency). Our work provides a robust and efficient solution to the challenges of automated long-text generation.
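A skeleton of the three-stage decomposition, with llm(prompt) and search(query) as hypothetical stand-ins for the actual generation and retrieval components:

```python
def write_article(topic, llm, search):
    """Three-stage flow mirroring the RAPID decomposition (a sketch)."""
    # 1) Retrieval-augmented preliminary outline: ground sections in sources.
    refs = search(topic)
    outline = llm(f"Draft an outline for '{topic}' using:\n{refs}")
    # 2) Attribute-constrained search: one targeted query per outline section.
    evidence = {sec: search(f"{topic} {sec}")
                for sec in outline.splitlines() if sec.strip()}
    # 3) Plan-guided generation: write each section against its own evidence.
    body = [llm(f"Write the section '{sec}' of '{topic}' using:\n{ev}")
            for sec, ev in evidence.items()]
    return "\n\n".join(body)

print(write_article("Quantum error correction",
                    llm=lambda p: "Intro\nMethods",
                    search=lambda q: f"[results for {q}]")[:80])
```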
InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation
Zhang, Chong, Ma, Yukun, Chen, Qian, Wang, Wen, Zhao, Shengkui, Pan, Zexu, Wang, Hao, Ni, Chongjia, Nguyen, Trung Hieu, Zhou, Kun, Jiang, Yidi, Tan, Chaohong, Gao, Zhifu, Du, Zhihao, Ma, Bin
We introduce InspireMusic, a framework that integrates super-resolution and a large language model for high-fidelity long-form music generation. The unified framework, which couples an autoregressive transformer with a super-resolution flow-matching model, generates high-fidelity music, songs, and audio, and enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches in that we utilize an audio tokenizer with a single codebook that carries richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables high-quality audio generation with long-form coherence of up to 8 minutes. An autoregressive transformer model based on Qwen 2.5 predicts audio tokens; a super-resolution flow-matching model then generates high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model performs comparably to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
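For the super-resolution stage, here is a minimal sketch of the conditional flow-matching objective (regress the velocity of a straight noise-to-data path); the toy network and latent shapes are hypothetical, not InspireMusic's architecture:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_model, x1):
    """Conditional flow matching with linear interpolation: sample x_t on the
    straight path from noise x0 to data x1 and regress the constant target
    velocity (x1 - x0)."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1                      # point on the path
    target = x1 - x0                                 # path velocity d x_t / d t
    return F.mse_loss(v_model(x_t, t), target)

class TinyVelocity(torch.nn.Module):
    """Toy velocity field conditioned on t via feature concatenation."""
    def __init__(self, d=64):
        super().__init__()
        self.net = torch.nn.Linear(d + 1, d)
    def forward(self, x, t):
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

x1 = torch.randn(8, 64)                              # stand-in audio latents
print(flow_matching_loss(TinyVelocity(), x1).item())
```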