AITopics | Yan, Junchi

Collaborating Authors

Yan, Junchi

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

On the Cone Effect in the Learning Dynamics

Zhou, Zhanpeng, Yang, Yongyi, Ren, Jie, Sugiyama, Mahito, Yan, Junchi

arXiv.org Artificial IntelligenceMar-20-2025

Understanding the learning dynamics of neural networks is a central topic in the deep learning community. In this paper, we take an empirical perspective to study the learning dynamics of neural networks in real-world settings. Specifically, we investigate the evolution process of the empirical Neural Tangent Kernel (eNTK) during training. Our key findings reveal a two-phase learning process: i) in Phase I, the eNTK evolves significantly, signaling the rich regime, and ii) in Phase II, the eNTK keeps evolving but is constrained in a narrow space, a phenomenon we term the cone effect. This two-phase framework builds on the hypothesis proposed by Fort et al. (2020), but we uniquely identify the cone effect in Phase II, demonstrating its significant performance advantages over fully linearized training.

artificial intelligence, machine learning, neural network, (16 more...)

arXiv.org Artificial Intelligence

2503.16316

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

AgiBot-World-Contributors, null, Bu, Qingwen, Cai, Jisong, Chen, Li, Cui, Xiuqi, Ding, Yan, Feng, Siyuan, Gao, Shenyuan, He, Xindong, Huang, Xu, Jiang, Shu, Jiang, Yuxin, Jing, Cheng, Li, Hongyang, Li, Jialu, Liu, Chiming, Liu, Yi, Lu, Yuxiang, Luo, Jianlan, Luo, Ping, Mu, Yao, Niu, Yuehan, Pan, Yixuan, Pang, Jiangmiao, Qiao, Yu, Ren, Guanghui, Ruan, Cheng, Shan, Jiaqi, Shen, Yongjian, Shi, Chengshi, Shi, Mingkang, Shi, Modi, Sima, Chonghao, Song, Jianheng, Wang, Huijie, Wang, Wenhao, Wei, Dafeng, Xie, Chengen, Xu, Guo, Yan, Junchi, Yang, Cunbiao, Yang, Lei, Yang, Shukai, Yao, Maoqing, Zeng, Jia, Zhang, Chi, Zhang, Qinglin, Zhao, Bin, Zhao, Chengyue, Zhao, Jiaqi, Zhu, Jianchao

arXiv.org Artificial IntelligenceMar-13-2025

We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.

artificial intelligence, dataset, manipulation, (15 more...)

arXiv.org Artificial Intelligence

2503.06669

Genre: Research Report (0.50)

Industry:

Education (0.48)
Consumer Products & Services (0.46)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Add feedback

Rethinking Video Tokenization: A Conditioned Diffusion-based Approach

Yang, Nianzu, Li, Pandeng, Zhao, Liming, Li, Yang, Xie, Chen-Wei, Tang, Yehui, Lu, Xudong, Liu, Zhihang, Zheng, Yun, Liu, Yu, Yan, Junchi

arXiv.org Artificial IntelligenceMar-8-2025

Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, its training process often relies on complex multi-stage training tricks that go beyond basic reconstruction loss and KL regularization. Among these tricks, the most challenging is the precise tuning of adversarial training with additional Generative Adversarial Networks (GANs) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer more stable training processes and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer, that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature cache mechanism to generate videos of arbitrary length while maintaining temporal continuity and adopt sampling acceleration technique to enhance efficiency. Trained using only a basic MSE diffusion loss for reconstruction, along with KL term and LPIPS perceptual loss from scratch, extensive experiments demonstrate that CDT achieves state-of-the-art performance in video reconstruction tasks with just a single-step sampling. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines. Moreover, the latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights will be released shortly, so please stay tuned for updates!

artificial intelligence, machine learning, video, (16 more...)

arXiv.org Artificial Intelligence

2503.03708

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving

Jia, Xiaosong, You, Junqi, Zhang, Zhiyuan, Yan, Junchi

arXiv.org Artificial IntelligenceMar-7-2025

End-to-end autonomous driving (E2E-AD) has emerged as a trend in the field of autonomous driving, promising a data-driven, scalable approach to system design. However, existing E2E-AD methods usually adopt the sequential paradigm of perception-prediction-planning, which leads to cumulative errors and training instability. The manual ordering of tasks also limits the system`s ability to leverage synergies between tasks (for example, planning-aware perception and game-theoretic interactive prediction and planning). Moreover, the dense BEV representation adopted by existing methods brings computational challenges for long-range perception and long-term temporal fusion. To address these challenges, we present DriveTransformer, a simplified E2E-AD framework for the ease of scaling up, characterized by three key features: Task Parallelism (All agent, map, and planning queries direct interact with each other at each block), Sparse Representation (Task queries direct interact with raw sensor features), and Streaming Processing (Task queries are stored and passed as history information). As a result, the new framework is composed of three unified operations: task self-attention, sensor cross-attention, temporal cross-attention, which significantly reduces the complexity of system and leads to better training stability. DriveTransformer achieves state-of-the-art performance in both simulated closed-loop benchmark Bench2Drive and real world open-loop benchmark nuScenes with high FPS.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.07656

Genre: Research Report (0.50)

Industry:

Transportation > Ground > Road (0.93)
Automobiles & Trucks (0.93)
Information Technology > Robotics & Automation (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.88)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Wang, Jinbo, Wang, Mingze, Zhou, Zhanpeng, Yan, Junchi, E, Weinan, Wu, Lei

arXiv.org Machine LearningFeb-26-2025

Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 1.1B and datasets of OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Machine Learning

2502.19002

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DiOpt: Self-supervised Diffusion for Constrained Optimization

Ding, Shutong, Zhou, Yimiao, Hu, Ke, Yao, Xi, Yan, Junchi, Tang, Xiaoying, Shi, Ye

arXiv.org Artificial IntelligenceFeb-14-2025

Recent advances in diffusion models show promising potential for learning-based optimization by leveraging their multimodal sampling capability to escape local optima. However, existing diffusion-based optimization approaches, often reliant on supervised training, lacks a mechanism to ensure strict constraint satisfaction which is often required in real-world applications. One resulting observation is the distributional misalignment, i.e. the generated solution distribution often exhibits small overlap with the feasible domain. In this paper, we propose DiOpt, a novel diffusion paradigm that systematically learns near-optimal feasible solution distributions through iterative self-training. Our framework introduces several key innovations: a target distribution specifically designed to maximize overlap with the constrained solution manifold; a bootstrapped self-training mechanism that adaptively weights candidate solutions based on the severity of constraint violations and optimality gaps; and a dynamic memory buffer that accelerates convergence by retaining high-quality solutions over training iterations. To our knowledge, DiOpt represents the first successful integration of self-supervised diffusion with hard constraint satisfaction. Evaluations on diverse tasks, including power grid control, motion retargeting, wireless allocation demonstrate its superiority in terms of both optimality and constraint satisfaction.

artificial intelligence, constraint, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2502.1033

Country: Asia > China (0.46)

Genre: Research Report (1.00)

Industry: Energy > Power Industry (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Wholly-WOOD: Wholly Leveraging Diversified-quality Labels for Weakly-supervised Oriented Object Detection

Yu, Yi, Yang, Xue, Li, Yansheng, Han, Zhenjun, Da, Feipeng, Yan, Junchi

arXiv.org Artificial IntelligenceFeb-13-2025

Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To equip the detectors with orientation awareness, supervised regression/classification modules have been introduced at the high cost of rotation annotation. Meanwhile, some existing datasets with oriented objects are already annotated with horizontal boxes or even single points. It becomes attractive yet remains open for effectively utilizing weaker single point and horizontal annotations to train an oriented object detector (OOD). We develop Wholly-WOOD, a weakly-supervised OOD framework, capable of wholly leveraging various labeling forms (Points, HBoxes, RBoxes, and their combination) in a unified fashion. By only using HBox for training, our Wholly-WOOD achieves performance very close to that of the RBox-trained counterpart on remote sensing and other areas, significantly reducing the tedious efforts on labor-intensive annotation for oriented objects. The source codes are available at https://github.com/VisionXLab/whollywood (PyTorch-based) and https://github.com/VisionXLab/whollywood-jittor (Jittor-based).

artificial intelligence, detection, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2502.09471

Country: Asia > China (0.46)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (0.67)
Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances

Yu, Yi, Ren, Botao, Zhang, Peiyuan, Liu, Mingxin, Luo, Junwei, Zhang, Shaofeng, Da, Feipeng, Yan, Junchi, Yang, Xue

arXiv.org Artificial IntelligenceFeb-6-2025

With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At the core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through watershed on Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To our best knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is elegant and lightweight, yet it is expected to give a competitive performance especially in densely packed scenes: 62.61%/86.15%/34.71% on DOTA/HRSC/FAIR1M. Code is available at https://github.com/VisionXLab/point2rbox-v2.

artificial intelligence, detection, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2502.04268

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization

Li, Yang, Guo, Jinpei, Wang, Runzhong, Zha, Hongyuan, Yan, Junchi

arXiv.org Artificial IntelligenceFeb-5-2025

Diffusion models have recently advanced Combinatorial Optimization (CO) as a powerful backbone for neural solvers. However, their iterative sampling process requiring denoising across multiple noise levels incurs substantial overhead. We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which, for a given instance, minimizes the difference among samples originating from varying generative trajectories and time steps relative to the optimal solution. The proposed model enables fast single-step solution generation while retaining the option of multi-step sampling to trade for sampling quality, which offers a more effective and efficient alternative backbone for neural solvers. In addition, within the training-to-testing (T2T) framework, to bridge the gap between training on historical instances and solving new instances, we introduce a novel consistency-based gradient search scheme during the test stage, enabling more effective exploration of the solution space learned during training. It is achieved by updating the latent solution probabilities under objective gradient guidance during the alternation of noise injection and denoising steps. We refer to this model as Fast T2T. Extensive experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency, even outperforming LKH given limited time budgets. Notably, Fast T2T with merely one-step generation and one-step gradient search can mostly outperform the SOTA diffusion-based counterparts that require hundreds of steps, while achieving tens of times speedup.

artificial intelligence, difusco, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.02941

Country:

Asia > China (0.28)
North America > United States > Massachusetts (0.14)

Genre: Research Report > Experimental Study (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)

Add feedback

PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection

Zhang, Peiyuan, Luo, Junwei, Yang, Xue, Yu, Yi, Li, Qingyun, Zhou, Yue, Jia, Xiaosong, Lu, Xudong, Chen, Jingdong, Li, Xiang, Yan, Junchi, Li, Yansheng

arXiv.org Artificial IntelligenceJan-23-2025

With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on the views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model's ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at https://github.com/ZpyWHU/PointOBB-v3.

artificial intelligence, detection, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.13898

Country: Asia > China (1.00)

Genre: Research Report > Promising Solution (0.48)

Industry: Leisure & Entertainment > Sports (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback