afd
Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
StepFun: Wang, Bin, Wang, Bojun, Wan, Changyi, Huang, Guanzhe, Hu, Hanpeng, Jia, Haonan, Nie, Hao, Li, Mingliang, Chen, Nuo, Chen, Siyu, Yuan, Song, Xie, Wuxun, Song, Xiaoniu, Chen, Xing, Yang, Xingping, Zhang, Xuelin, Yu, Yanbo, Wang, Yaoyu, Zhu, Yibo, Jiang, Yimin, Zhou, Yu, Lu, Yuanwei, Li, Houyi, Hu, Jingcheng, Lo, Ka Man, Huang, Ailin, Jiao, Binxing, Li, Bo, Chen, Boyu, Miao, Changxin, Lou, Chang, Hu, Chen, Xu, Chen, Yu, Chenfeng, Yao, Chengyuan, Lv, Daokuan, Shi, Dapeng, Sun, Deshan, Huang, Ding, Hu, Dingyuan, Pang, Dongqing, Liu, Enle, Zhang, Fajie, Wan, Fanqi, Yan, Gulin, Zhang, Han, Zhou, Han, Wu, Hanghao, Guo, Hangyu, Chen, Hanqi, Zhang, Hanshan, Wu, Hao, Zhang, Haocheng, Yan, Haolong, Lv, Haoran, Wei, Haoran, Zhou, Hebin, Wang, Heng, Wang, Heng, Li, Hongxin, Zhou, Hongyu, Wang, Hongyuan, Guo, Huiyong, Wang, Jia, Gong, Jiahao, Xie, Jialing, Zhou, Jian, Sun, Jianjian, Wu, Jiaoren, Zhang, Jiaran, Liu, Jiayu, Cheng, Jie, Luo, Jie, Yan, Jie, Yang, Jie, Hou, Jieyi, Zhang, Jinguang, Cao, Jinlan, Yin, Jisheng, Liu, Junfeng, Huang, Junhao, Lin, Junzhe, Tan, Kaijun, Li, Kaixiang, An, Kang, Lin, Kangheng, Liu, Kenkun, Yang, Lei, Zhao, Liang, Chen, Liangyu, Shi, Lieyu, Tan, Liguo, Lin, Lin, Zhang, Lin, Chen, Lina, Huang, Liwen, Shi, Liying, Gu, Longlong, Chen, Mei, Ren, Mengqiang, Li, Ming, Chen, Mingzhe, Wang, Na, Wu, Nan, Han, Qi, Zhao, Qian, Zhang, Qiang, Liu, Qianni, Chen, Qiaohui, Wu, Qiling, He, Qinglin, Tan, Qinyuan, Wang, Qiufeng, Wu, Qiuping, Liang, Qiuyan, Sun, Quan, Li, Rui, Miao, Ruihang, Wan, Ruosi, Guo, Ruyan, Zhong, Shangwu, Pang, Shaoliang, Fan, Shengjie, Shang, Shijie, Jiang, Shilei, Yang, Shiliang, Hao, Shiming, Gao, Shuli, Huang, Siming, Liu, Siqi, Cao, Tiancheng, Cheng, Tianhao, Peng, Tianhao, You, Wang, Ji, Wei, Sun, Wen, Deng, Wenjin, He, Wenqing, Zheng, Wenzhen, Chen, Xi, Kong, Xiangwen, Luo, Xianzhen, Yang, Xiaobo, Liu, Xiaojia, Ren, Xiaoxiao, Han, Xin, Li, Xin, Wu, Xin, Zhao, Xu, Wei, Yanan, Li, Yang, Li, Yangguang, Xu, Yangshijie, Xu, Yanming, Shi, Yaqiang, Shen, Yeqing, Yang, Yi, Yang, Yifei, Gong, Yifeng, Chen, Yihan, Yang, Yijing, Zhang, Yinmin, Zhou, Yizhuang, Ding, Yuanhao, Fan, Yuantao, Yang, Yuanzhen, Luo, Yuchu, Peng, Yue, Lu, Yufan, Deng, Yuhang, Yin, Yuhe, Liu, Yujie, Chen, Yukun, Zhao, Yuling, Mou, Yun, Li, Yunlong, Ju, Yunzhou, Li, Yusheng, Yang, Yuxiang, Zhang, Yuxiang, Chen, Yuyang, Weng, Zejia, Xie, Zhe, Ge, Zheng, Gong, Zheng, Lu, Zhenyi, Huang, Zhewei, Chang, Zhichao, Huang, Zhiguo, Wang, Zhirui, Yang, Zidong, Wang, Zili, Wang, Ziqi, Zhang, Zixin, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Zhang, Xiangyu
Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under 50ms TPOT SLA (4K context, FP8, no MTP). It is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.
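The throughput figures quoted in this abstract can be put side by side with a little arithmetic. The following is a minimal sketch, not part of the paper: the function names are hypothetical, and the concurrency estimate assumes every sequence emits exactly one token per TPOT interval at the 50 ms limit, which the abstract does not state.

```python
# Figures quoted in the abstract (tokens per second per GPU).
STEP3_TPS = 4039
DEEPSEEK_V3_TPS = 2324
TPOT_SLA_S = 0.050  # 50 ms time-per-output-token budget


def throughput_ratio(a: float, b: float) -> float:
    """Return how many times larger throughput a is than throughput b."""
    return a / b


def implied_concurrency(tokens_per_s: float, tpot_s: float) -> float:
    """Sequences decoded at once per GPU.

    ASSUMPTION (not from the abstract): each sequence emits exactly one
    token per TPOT interval, and TPOT sits right at the SLA limit.
    """
    return tokens_per_s * tpot_s


if __name__ == "__main__":
    ratio = throughput_ratio(STEP3_TPS, DEEPSEEK_V3_TPS)
    seqs = implied_concurrency(STEP3_TPS, TPOT_SLA_S)
    print(f"Step-3 vs DeepSeek-V3 throughput ratio: {ratio:.2f}x")
    print(f"Implied concurrent sequences per GPU: {seqs:.0f}")
```

Under these assumptions the quoted numbers correspond to roughly a 1.74x throughput advantage and about 202 sequences in flight per GPU.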
Alignment Helps Make the Most of Multimodal Data
Arnold, Christian, Küpfer, Andreas
When studying political communication, combining the information from text, audio, and video signals promises to reflect the richness of human communication more comprehensively than confining it to individual modalities alone. However, its heterogeneity, connectedness, and interaction are challenging to address when modeling such multimodal data. We argue that aligning the respective modalities can be an essential step in entirely using the potential of multimodal data because it informs the model with human understanding. Taking care of the data-generating process of multimodal data, our framework proposes four principles to organize alignment and, thus, address the challenges of multimodal data. We illustrate the utility of these principles by analyzing how German MPs address members of the far-right AfD in their speeches and predicting the tone of video advertising in the context of the 2020 US presidential race. Our paper offers important insights to all keen to analyze multimodal data effectively.
Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment
Sun, Hao, van der Schaar, Mihaela
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility. However, existing methods, primarily based on preference datasets, face challenges such as noisy labels, high annotation costs, and privacy concerns. In this work, we introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Drawing insights from forward and inverse reinforcement learning, we introduce divergence minimization objectives for AfD. Analytically, we elucidate the mass-covering and mode-seeking behaviors of various approaches, explaining when and why certain methods are superior. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD. We validate our key insights through experiments on the Harmless and Helpful tasks, demonstrating their strong empirical performance while maintaining simplicity.
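The mass-covering versus mode-seeking distinction mentioned in this abstract is commonly illustrated with forward and reverse KL divergence. The sketch below uses that textbook pair as an assumption; the abstract does not say which divergences the paper analyzes, and the distributions `demo`, `spread`, and `single_mode` are hypothetical toy values.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical bimodal "demonstration" distribution over four outcomes.
demo = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical candidate policies: one spreads its mass evenly,
# the other commits almost entirely to the first mode.
spread = [0.25, 0.25, 0.25, 0.25]
single_mode = [0.97, 0.01, 0.01, 0.01]

if __name__ == "__main__":
    for name, policy in (("spread", spread), ("single_mode", single_mode)):
        forward = kl(demo, policy)   # penalizes missing any demo mode
        reverse = kl(policy, demo)   # penalizes mass where demo has little
        print(f"{name:12s} forward KL={forward:.3f} reverse KL={reverse:.3f}")
```

In this toy setup forward KL is lower for `spread` (it covers both modes), while reverse KL is lower for `single_mode` (it locks onto one mode), which is the behavior the two terms describe.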
Mitigating Feature Gap for Adversarial Robustness by Feature Disentanglement
Zhou, Nuoyan, Zhou, Dawei, Liu, Decheng, Gao, Xinbo, Wang, Nannan
Deep neural networks are vulnerable to adversarial samples. Adversarial fine-tuning methods aim to enhance adversarial robustness through fine-tuning the naturally pre-trained model in an adversarial training manner. However, we identify that some latent features of adversarial samples are confused by adversarial perturbation and lead to an unexpectedly increasing gap between features in the last hidden layer of natural and adversarial samples. To address this issue, we propose a disentanglement-based approach to explicitly model and further remove the latent features that cause the feature gap. Specifically, we introduce a feature disentangler to separate out the latent features from the features of the adversarial samples, thereby boosting robustness by eliminating the latent features. Besides, we align features in the pre-trained model with features of adversarial samples in the fine-tuned model, to further benefit from the features from natural samples without confusion. Empirical evaluations on three benchmark datasets demonstrate that our approach surpasses existing adversarial fine-tuning methods and adversarial training baselines.
Fault-Tolerant Offline Multi-Agent Path Planning
Okumura, Keisuke, Tixeuil, Sébastien
We study a novel graph path planning problem for multiple agents that may crash at runtime, and block part of the workspace. In our setting, agents can detect neighboring crashed agents, and change followed paths at runtime. The objective is then to prepare a set of paths and switching rules for each agent, ensuring that all correct agents reach their destinations without collisions or deadlocks, despite unforeseen crashes of other agents. Such planning is attractive to build reliable multi-robot systems. We present problem formalization, theoretical analysis such as computational complexities, and how to solve this offline planning problem.
Is Appearance Free Action Recognition Possible?
Ilic, Filip, Pock, Thomas, Wildes, Richard P.
Intuition might suggest that motion and dynamic information are key to video-based action recognition. In contrast, there is evidence that state-of-the-art deep-learning video understanding architectures are biased toward static information available in single frames. Presently, a methodology and corresponding dataset to isolate the effects of dynamic information in video are missing. Their absence makes it difficult to understand how well contemporary architectures capitalize on dynamic vs. static information. We respond with a novel Appearance Free Dataset (AFD) for action recognition. AFD is devoid of static information relevant to action recognition in a single frame. Modeling of the dynamics is necessary for solving the task, as the action is only apparent through consideration of the temporal dimension. We evaluated 11 contemporary action recognition architectures on AFD as well as its related RGB video. Our results show a notable decrease in performance for all architectures on AFD compared to RGB. We also conducted a complementary study with humans that shows their recognition accuracy on AFD and RGB is very similar and much better than the evaluated architectures on AFD. Our results motivate a novel architecture that revives explicit recovery of optical flow, within a contemporary design for best performance on AFD and RGB.
Generating Gameplay-Relevant Art Assets with Transfer Learning
Gonzalez, Adrian, Guzdial, Matthew, Ramos, Felix
In game development, designing compelling visual assets that convey gameplay-relevant features requires time and experience. Recent image generation methods that create high-quality content could reduce development costs, but these approaches do not consider game mechanics. We propose a Convolutional Variational Autoencoder (CVAE) system to modify and generate new game visuals based on their gameplay relevance. We test this approach with Pokémon sprites and Pokémon type information, since types are one of the game's core mechanics and they directly impact the game's visuals. Our experimental results indicate that adopting a transfer learning approach can help to improve visual quality and stability over unseen data.
Feature-map-level Online Adversarial Knowledge Distillation
Chung, Inseop, Park, SeongUk, Kim, Jangho, Kwak, Nojun
Feature maps contain rich information about image intensity and spatial correlation. However, previous online knowledge distillation methods only utilize the class probabilities. Thus in this paper, we propose an online knowledge distillation method that transfers not only the knowledge of the class probabilities but also that of the feature map using the adversarial training framework. We train multiple networks simultaneously by employing discriminators to distinguish the feature map distributions of different networks. Each network has its corresponding discriminator which discriminates the feature map from its own as fake while classifying that of the other network as real. By training a network to fool the corresponding discriminator, it can learn the other network's feature map distribution. We show that our method performs better than the conventional direct alignment method such as L1 and is more suitable for online distillation. Also, we propose a novel cyclic learning scheme for training more than two networks together. We have applied our method to various network architectures on the classification task and discovered a significant improvement of performance especially in the case of training a pair of a small network and a large one.
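The labeling scheme described in this abstract (each discriminator treats its own network's feature map as fake and the peer's as real, while the network tries to fool it) can be sketched with a plain binary cross-entropy. This is an illustrative assumption, not the paper's stated loss; the function names are hypothetical, and the discriminator outputs are reduced to single probabilities of "real".

```python
import math


def bce(prob: float, label: int) -> float:
    """Binary cross-entropy for one predicted probability of 'real'."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(prob) if label == 1 else -math.log(1.0 - prob)


def discriminator_loss(d_on_own: float, d_on_other: float) -> float:
    """Loss for discriminator D_k.

    The feature map of its own network k is labeled fake (0);
    the feature map of the peer network is labeled real (1).
    """
    return bce(d_on_own, 0) + bce(d_on_other, 1)


def network_adversarial_loss(d_on_own: float) -> float:
    """Network k tries to make D_k call its own feature map real (1)."""
    return bce(d_on_own, 1)


if __name__ == "__main__":
    # Hypothetical discriminator outputs: probability that a map is "real".
    print(discriminator_loss(d_on_own=0.1, d_on_other=0.9))  # D_k doing well
    print(network_adversarial_loss(d_on_own=0.1))            # network not fooling D_k yet
```

Minimizing `network_adversarial_loss` pushes the network's feature maps toward what the discriminator accepts as the peer's distribution, which is the mechanism the abstract describes for learning the other network's feature map distribution.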
Automatic State Abstraction from Demonstration
Cobo, Luis Carlos (Georgia Institute of Technology) | Zang, Peng (Georgia Institute of Technology) | Isbell Jr., Charles Lee (Georgia Institute of Technology) | Thomaz, Andrea Lockerd (Georgia Institute of Technology)
Learning from Demonstration (LfD) is a popular technique for building decision-making agents from human help. Traditional LfD methods use demonstrations as training examples for supervised learning, but complex tasks can require more examples than is practical to obtain. We present Abstraction from Demonstration (AfD), a novel form of LfD that uses demonstrations to infer state abstractions and reinforcement learning (RL) methods in those abstract state spaces to build a policy. Empirical results show that AfD is more than an order of magnitude more sample efficient than just using demonstrations as training examples, and exponentially faster than RL alone.