AITopics | kda

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Zhang, Yu, Lin, Zongyu, Yao, Xingcheng, Hu, Jiaxi, Meng, Fanqing, Liu, Chengyin, Men, Xin, Yang, Songlin, Li, Zhiyuan, Li, Wentao, Lu, Enzhe, Liu, Weizhou, Chen, Yanru, Xu, Weixin, Yu, Longhui, Wang, Yejie, Fan, Yu, Zhong, Longguang, Yuan, Enming, Zhang, Dehao, Zhang, Yizhi, Liu, T. Y., Wang, Haiming, Fang, Shengjun, He, Weiran, Liu, Shaowei, Li, Yiwei, Su, Jianlin, Qiu, Jiezhong, Pang, Bo, Yan, Junjie, Jiang, Zhejun, Huang, Weixiao, Yin, Bohong, You, Jiacheng, Wei, Chu, Wang, Zhengtao, Hong, Chao, Chen, Yutian, Chen, Guanduo, Wang, Yucheng, Zheng, Huabin, Wang, Feng, Liu, Yibo, Dong, Mengnan, Zhang, Zheng, Pan, Siyuan, Wu, Wenhao, Wu, Yuhao, Guan, Longyu, Tao, Jiawen, Fu, Guohong, Xu, Xinran, Wang, Yuzhi, Lai, Guokun, Wu, Yuxin, Zhou, Xinyu, Yang, Zhilin, Du, Yulun

arXiv.org Artificial Intelligence, Nov-4-2025

We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
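The abstract describes KDA as a delta-rule linear attention with finer-grained gating over a fixed-size recurrent state. A minimal NumPy sketch of one such recurrence follows. The exact update form, the function name `gated_delta_step`, the gate names `alpha` and `beta`, and the shapes are assumptions made for illustration; this is a token-by-token reference form, not the chunkwise kernel the authors release.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a delta-rule memory with per-channel decay.

    S:     (d_k, d_v) fixed-size state (the finite-state memory)
    q, k:  (d_k,) query and key; k is assumed to be L2-normalized
    v:     (d_v,) value
    alpha: (d_k,) per-channel forget gate in [0, 1] (assumed fine-grained gate)
    beta:  scalar write strength in [0, 1]
    """
    # Channel-wise decay, equivalent to Diag(alpha) @ S.
    S = alpha[:, None] * S
    # What the decayed memory currently returns for this key.
    pred = S.T @ k
    # Delta rule: move the stored value for k toward v by a fraction beta.
    # Algebraically: (I - beta k k^T) Diag(alpha) S_prev + beta k v^T.
    S = S + beta * np.outer(k, v - pred)
    # Read out with the query.
    return S, S.T @ q
```

With a unit-norm key and `beta = 1`, writing a new value under an existing key overwrites the old one regardless of the decay, which is the behavior that distinguishes the delta rule from a purely additive linear-attention update.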

large language model, machine learning, natural language, (19 more...)

arXiv:2510.26692

Country: Asia > China (0.45)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs

Liang, Buyun, Chan, Kwan Ho Ryan, Thaker, Darshan, Luo, Jinqi, Vidal, René

arXiv.org Artificial Intelligence, Feb-5-2025

Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.

large language model, machine learning, natural language, (17 more...)

arXiv:2502.05223

Country:

North America > United States > Pennsylvania (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Monaco (0.04)

Genre: Research Report (0.83)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Multi-Class Classifiers and Their Underlying Shared Structure

Vural, Volkan (Northeastern University) | Fung, Glenn (Siemens Medical Solutions, Inc) | Rosales, Romer (Siemens Medical Solutions, Inc) | Dy, Jennifer G. (Northeastern University)

AAAI Conferences, Jun-23-2009

Multi-class problems have a richer structure than binary classification problems. Thus, they can potentially improve their performance by exploiting the relationship among class labels. While for the purposes of providing an automated classification result this class structure does not need to be explicitly unveiled, for human level analysis or interpretation this is valuable. We develop a multi-class large margin classifier that extracts and takes advantage of class relationships. We provide a bi-convex formulation that explicitly learns a matrix that captures these class relationships and is de-coupled from the feature weights. Our representation can take advantage of the class structure to compress the model by reducing the number of classifiers employed, maintaining high accuracy even with large compression. In addition, we present an efficient formulation in terms of speed and memory.

equation, kda, pkda, (17 more...)

Twenty-First International Joint Conference on Artificial Intelligence

Country:

North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Arizona (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Nonlinear Discriminant Analysis Using Kernel Functions

Roth, Volker, Steinhage, Volker

Neural Information Processing Systems, Dec-31-2000

Fisher's linear discriminant analysis (LDA) is a classical multivariate technique for both dimension reduction and classification. The data vectors are transformed into a low-dimensional subspace such that the class centroids are spread out as much as possible. In this subspace, LDA works as a simple prototype classifier with linear decision boundaries. However, in many applications the linear boundaries do not adequately separate the classes. We present a nonlinear generalization of discriminant analysis that uses the kernel trick of representing dot products by kernel functions.
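The kernel trick mentioned in the abstract can be illustrated with a small example: a kernel function evaluated on the original inputs gives the same number as a dot product in an explicit feature space. The sketch below uses a degree-2 polynomial kernel on 2-D inputs; the kernel choice and the names `poly2_kernel` and `poly2_features` are assumptions for illustration, since the abstract does not say which kernels the paper uses.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D input whose dot product equals poly2_kernel.

    phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2)
    """
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def gram_matrix(X, kernel):
    """Pairwise kernel values for the rows of X, computed without any feature map."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```

A kernelized discriminant method needs only the Gram matrix from `gram_matrix`, so the feature map never has to be built explicitly; `poly2_features` is here only to check the equivalence.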

artificial intelligence, discriminant analysis, machine learning, (17 more...)

Country:

North America > United States > California > Monterey County > Monterey (0.04)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.73)
