
Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts: Supplementary Material

Neural Information Processing Systems

With this data structure, DMoE can use beam search to select the best experts. Many popular architectures, including Transformers, can train entirely in that precision mode [7]. In addition, the deep learning architectures discussed in this work rely on backpropagation for training.
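The beam search over experts mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `beam_search_experts` and its grid-of-scores input are hypothetical names, assuming each expert is indexed by a tuple of per-dimension indices and scored by the sum of its per-dimension gate scores.

```python
def beam_search_experts(dim_scores, beam_size):
    """Select high-scoring experts on a multi-dimensional grid.

    dim_scores: list of per-dimension score lists; an expert is a tuple
    (i_0, ..., i_{d-1}) whose score is the sum of its per-dimension
    scores. The grid is explored one dimension at a time, keeping only
    the `beam_size` best partial prefixes (beam search).
    """
    beams = [((), 0.0)]  # (index prefix, accumulated score)
    for scores in dim_scores:
        candidates = [
            (prefix + (i,), total + s)
            for prefix, total in beams
            for i, s in enumerate(scores)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

With per-dimension scores `[[0.1, 0.9], [0.5, 0.2]]` and a beam of 2, the best expert found is `(1, 0)` with score 1.4, without ever enumerating the full grid.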


$$h_{o,j} = A_{j,:}X \cdot \frac{\sum \mathrm{topk}(A_{j,:}X)}{\sum \mathrm{topk}(A_{j,:}X) + \sum \mathrm{topk}\left(A_{j,:}(1-X)\right)} \tag{21}$$

Neural Information Processing Systems

We categorize existing implementations into two kinds: (1) for verification only (typically implemented on CPUs, including DeepZ [35] and DeepPoly [37]); (2) for training certified defenses (typically using more efficient, yet weaker or approximated bounds: convex outer adversarial polytope [45], DiffAI [28], IBP [9] and CROWN-IBP [50]). Our contribution is not to improve the tightness of LiRPA bounds, but to provide the first framework that generalizes to general computational graphs in an automatic manner. In CROWN [50], the quadratic bound is only applied to 2-layer networks and is hard to extend to multiple layers: when propagating a quadratic bound to the 3rd layer it becomes quartic ($x^4$) due to correlations between two quadratic terms ("order explosion").
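For contrast with the tighter LiRPA bounds discussed above, plain interval bound propagation (IBP) can be sketched in a few lines. This is a minimal illustration of the technique, not the framework's API; it assumes a linear layer y = Wx + b followed by ReLU, with elementwise input intervals.

```python
def ibp_linear(W, b, lo, hi):
    """Interval bound propagation through y = W x + b.

    For each output, the lower bound pairs each weight with the worse
    endpoint of the corresponding input interval (depending on the
    weight's sign), and the upper bound with the better endpoint.
    """
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, a, c in zip(row, lo, hi):
            if w >= 0:
                l += w * a
                h += w * c
            else:
                l += w * c
                h += w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def ibp_relu(lo, hi):
    """ReLU is monotone, so it maps interval endpoints directly."""
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]
```

For `W = [[1, -1]]`, `b = [0]` and both inputs in [0, 1], the pre-activation interval is [-1, 1] and the post-ReLU interval is [0, 1]; tighter methods such as CROWN shrink exactly this kind of over-approximation.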


Scalable Utility-Aware Multiclass Calibration

Hegazy, Mahmoud, Jordan, Michael I., Dieuleveut, Aymeric

arXiv.org Machine Learning

Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable \emph{evaluation} of multiclass calibration. To this end, we propose utility calibration, a general framework that measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and, going beyond such binarized approaches, toward assessing calibration for richer classes of downstream utilities.
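As a concrete instance of the binarized top-class approach mentioned above, the classical binned expected calibration error (ECE) can be computed as follows. This is the standard metric the paper seeks to robustify, not its utility-calibration framework; `top_class_ece` is a name chosen here for illustration.

```python
def top_class_ece(probs, labels, n_bins=10):
    """Binned expected calibration error for the top-class confidence.

    Predictions are bucketed by their maximum predicted probability;
    within each bucket, the average confidence is compared to the
    empirical accuracy, and the gaps are averaged with bucket weights.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(pred == y)))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - acc)
    return ece
```

A perfectly calibrated, perfectly confident classifier scores 0; a single prediction with confidence 0.75 that turns out wrong scores 0.75.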



LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

Zhuang, Yuan, Shen, Yi, Bian, Yuexin, Su, Qing, Ji, Shihao, Shi, Yuanyuan, Miao, Fei

arXiv.org Artificial Intelligence

Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Our method not only achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation. Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing (NLP) tasks. However, their growing size requires significant computational resources for full-parameter fine-tuning. To address this, Parameter-Efficient Fine-tuning (PEFT) methods, such as Adapter-tuning (Houlsby et al., 2019) and LoRA (Hu et al., 2021), have emerged as crucial techniques for reducing training costs. Recently, the Mixture-of-Experts (MoE) design (Jacobs et al., 1991; Shazeer et al., 2017) has been successfully integrated into transformer feed-forward networks during LLM pretraining (Dai et al., 2024; Yang et al., 2025), demonstrating that MoE can reduce computational cost while maintaining strong performance.
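The conventional TopK routing that LD-MoLE replaces can be sketched as follows. This is a generic illustration of the baseline, not the paper's method; `topk_route` is a hypothetical helper operating on one token's gate logits.

```python
import math

def topk_route(logits, k):
    """Conventional TopK routing for one token.

    Keeps the k largest gate logits, renormalizes them with a softmax,
    and zeroes out all other experts. The hard cutoff at k is what
    makes this selection non-differentiable with respect to k.
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    exps = {i: math.exp(logits[i]) for i in keep}
    z = sum(exps.values())
    return [exps[i] / z if i in keep else 0.0 for i in range(len(logits))]
```

Every token receives exactly `k` nonzero gate weights summing to 1, regardless of how confident the router is; making both the selection and the number of active experts differentiable and token-dependent is precisely what the paper targets.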




SplInterp: Improving our Understanding and Training of Sparse Autoencoders

Budd, Jeremy, Ideami, Javier, Rynne, Benjamin Macdowall, Duggar, Keith, Balestriero, Randall

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework: we discover that SAEs generalise ``$k$-means autoencoders'' to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal ``$k$-means-esque plus local principal component analysis (PCA)'' piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp
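A TopK SAE of the kind analyzed above can be sketched in a few lines, assuming the usual encode / TopK / decode structure; names and shapes here are illustrative, not the paper's code.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, k):
    """Forward pass of a TopK sparse autoencoder.

    Encodes x linearly, keeps only the k largest code entries, then
    decodes with a dictionary whose rows are atoms. The hard TopK step
    partitions input space into regions, which is what gives these
    models their piecewise-affine, power-diagram geometry.
    """
    # encode: z = W_enc x + b_enc
    z = [sum(w * xi for w, xi in zip(row, x)) + b
         for row, b in zip(W_enc, b_enc)]
    # sparsify: keep only the k largest codes, zero the rest
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    keep = set(order[:k])
    z = [zi if i in keep else 0.0 for i, zi in enumerate(z)]
    # decode: x_hat = sum_j z_j * atom_j
    x_hat = [sum(W_dec[j][i] * z[j] for j in range(len(z)))
             for i in range(len(x))]
    return z, x_hat
```

With an identity encoder/decoder and `k = 1`, an axis-aligned input is reconstructed exactly while all other code entries are zeroed.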


Towards Efficient Training of Graph Neural Networks: A Multiscale Approach

Gal, Eshed, Eliasof, Moshe, Schönlieb, Carola-Bibiane, Haber, Eldad, Treister, Eran

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have emerged as a powerful tool for learning and inferring from graph-structured data, and are widely used in a variety of applications, often involving large amounts of data and large graphs. However, training on such data requires large memory and extensive computations. In this paper, we introduce a novel framework for efficient multiscale training of GNNs, designed to integrate information across multiscale representations of a graph. Our approach leverages a hierarchical graph representation, taking advantage of coarse graph scales in the training process, where each coarse scale graph has fewer nodes and edges. Based on this approach, we propose a suite of GNN training methods, including coarse-to-fine, sub-to-full, and multiscale gradient computation. We demonstrate the effectiveness of our methods on various datasets and learning tasks.
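The hierarchical representation described above relies on graph coarsening. A minimal sketch, assuming a node-to-cluster assignment is already given; `coarsen_graph` and `cluster_of` are hypothetical names, not the paper's implementation.

```python
def coarsen_graph(edges, cluster_of):
    """Build a coarse graph by merging fine nodes into clusters.

    Each coarse node is a cluster; a coarse edge connects two clusters
    whenever at least one fine edge crosses between them (intra-cluster
    edges collapse into the merged node and are dropped). The coarse
    graph therefore has fewer nodes and edges, which is what makes
    training on it cheaper.
    """
    coarse_edges = set()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:
            coarse_edges.add((min(cu, cv), max(cu, cv)))
    n_coarse = max(cluster_of) + 1
    return sorted(coarse_edges), n_coarse
```

A path on four nodes with pairs merged, `cluster_of = [0, 0, 1, 1]`, collapses to a single coarse edge between two coarse nodes; repeating this builds the multiscale hierarchy.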


Interpreting CLIP with Hierarchical Sparse Autoencoders

Zaigrajew, Vladimir, Baniecki, Hubert, Biecek, Przemyslaw

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited by optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA.
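The Matryoshka idea of learning representations at multiple granularities can be illustrated by reconstructing the input from nested prefixes of the code. This is a sketch of the nested-prefix objective under the assumption of a linear decoder, not the MSAE implementation; `matryoshka_losses` is a name chosen here.

```python
def matryoshka_losses(z, W_dec, x, prefixes):
    """Squared reconstruction error from nested prefixes of the code.

    For each prefix length m, the input is reconstructed using only the
    first m latents (rows of W_dec are dictionary atoms). Summing these
    losses forces early latents to carry coarse structure while later
    ones refine it, which is the hierarchical training signal.
    """
    losses = []
    for m in prefixes:
        x_hat = [sum(W_dec[j][i] * z[j] for j in range(m))
                 for i in range(len(x))]
        losses.append(sum((a - b) ** 2 for a, b in zip(x_hat, x)))
    return losses
```

With an identity decoder and `prefixes = [1, 2]`, the one-latent prefix leaves residual error that the full two-latent code removes, so the per-prefix losses decrease as the prefix grows.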