
Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts: Supplementary Material

Neural Information Processing Systems

With this data structure, DMoE can use beam search to select the best experts. Many popular architectures, including Transformers, can train entirely in that precision mode [7]. In addition, the deep learning architectures discussed in this work rely on backpropagation for training.
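The beam search over experts mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `beam_search_experts` and its grid-of-scores input are hypothetical names, assuming each expert is indexed by a tuple of per-dimension indices and scored by the sum of its per-dimension gate scores.

```python
def beam_search_experts(dim_scores, beam_size):
    """Select high-scoring experts on a multi-dimensional grid.

    dim_scores: list of per-dimension score lists; an expert is a tuple
    (i_0, ..., i_{d-1}) whose score is the sum of its per-dimension
    scores. The grid is explored one dimension at a time, keeping only
    the `beam_size` best partial prefixes (beam search).
    """
    beams = [((), 0.0)]  # (index prefix, accumulated score)
    for scores in dim_scores:
        candidates = [
            (prefix + (i,), total + s)
            for prefix, total in beams
            for i, s in enumerate(scores)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

With per-dimension scores `[[0.1, 0.9], [0.5, 0.2]]` and a beam of 2, the best expert found is `(1, 0)` with score 1.4, without ever enumerating the full grid.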


$$h_{o,j} = A_{j,:}X \cdot \frac{\sum \mathrm{topk}(A_{j,:}X)}{\sum \mathrm{topk}(A_{j,:}X) + \sum \mathrm{topk}\left(A_{j,:}(1-X)\right)} \tag{21}$$

Neural Information Processing Systems

We categorize existing implementations into two kinds: (1) for verification only (typically implemented on CPUs, including DeepZ [35] and DeepPoly [37]); (2) for training certified defenses (typically using more efficient, yet weaker or approximated bounds: convex outer adversarial polytope [45], DiffAI [28], IBP [9] and CROWN-IBP [50]). Our contribution is not to improve the tightness of LiRPA bounds, but to provide the first framework that generalizes to general computational graphs in an automatic manner. In CROWN [50], the quadratic bound is only applied to 2-layer networks and is hard to extend to multiple layers: when propagating a quadratic bound to the 3rd layer it becomes quartic ($x^4$) due to correlations between two quadratic terms ("order explosion").
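For contrast with the tighter LiRPA bounds discussed above, plain interval bound propagation (IBP) can be sketched in a few lines. This is a minimal illustration of the technique, not the framework's API; it assumes a linear layer y = Wx + b followed by ReLU, with elementwise input intervals.

```python
def ibp_linear(W, b, lo, hi):
    """Interval bound propagation through y = W x + b.

    For each output, the lower bound pairs each weight with the worse
    endpoint of the corresponding input interval (depending on the
    weight's sign), and the upper bound with the better endpoint.
    """
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, a, c in zip(row, lo, hi):
            if w >= 0:
                l += w * a
                h += w * c
            else:
                l += w * c
                h += w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def ibp_relu(lo, hi):
    """ReLU is monotone, so it maps interval endpoints directly."""
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]
```

For `W = [[1, -1]]`, `b = [0]` and both inputs in [0, 1], the pre-activation interval is [-1, 1] and the post-ReLU interval is [0, 1]; tighter methods such as CROWN shrink exactly this kind of over-approximation.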


Scalable Utility-Aware Multiclass Calibration

Hegazy, Mahmoud, Jordan, Michael I., Dieuleveut, Aymeric

arXiv.org Machine Learning

Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable \emph{evaluation} of multiclass calibration. To this end, we propose utility calibration, a general framework that measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and, going beyond such binarized approaches, toward assessing calibration for richer classes of downstream utilities.
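As a concrete instance of the binarized top-class approach mentioned above, the classical binned expected calibration error (ECE) can be computed as follows. This is the standard metric the paper seeks to robustify, not its utility-calibration framework; `top_class_ece` is a name chosen here for illustration.

```python
def top_class_ece(probs, labels, n_bins=10):
    """Binned expected calibration error for the top-class confidence.

    Predictions are bucketed by their maximum predicted probability;
    within each bucket, the average confidence is compared to the
    empirical accuracy, and the gaps are averaged with bucket weights.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(pred == y)))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - acc)
    return ece
```

A perfectly calibrated, perfectly confident classifier scores 0; a single prediction with confidence 0.75 that turns out wrong scores 0.75.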



LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

Zhuang, Yuan, Shen, Yi, Bian, Yuexin, Su, Qing, Ji, Shihao, Shi, Yuanyuan, Miao, Fei

arXiv.org Artificial Intelligence

Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Our method not only achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation. Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing (NLP) tasks. However, their growing size requires significant computational resources for full-parameter fine-tuning. To address this, Parameter-Efficient Fine-tuning (PEFT) methods, such as Adapter-tuning (Houlsby et al., 2019) and LoRA (Hu et al., 2021), have emerged as crucial techniques for reducing training costs. Recently, the Mixture-of-Experts (MoE) design (Jacobs et al., 1991; Shazeer et al., 2017) has been successfully integrated into transformer feed-forward networks during LLM pretraining (Dai et al., 2024; Yang et al., 2025), demonstrating that MoE can reduce computational cost while maintaining strong performance.
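The conventional TopK routing that LD-MoLE replaces can be sketched as follows. This is a generic illustration of the baseline, not the paper's method; `topk_route` is a hypothetical helper operating on one token's gate logits.

```python
import math

def topk_route(logits, k):
    """Conventional TopK routing for one token.

    Keeps the k largest gate logits, renormalizes them with a softmax,
    and zeroes out all other experts. The hard cutoff at k is what
    makes this selection non-differentiable with respect to k.
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    exps = {i: math.exp(logits[i]) for i in keep}
    z = sum(exps.values())
    return [exps[i] / z if i in keep else 0.0 for i in range(len(logits))]
```

Every token receives exactly `k` nonzero gate weights summing to 1, regardless of how confident the router is; making both the selection and the number of active experts differentiable and token-dependent is precisely what the paper targets.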




SplInterp: Improving our Understanding and Training of Sparse Autoencoders

Budd, Jeremy, Ideami, Javier, Rynne, Benjamin Macdowall, Duggar, Keith, Balestriero, Randall

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework: we discover that SAEs generalise ``$k$-means autoencoders'' to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal ``$k$-means-esque plus local principal component analysis (PCA)'' piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp
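A TopK SAE of the kind analyzed above can be sketched in a few lines, assuming the usual encode / TopK / decode structure; names and shapes here are illustrative, not the paper's code.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, k):
    """Forward pass of a TopK sparse autoencoder.

    Encodes x linearly, keeps only the k largest code entries, then
    decodes with a dictionary whose rows are atoms. The hard TopK step
    partitions input space into regions, which is what gives these
    models their piecewise-affine, power-diagram geometry.
    """
    # encode: z = W_enc x + b_enc
    z = [sum(w * xi for w, xi in zip(row, x)) + b
         for row, b in zip(W_enc, b_enc)]
    # sparsify: keep only the k largest codes, zero the rest
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    keep = set(order[:k])
    z = [zi if i in keep else 0.0 for i, zi in enumerate(z)]
    # decode: x_hat = sum_j z_j * atom_j
    x_hat = [sum(W_dec[j][i] * z[j] for j in range(len(z)))
             for i in range(len(x))]
    return z, x_hat
```

With an identity encoder/decoder and `k = 1`, an axis-aligned input is reconstructed exactly while all other code entries are zeroed.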


Towards Efficient Training of Graph Neural Networks: A Multiscale Approach

Gal, Eshed, Eliasof, Moshe, Schönlieb, Carola-Bibiane, Haber, Eldad, Treister, Eran

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have emerged as a powerful tool for learning and inferring from graph-structured data, and are widely used in a variety of applications, often involving large amounts of data and large graphs. However, training on such data requires large memory and extensive computations. In this paper, we introduce a novel framework for efficient multiscale training of GNNs, designed to integrate information across multiscale representations of a graph. Our approach leverages a hierarchical graph representation, taking advantage of coarse graph scales in the training process, where each coarse scale graph has fewer nodes and edges. Based on this approach, we propose a suite of GNN training methods, including coarse-to-fine, sub-to-full, and multiscale gradient computation. We demonstrate the effectiveness of our methods on various datasets and learning tasks.
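The hierarchical representation described above relies on graph coarsening. A minimal sketch, assuming a node-to-cluster assignment is already given; `coarsen_graph` and `cluster_of` are hypothetical names, not the paper's implementation.

```python
def coarsen_graph(edges, cluster_of):
    """Build a coarse graph by merging fine nodes into clusters.

    Each coarse node is a cluster; a coarse edge connects two clusters
    whenever at least one fine edge crosses between them (intra-cluster
    edges collapse into the merged node and are dropped). The coarse
    graph therefore has fewer nodes and edges, which is what makes
    training on it cheaper.
    """
    coarse_edges = set()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:
            coarse_edges.add((min(cu, cv), max(cu, cv)))
    n_coarse = max(cluster_of) + 1
    return sorted(coarse_edges), n_coarse
```

A path on four nodes with pairs merged, `cluster_of = [0, 0, 1, 1]`, collapses to a single coarse edge between two coarse nodes; repeating this builds the multiscale hierarchy.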


Interpreting CLIP with Hierarchical Sparse Autoencoders

Zaigrajew, Vladimir, Baniecki, Hubert, Biecek, Przemyslaw

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited by optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA.
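The Matryoshka idea of learning representations at multiple granularities can be illustrated by reconstructing the input from nested prefixes of the code. This is a sketch of the nested-prefix objective under the assumption of a linear decoder, not the MSAE implementation; `matryoshka_losses` is a name chosen here.

```python
def matryoshka_losses(z, W_dec, x, prefixes):
    """Squared reconstruction error from nested prefixes of the code.

    For each prefix length m, the input is reconstructed using only the
    first m latents (rows of W_dec are dictionary atoms). Summing these
    losses forces early latents to carry coarse structure while later
    ones refine it, which is the hierarchical training signal.
    """
    losses = []
    for m in prefixes:
        x_hat = [sum(W_dec[j][i] * z[j] for j in range(m))
                 for i in range(len(x))]
        losses.append(sum((a - b) ** 2 for a, b in zip(x_hat, x)))
    return losses
```

With an identity decoder and `prefixes = [1, 2]`, the one-latent prefix leaves residual error that the full two-latent code removes, so the per-prefix losses decrease as the prefix grows.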