AITopics | dselect-k

Supplementary to "DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning"

Neural Information Processing Systems, Feb-11-2026, 22:47:04 GMT

MTL: In MTL, deep learning-based architectures that perform soft-parameter sharing, i.e., share model parameters partially, are proving to be effective at exploiting both the commonalities and differences among tasks [6]. Our work is also related to [5], who introduced "routers" (similar to gates) that can choose which layers or components of layers to activate per task. The routers in the latter work are not differentiable and require reinforcement learning. To construct α, there are two cases to consider: (i) s = k and (ii) s < k. If s = k, then set α_i = log(w_{t_i}) for i ∈ [k]. Our base case is for t = 1.
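
The construction α_i = log(w_{t_i}) quoted above can be illustrated with a small sketch. It assumes, as is not stated in this excerpt, that α is passed through a softmax to produce the mixing weights; under that assumption, taking logs of a set of positive weights that sum to one and applying the softmax recovers the same weights. The values in `w` are made up for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical target weights (positive, summing to 1); not taken from the paper.
w = [0.2, 0.3, 0.5]

# Set alpha_i = log(w_i), mirroring the construction in the excerpt.
alpha = [math.log(wi) for wi in w]

# Assumption: the mixing weights are softmax(alpha).
recovered = softmax(alpha)
```

Because exp(log(w_i)) = w_i and the weights already sum to one, `recovered` matches `w` up to floating-point error.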

artificial intelligence, dselect-k, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

f5ac21cd0ef1b88e9848571aeb53551a-Paper.pdf

Neural Information Processing Systems, Feb-11-2026, 22:47:01 GMT

dselect-k, neural network, selector, (14 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > France (0.04)

Genre: Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Neural Information Processing Systems, Dec-25-2025, 06:33:27 GMT

The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable "sparse gate" to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.
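
The claim that Top-k gates are not smooth can be seen in a minimal sketch. It assumes the common formulation in which the k largest logits are kept, a softmax is taken over them, and all other experts receive weight zero; the function name `top_k_gate` and the example logits are illustrative and not taken from the paper's code.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest logits, softmax over them, zero weight elsewhere."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in keep])
    gate = [0.0] * len(logits)
    for i, w in zip(keep, weights):
        gate[i] = w
    return gate

# Two inputs whose third logit differs by only 2e-6.
before = top_k_gate([1.0, 0.5, 0.5 - 1e-6], k=2)  # experts 0 and 1 selected
after = top_k_gate([1.0, 0.5, 0.5 + 1e-6], k=2)   # experts 0 and 2 selected
```

A change of 2e-6 in one logit moves roughly 0.38 of the gate weight from expert 1 to expert 2, so the gate output jumps rather than varying continuously.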

application, differentiable selection, dselect-k, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

f5ac21cd0ef1b88e9848571aeb53551a-Supplemental.pdf

Neural Information Processing Systems, Aug-18-2025, 22:02:25 GMT

Supplementary to "DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning". In MTL, deep learning-based architectures that perform soft-parameter sharing, i.e., share model parameters partially, are proving to be effective at exploiting both the commonalities and differences among tasks […] This approach is similar to static gating, but it does not support per-example gating. Moreover, the number of nonzeros cannot be directly controlled (in contrast to our gate). Next, we show Direction (II). From the definition of r(·), the following holds: r(S(v)) […] The penalty described above is part of our TensorFlow implementation of DSelect-k. Note that the logistic function is re-scaled to be on the same scale as the smooth-step function. Figure B.1: The Smooth-step (γ = 1) and Logistic functions.
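
The two functions compared in Figure B.1 can be sketched as follows. The excerpt does not give the formula for the smooth-step function, so the cubic form below is an assumption (a piecewise cubic that is exactly 0 below -γ/2, exactly 1 above γ/2, and continuously differentiable in between). The re-scaling applied to the logistic function in the figure is also not given here, so the plain logistic is shown.

```python
import math

def smooth_step(t, gamma=1.0):
    """Assumed cubic smooth-step: 0 for t <= -gamma/2, 1 for t >= gamma/2."""
    if t <= -gamma / 2:
        return 0.0
    if t >= gamma / 2:
        return 1.0
    # Cubic interpolation whose slope is zero at both ends of the interval.
    return -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5

def logistic(t):
    """Standard logistic function; never reaches exactly 0 or 1."""
    return 1.0 / (1.0 + math.exp(-t))

# Compare the two functions on a few sample points.
samples = [-2.0, -0.5, -0.25, 0.0, 0.25, 0.5, 2.0]
table = [(t, smooth_step(t), logistic(t)) for t in samples]
```

Unlike the logistic function, the smooth-step function outputs exact zeros and ones outside a bounded interval, which is what lets a gate built on it be sparse while remaining differentiable.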

dselect-k, jaccard index, moe, (15 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

f5ac21cd0ef1b88e9848571aeb53551a-Paper.pdf

Neural Information Processing Systems, Aug-18-2025, 22:02:21 GMT

artificial intelligence, dselect-k, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > France (0.04)

Genre: Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Neural Information Processing Systems, Jan-19-2025, 14:06:14 GMT

The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable "sparse gate" to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.

differentiable selection, dselect-k, multi-task learning, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Hazimeh, Hussein; Zhao, Zhe; Chowdhery, Aakanksha; Sathiamoorthy, Maheswaran; Chen, Yihua; Mazumder, Rahul; Hong, Lichan; Chi, Ed H.

arXiv.org Machine Learning, Jun-9-2021

The Mixture-of-Experts (MoE) architecture is showing promising results in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. Our gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k in the context of MTL, on both synthetic and real datasets with up to 128 tasks. Our experiments indicate that MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance. Notably, on a real-world large-scale recommender system, DSelect-k achieves over 22% average improvement in predictive performance compared to the Top-k gate. We provide an open-source TensorFlow implementation of our gate.
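
The abstract does not spell out the binary encoding, so the following is only a rough sketch of how such a gate might look, under these assumptions: the number of experts is a power of two (n = 2^m), each of k "single-expert selectors" holds m real parameters that are squashed by a cubic smooth-step function, and expert i receives the product of those values according to the bits of i. The selectors are then mixed with softmax weights. All function and parameter names are hypothetical; this is not the authors' TensorFlow implementation.

```python
import math

def smooth_step(t, gamma=1.0):
    """Assumed cubic smooth-step: exactly 0 or 1 outside [-gamma/2, gamma/2]."""
    if t <= -gamma / 2:
        return 0.0
    if t >= gamma / 2:
        return 1.0
    return -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def single_expert_selector(z, gamma=1.0):
    """Weights over 2**len(z) experts; bit j of the expert index picks S(z_j) or 1 - S(z_j)."""
    s = [smooth_step(zj, gamma) for zj in z]
    weights = []
    for expert in range(2 ** len(z)):
        w = 1.0
        for j, sj in enumerate(s):
            w *= sj if (expert >> j) & 1 else 1.0 - sj
        weights.append(w)
    return weights

def dselect_k_gate(alpha, Z, gamma=1.0):
    """Mix k single-expert selectors (rows of Z) with softmax(alpha) weights."""
    mix = softmax(alpha)
    gate = [0.0] * (2 ** len(Z[0]))
    for a, z in zip(mix, Z):
        for i, w in enumerate(single_expert_selector(z, gamma)):
            gate[i] += a * w
    return gate

# k = 2 selectors over n = 4 experts; each selector has saturated to one expert.
gate = dselect_k_gate(alpha=[0.0, 0.0], Z=[[1.0, -1.0], [-1.0, 1.0]])
# → [0.0, 0.5, 0.5, 0.0]
```

Once every smooth-step output has saturated to 0 or 1, each selector picks exactly one expert, so the gate has at most k nonzero entries. In this sketch that is how k controls the number of selected experts while every operation remains differentiable.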

dataset, dselect-k, moe, (14 more...)

arXiv.org Machine Learning

2106.0376

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > France (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
