AITopics | Akbarian, Pedram

Collaborating Authors

Akbarian, Pedram

Sigmoid Self-Attention is Better than Softmax Self-Attention: A Mixture-of-Experts Perspective

Yan, Fanqi, Nguyen, Huy, Akbarian, Pedram, Ho, Nhat, Rinaldo, Alessandro

arXiv.org Artificial Intelligence, Jan-31-2025

Transformer models [54] are known as the state-of-the-art architecture for a wide range of machine learning and deep learning applications, including language modeling [16, 3, 47, 51], computer vision [17, 4, 46, 35], and reinforcement learning [5, 31, 25]. A central component behind the success of Transformer models is the self-attention mechanism, which enables sequence-to-sequence models to concentrate on relevant parts of the input data. In particular, for each token in an input sequence, the self-attention mechanism computes a context vector as a weighted sum of the tokens, where tokens more relevant to the context are assigned larger weights than others (see Section 2.1 for a formal definition). Self-attention can therefore capture long-range dependencies and complex relationships within the data. However, since the weights in the context vector are normalized by the softmax function, there might be an undesirable competition among the tokens: an increase in the weight of one token leads to a decrease in the weights of the others. As a consequence, the traditional softmax self-attention mechanism might focus on only a few aspects of the data and possibly ignore other informative features [48]. Additionally, [22] found that the tokens' inner dependence on the attention scores owing to the softmax normalization partly causes the attention sink phenomenon occurring
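The competition among tokens that softmax normalization introduces can be seen in a small numeric sketch. The scores are made-up values, and `softmax_weights` and `sigmoid_weights` are illustrative helper names rather than functions from the paper; the sigmoid variant here is a plain elementwise logistic function, without any bias or scaling term the paper may use.

```python
import math

def softmax_weights(scores):
    """Normalize scores so the weights sum to 1 (tokens compete for mass)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_weights(scores):
    """Squash each score independently; no normalization across tokens."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

scores = [1.0, 0.5, -0.5]    # hypothetical attention scores for three tokens
boosted = [3.0, 0.5, -0.5]   # raise only the first token's score

soft_before, soft_after = softmax_weights(scores), softmax_weights(boosted)
sig_before, sig_after = sigmoid_weights(scores), sigmoid_weights(boosted)

print("softmax:", soft_before, "->", soft_after)
print("sigmoid:", sig_before, "->", sig_after)
```

Raising the first score lowers the softmax weights of the other two tokens, while their sigmoid weights stay exactly where they were.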

artificial intelligence, deep learning, machine learning

2502.00281

Country:

Asia (0.28)
Europe (0.27)
North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts

Yan, Fanqi, Nguyen, Huy, Le, Dung, Akbarian, Pedram, Ho, Nhat

arXiv.org Machine Learning, Oct-16-2024

We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt learning problem, where one utilizes prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for learning downstream tasks. Two fundamental challenges emerge from the analysis: (i) the mixing proportion between the pre-trained model and the prompt may converge to zero, so that the prompt vanishes during training; (ii) an algebraic interaction among the parameters of the pre-trained model and the prompt can occur via some partial differential equation and slow down prompt learning. In response, we introduce a distinguishability condition to control this parameter interaction. Additionally, we consider various types of expert structures to understand their effects on parameter estimation. In each scenario, we provide comprehensive convergence rates of parameter estimation along with the corresponding minimax lower bounds.
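One way to picture the model is as a two-component conditional density in which a frozen pre-trained component is mixed with a trainable prompt component. The sketch below assumes Gaussian components with linear means; those choices, the function names, and all numbers are assumptions made for illustration, not the paper's exact setup.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def contaminated_density(y, x, lam, prompt_slope, prompt_var):
    """(1 - lam) * frozen pre-trained density + lam * prompt density."""
    # Frozen pre-trained component (assumed form: mean 0.5 * x, unit variance).
    pretrained = gaussian_pdf(y, mean=0.5 * x, var=1.0)
    # Trainable prompt component, treated as an expert.
    prompt = gaussian_pdf(y, mean=prompt_slope * x, var=prompt_var)
    return (1.0 - lam) * pretrained + lam * prompt

x, y = 2.0, 1.0
only_pretrained = contaminated_density(y, x, lam=0.0, prompt_slope=3.0, prompt_var=0.5)
mixed = contaminated_density(y, x, lam=0.3, prompt_slope=3.0, prompt_var=0.5)
# A prompt that coincides with the pre-trained component: lam no longer matters.
indistinct = contaminated_density(y, x, lam=0.3, prompt_slope=0.5, prompt_var=1.0)

print(only_pretrained, mixed, indistinct)
```

When `lam` is 0 the prompt parameters have no effect on the density, and when the prompt coincides with the pre-trained component every `lam` gives the same density. Loosely, these are the kinds of degenerate cases that make the parameters hard to estimate.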

artificial intelligence, equation, machine learning

2410.12258

Country:

Asia (0.27)
North America > United States > Texas (0.14)

Genre: Research Report (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Quadratic Gating Functions in Mixture of Experts: A Statistical Insight

Akbarian, Pedram, Nguyen, Huy, Han, Xing, Ho, Nhat

arXiv.org Machine Learning, Oct-15-2024

Mixture of Experts (MoE) models are highly effective in scaling model capacity while preserving computational efficiency, with the gating network, or router, playing a central role by directing inputs to the appropriate experts. In this paper, we establish a novel connection between MoE frameworks and attention mechanisms, demonstrating how quadratic gating can serve as a more expressive and efficient alternative. Motivated by this insight, we explore the implementation of quadratic gating within MoE models, identifying a connection between the self-attention mechanism and the quadratic gating. We conduct a comprehensive theoretical analysis of the quadratic softmax gating MoE framework, showing improved sample efficiency in expert and parameter estimation. Our analysis provides key insights into optimal designs for quadratic gating and expert functions, further elucidating the principles behind widely used attention mechanisms. Through extensive evaluations, we demonstrate that the quadratic gating MoE outperforms the traditional linear gating MoE. Moreover, our theoretical insights have guided the development of a novel attention mechanism, which we validated through extensive experiments. The results demonstrate its favorable performance over conventional models across various tasks.
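As a rough illustration of the difference between linear and quadratic gating, the sketch below scores each expert with a quadratic form in the input and passes the scores through a softmax. The parameterization `(A, b, c)`, the helper names, and the numbers are assumptions for this example; the bilinear term is loosely analogous to a query-key product in attention, but the paper's exact formulation may differ.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_score(x, b, c):
    """Conventional linear gate score: b^T x + c."""
    return sum(bi * xi for bi, xi in zip(b, x)) + c

def quadratic_score(x, A, b, c):
    """Quadratic gate score: x^T A x + b^T x + c."""
    d = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(d) for j in range(d))
    return quad + linear_score(x, b, c)

def quadratic_gate(x, experts):
    """experts is a list of hypothetical (A, b, c) gate parameters, one per expert."""
    return softmax([quadratic_score(x, A, b, c) for A, b, c in experts])

x = [1.0, -2.0]
experts = [
    ([[0.5, 0.0], [0.0, 0.5]], [1.0, 0.0], 0.0),
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 1.0], 0.1),  # A = 0 reduces to a linear gate
]
weights = quadratic_gate(x, experts)
print(weights)
```

Setting `A` to the zero matrix recovers the linear gate, so the quadratic gate is a strict generalization in this sketch.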

artificial intelligence, machine learning, natural language

2410.11222

Country:

North America > United States > Texas (0.14)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Statistical Advantages of Perturbing Cosine Router in Sparse Mixture of Experts

Nguyen, Huy, Akbarian, Pedram, Pham, Trang, Nguyen, Trang, Zhang, Shujian, Ho, Nhat

arXiv.org Machine Learning, May-22-2024

The cosine router in sparse Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potential. Despite its empirical success, a comprehensive analysis of the cosine router in sparse MoE has been lacking. Considering the least squares estimation of the cosine routing sparse MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$, where $\tau > 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by a technique widely used in practice to stabilize the cosine router: simply adding noise to the $\mathbb{L}_{2}$ norms in the cosine router, which we refer to as the perturbed cosine router. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing sparse MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
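A minimal sketch of the idea, assuming the router scores each expert by the cosine similarity between the input and an expert vector, and that the perturbation is a small positive constant added to each L2 norm. The names `tau_w` and `tau_x`, the fixed noise values, and the example vectors are all hypothetical; in practice the noise may be chosen differently.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine_score(x, w, tau_w=0.0, tau_x=0.0):
    """Cosine similarity of w and x; tau_w and tau_x perturb the two L2 norms."""
    dot = sum(a * b for a, b in zip(w, x))
    return dot / ((norm(w) + tau_w) * (norm(x) + tau_x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, expert_vectors, tau_w=0.0, tau_x=0.0):
    """Routing weights over experts from (optionally perturbed) cosine scores."""
    return softmax([cosine_score(x, w, tau_w, tau_x) for w in expert_vectors])

x = [2.0, 4.0]
expert_vectors = [[1.0, 2.0], [-1.0, 0.5]]

plain = route(x, expert_vectors)
perturbed = route(x, expert_vectors, tau_w=0.1, tau_x=0.1)
print(plain, perturbed)
```

Without perturbation the score depends only on the directions of `x` and `w`, so rescaling either vector changes nothing; the added constants break that exact scale invariance.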

artificial intelligence, cosine router, machine learning

2405.14131

Country: North America > United States > Texas (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?

Nguyen, Huy, Akbarian, Pedram, Ho, Nhat

arXiv.org Artificial Intelligence, Jan-24-2024

Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to the well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabilize the expert specialization. Nevertheless, while there are previous attempts to theoretically comprehend the sparse MoE, a comprehensive analysis of the dense-to-sparse gating MoE has remained elusive. Therefore, we aim to explore the impacts of the dense-to-sparse gate on the maximum likelihood estimation under the Gaussian MoE in this paper. We demonstrate that due to interactions between the temperature and other model parameters via some partial differential equations, the convergence rates of parameter estimation are slower than any polynomial rates, and could be as slow as $\mathcal{O}(1/\log(n))$, where $n$ denotes the sample size. To address this issue, we propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering it to the softmax function. By imposing linear independence conditions on the activation function and its derivatives, we show that the parameter estimation rates are significantly improved to polynomial rates.
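The role of the temperature can be sketched as follows: dividing the gate scores by a temperature before the softmax moves the weights from nearly uniform (high temperature) to nearly one-hot (low temperature). The `activation_gate` variant applies an activation to the scores first; `math.tanh` is only a placeholder here, and whether it meets the independence conditions the abstract mentions is not checked.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_gate(scores, temperature):
    """Softmax gate whose sharpness is controlled by the temperature."""
    return softmax([s / temperature for s in scores])

def activation_gate(scores, temperature, activation=math.tanh):
    """Apply an activation to the linear scores before the temperature softmax."""
    return softmax([activation(s) / temperature for s in scores])

scores = [2.0, 1.0, 0.0]   # hypothetical outputs of a linear gating layer

dense = temperature_gate(scores, temperature=10.0)   # close to uniform
sparse = temperature_gate(scores, temperature=0.1)   # close to one-hot
activated = activation_gate(scores, temperature=0.1)
print(dense, sparse, activated)
```

Lowering the temperature over the course of training is what moves such a gate from dense to sparse routing.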

artificial intelligence, exp, machine learning

2401.13875

Country: North America > United States > Texas (0.14)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)

A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts

Nguyen, Huy, Akbarian, Pedram, Nguyen, TrungTin, Ho, Nhat

arXiv.org Machine Learning, Oct-22-2023

The mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the Gaussian MoE model, such analysis under the setting of a classification problem has remained missing in the literature. We close this gap by establishing the convergence rates of density estimation and parameter estimation in the softmax gating multinomial logistic MoE model. Notably, when part of the expert parameters vanish, these rates are shown to be slower than polynomial rates owing to an inherent interaction between the softmax gating and expert functions via partial differential equations. To address this issue, we propose using a novel class of modified softmax gating functions which transform the input value before delivering it to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
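A minimal sketch of a softmax gating multinomial logistic MoE, in which each expert is a multinomial logistic classifier and the gate mixes their class probabilities. The optional `transform` argument stands in for the modified gate, which transforms the input before it reaches the gating function; the elementwise tanh used below, the parameter layout, and all values are assumptions for illustration only.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def class_probabilities(x, gates, experts, transform=lambda v: v):
    """Mix the class probabilities of multinomial logistic experts.

    gates:   one (weight_vector, bias) pair per expert
    experts: per expert, one (weight_vector, bias) pair per class
    """
    z = transform(x)  # only the gate sees the transformed input
    gate_weights = softmax([dot(w, z) + b for w, b in gates])
    num_classes = len(experts[0])
    mixed = [0.0] * num_classes
    for weight, expert in zip(gate_weights, experts):
        probs = softmax([dot(w, x) + b for w, b in expert])
        for k in range(num_classes):
            mixed[k] += weight * probs[k]
    return mixed

x = [0.5, -1.0]
gates = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
experts = [
    [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([0.0, 0.0], 0.0)],
    [([-1.0, 0.0], 0.0), ([0.0, -1.0], 0.0), ([0.0, 0.0], 0.0)],
]

standard = class_probabilities(x, gates, experts)
modified = class_probabilities(x, gates, experts,
                               transform=lambda v: [math.tanh(t) for t in v])
print(standard, modified)
```

Both variants return a valid probability vector over the classes; they differ only in what the gate sees.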

artificial intelligence, exp, machine learning

2310.14188

Country:

Europe > France (0.14)
North America > United States > Texas (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

Nguyen, Huy, Akbarian, Pedram, Yan, Fanqi, Ho, Nhat

arXiv.org Machine Learning, Sep-24-2023

Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimation. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts $k_{\ast}$ is known, we demonstrate that the convergence rates of density and parameter estimation are both parametric in the sample size. However, when $k_{\ast}$ is unknown and the true model is over-specified by a Gaussian mixture of $k$ experts where $k > k_{\ast}$, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slower due to an intrinsic interaction between the softmax gating and expert functions.
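The gating function itself can be sketched in a few lines: keep the K largest gate scores, apply a softmax over those, and give every other expert weight zero. The function name and the scores are hypothetical, and ties are broken by expert index in this sketch.

```python
import math

def top_k_softmax(scores, k):
    """Softmax over the k largest scores; all other experts get weight 0."""
    # Indices of the k largest scores (stable sort, so ties favor lower indices).
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in keep)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

scores = [0.2, 1.5, -0.3, 0.9]   # hypothetical gate scores for four experts
weights = top_k_softmax(scores, k=2)
print(weights)
```

Because the set of selected experts changes with the input, the gate behaves differently on different regions of the input space, which is the partitioning the abstract refers to.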

artificial intelligence, exp, machine learning

2309.1385

Country: North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
