AITopics | Mazumder, Rahul

Collaborating Authors

Mazumder, Rahul

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Liu, Zirui, Song, Qingquan, Xiao, Qiang Charles, Selvaraj, Sathiya Keerthi, Mazumder, Rahul, Gupta, Aman, Hu, Xia

arXiv.org Artificial IntelligenceJan-8-2024

The large number of parameters in Pretrained Language Models enhance their performance, but also make them resource-intensive, making it challenging to deploy them on commodity hardware like a single GPU. Due to the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency. Therefore, optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant portion of the efficiency challenge is the Feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ total parameters and inference latency. In this paper, we first observe that only a few neurons of FFN module have large output norm for any input tokens, a.k.a. heavy hitters, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resource to FFN parts with heavy hitters. In practice, our method can reduce model size by 43.1\% and bring $1.25\sim1.56\times$ wall clock time speedup on different hardware with negligible accuracy drop.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2401.04044

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.85)

Add feedback

QuantEase: Optimization-based Quantization for Language Models

Behdin, Kayhan, Acharya, Ayan, Gupta, Aman, Song, Qingquan, Zhu, Siyu, Keerthi, Sathiya, Mazumder, Rahul

arXiv.org Machine LearningDec-1-2023

With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in compression techniques that enable their efficient deployment. This study focuses on the Post-Training Quantization (PTQ) of LLMs. Drawing from recent advances, our work introduces QuantEase, a layer-wise quantization framework where individual layers undergo separate quantization. The problem is framed as a discrete-structured non-convex optimization, prompting the development of algorithms rooted in Coordinate Descent (CD) techniques. These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, our CD-based approach features straightforward updates, relying solely on matrix and vector operations, circumventing the need for matrix inversion or decomposition. We also explore an outlier-aware variant of our approach, allowing for retaining significant weights (outliers) with complete precision. Our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy in empirical evaluations across various LLMs and datasets, with relative improvements up to 15% over methods such as GPTQ. Leveraging careful linear algebra optimizations, QuantEase can quantize models like Falcon-180B on a single NVIDIA A100 GPU in $\sim$3 hours. Particularly noteworthy is our outlier-aware algorithm's capability to achieve near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, obviating the need for non-uniform quantization or grouping techniques, improving upon methods such as SpQR by up to two times in terms of perplexity.

machine learning, natural language, quantease, (19 more...)

arXiv.org Machine Learning

2309.01885

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

End-to-end Feature Selection Approach for Learning Skinny Trees

Ibrahim, Shibal, Behdin, Kayhan, Mazumder, Rahul

arXiv.org Artificial IntelligenceOct-27-2023

Joint feature selection and tree ensemble learning is a challenging task. Popular tree ensemble toolkits e.g., Gradient Boosted Trees and Random Forests support feature selection post-training based on feature importances, which are known to be misleading, and can significantly hurt performance. We propose Skinny Trees: a toolkit for feature selection in tree ensembles, such that feature selection and tree ensemble learning occurs simultaneously. It is based on an end-to-end optimization approach that considers feature selection in differentiable trees with Group $\ell_0 - \ell_2$ regularization. We optimize with a first-order proximal method and present convergence guarantees for a non-convex and non-smooth objective. Interestingly, dense-to-sparse regularization scheduling can lead to more expressive and sparser tree ensembles than vanilla proximal method. On 15 synthetic and real-world datasets, Skinny Trees can achieve $1.5\times$ - $620\times$ feature compression rates, leading up to $10\times$ faster inference over dense trees, without any loss in performance. Skinny Trees lead to superior feature selection than many existing toolkits e.g., in terms of AUC performance for $25\%$ feature budget, Skinny Trees outperforms LightGBM by $10.2\%$ (up to $37.7\%$), and Random Forests by $3\%$ (up to $12.5\%$).

artificial intelligence, machine learning, selection, (17 more...)

arXiv.org Artificial Intelligence

2310.18542

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

On the Convergence of CART under Sufficient Impurity Decrease Condition

Mazumder, Rahul, Wang, Haoyue

arXiv.org Machine LearningOct-25-2023

The decision tree is a flexible machine learning model that finds its success in numerous applications. It is usually fitted in a recursively greedy manner using CART. In this paper, we investigate the convergence rate of CART under a regression setting. First, we establish an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition \cite{chi2022asymptotic} -- our result improves upon the known result by \cite{chi2022asymptotic} under a similar assumption. Furthermore, we provide examples that demonstrate the error bound cannot be further improved by more than a constant or a logarithmic factor. Second, we introduce a set of easily verifiable sufficient conditions for the SID condition. Specifically, we demonstrate that the SID condition can be satisfied in the case of an additive model, provided that the component functions adhere to a ``locally reverse Poincar{\'e} inequality". We discuss several well-known function classes in non-parametric estimation to illustrate the practical utility of this concept.

artificial intelligence, inequality, machine learning, (18 more...)

arXiv.org Machine Learning

2310.17114

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

Add feedback

mSAM: Micro-Batch-Averaged Sharpness-Aware Minimization

Behdin, Kayhan, Song, Qingquan, Gupta, Aman, Keerthi, Sathiya, Acharya, Ayan, Ocejo, Borja, Dexter, Gregory, Khanna, Rajiv, Durfee, David, Mazumder, Rahul

arXiv.org Machine LearningSep-30-2023

Modern deep learning models are over-parameterized, where different optima can result in widely varying generalization performance. The Sharpness-Aware Minimization (SAM) technique modifies the fundamental loss function that steers gradient descent methods toward flatter minima, which are believed to exhibit enhanced generalization prowess. Our study delves into a specific variant of SAM known as micro-batch SAM (mSAM). This variation involves aggregating updates derived from adversarial perturbations across multiple shards (micro-batches) of a mini-batch during training. We extend a recently developed and well-studied general framework for flatness analysis to theoretically show that SAM achieves flatter minima than SGD, and mSAM achieves even flatter minima than SAM. We provide a thorough empirical evaluation of various image classification and natural language processing tasks to substantiate this theoretical advancement. We also show that contrary to previous work, mSAM can be implemented in a flexible and parallelizable manner without significantly increasing computational costs. Our implementation of mSAM yields superior generalization performance across a wide range of tasks compared to SAM, further supporting our theoretical framework.

machine learning, msam, natural language, (17 more...)

arXiv.org Machine Learning

2302.09693

Country:

North America > United States (0.14)
Europe > Belgium (0.14)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Sparse Gaussian Graphical Models with Discrete Optimization: Computational and Statistical Perspectives

Behdin, Kayhan, Chen, Wenyu, Mazumder, Rahul

arXiv.org Artificial IntelligenceJul-18-2023

We consider the problem of learning a sparse graph underlying an undirected Gaussian graphical model, a key problem in statistical machine learning. Given $n$ samples from a multivariate Gaussian distribution with $p$ variables, the goal is to estimate the $p \times p$ inverse covariance matrix (aka precision matrix), assuming it is sparse (i.e., has a few nonzero entries). We propose GraphL0BnB, a new estimator based on an $\ell_0$-penalized version of the pseudolikelihood function, while most earlier approaches are based on the $\ell_1$-relaxation. Our estimator can be formulated as a convex mixed integer program (MIP) which can be difficult to compute at scale using off-the-shelf commercial solvers. To solve the MIP, we propose a custom nonlinear branch-and-bound (BnB) framework that solves node relaxations with tailored first-order methods. As a by-product of our BnB framework, we propose large-scale solvers for obtaining good primal solutions that are of independent interest. We derive novel statistical guarantees (estimation and variable selection) for our estimator and discuss how our approach improves upon existing estimators. Our numerical experiments on real/synthetic datasets suggest that our method can solve, to near-optimality, problem instances with $p = 10^4$ -- corresponding to a symmetric matrix of size $p \times p$ with $p^2/2$ binary variables. We demonstrate the usefulness of GraphL0BnB versus various state-of-the-art approaches on a range of datasets.

artificial intelligence, machine learning, survey article, (17 more...)

arXiv.org Artificial Intelligence

2307.09366

Country: North America > United States > Massachusetts (0.14)

Genre:

Research Report > Promising Solution (0.65)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

FIRE: An Optimization Approach for Fast Interpretable Rule Extraction

Liu, Brian, Mazumder, Rahul

arXiv.org Artificial IntelligenceJun-12-2023

We present FIRE, Fast Interpretable Rule Extraction, an optimization-based framework to extract a small but useful collection of decision rules from tree ensembles. FIRE selects sparse representative subsets of rules from tree ensembles, that are easy for a practitioner to examine. To further enhance the interpretability of the extracted model, FIRE encourages fusing rules during selection, so that many of the selected decision rules share common antecedents. The optimization framework utilizes a fusion regularization penalty to accomplish this, along with a non-convex sparsity-inducing penalty to aggressively select rules. Optimization problems in FIRE pose a challenge to off-the-shelf solvers due to problem scale and the non-convexity of the penalties. To address this, making use of problem-structure, we develop a specialized solver based on block coordinate descent principles; our solver performs up to 40x faster than existing solvers. We show in our experiments that FIRE outperforms state-of-the-art rule ensemble algorithms at building sparse rule sets, and can deliver more interpretable models compared to existing methods.

artificial intelligence, ensemble, optimization problem, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3580305.3599353

2306.07432

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)

Add feedback

COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search

Ibrahim, Shibal, Chen, Wenyu, Hazimeh, Hussein, Ponomareva, Natalia, Zhao, Zhe, Mazumder, Rahul

arXiv.org Artificial IntelligenceJun-5-2023

The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains, such as natural language processing and vision. Sparse-MoEs select a subset of the "experts" (thus, only a portion of the overall network) for each input sample using a sparse, trainable gate. Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods. In this paper, we introduce two improvements to current MoE approaches. First, we propose a new sparse gate: COMET, which relies on a novel tree-based mechanism. COMET is differentiable, can exploit sparsity to speed up computation, and outperforms state-of-the-art gates. Second, due to the challenging combinatorial nature of sparse expert selection, first-order methods are typically prone to low-quality solutions. To deal with this challenge, we propose a novel, permutation-based local search method that can complement first-order methods in training any sparse gate, e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can help networks escape bad initializations or solutions. We performed large-scale experiments on various domains, including recommender systems, vision, and natural language processing. On standard vision and recommender systems benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over prior differentiable gates e.g., DSelect-k. When Top-k and Hash gates are combined with local search, we see up to $100\times$ reduction in the budget needed for hyperparameter tuning. Moreover, for language modeling, our approach improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7 GLUE benchmarks as well as SQuAD dataset.

comet, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3580305.3599278

2306.02824

Country:

Asia (0.69)
North America > United States > California > Los Angeles County > Long Beach (0.15)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Washington > King County > Seattle (0.14)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Matrix Completion from General Deterministic Sampling Patterns

Lee, Hanbyul, Mazumder, Rahul, Song, Qifan, Honorio, Jean

arXiv.org Artificial IntelligenceJun-4-2023

Low-rank matrix completion is to exactly or approximately recover an underlying rank-r matrix from a small number of observed entries of the matrix. It has received much attention in a wide range of applications including collaborative filtering [Goldberg et al., 1992], phase retrieval [Candes et al., 2015] and image processing [Chen and Suter, 2004]. In research on establishing provable guarantees for low-rank matrix completion methods, typical assumptions considered are as follows: first, the underlying matrix is incoherent; second, observable matrix entries are sampled according to a probabilistic (usually uniform) model. However, the latter assumption is easily violated in numerous situations; it is unlikely realistic for the sampling patterns to be uniformly at random outside experimental settings, and it may not be even reasonable to model the sampling patterns as random. With this motivation, we aim to tackle the low-rank matrix completion problem without imposing any model assumptions on the sampling patterns. Even though there have been several works on non-random sampling schemes, they have imposed additional structural assumptions on the sampling pattern which are also not applicable to many real-world scenarios. For example, Heiman et al. [2014], Bhojanapalli and Jain [2014] and Burnwal and Vidyasagar [2020] assumed that the number of observed entries is the same for each row and column of the matrix, and Bishop and Yu [2014] introduced systematic assumptions on the subsets of the observed entries.

artificial intelligence, machine learning, matrix, (17 more...)

arXiv.org Artificial Intelligence

2306.02283

Genre: Research Report (0.64)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (0.54)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.48)
Information Technology > Artificial Intelligence > Machine Learning (0.46)

Add feedback

Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance

Ibrahim, Shibal, Ponomareva, Natalia, Mazumder, Rahul

arXiv.org Artificial IntelligenceMay-26-2023

Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identification of best models to transfer-learn from and quantifying transferability prevents expensive re-training on all of the candidate models/tasks pairs. In this paper, we show that the statistical problems with covariance estimation drive the poor performance of H-score -- a common baseline for newer metrics -- and propose shrinkage-based estimator. This results in up to 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure. Our shrinkage-based H-score is $3\times$-10$\times$ faster to compute compared to LogME. Additionally, we look into a less common setting of target (as opposed to source) task selection. We demonstrate previously overlooked problems in such settings with different number of labels, class-imbalance ratios etc. for some recent metrics e.g., NCE, LEEP that resulted in them being misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We support our findings with ~164,000 (fine-tuning trials) experiments on both vision models and graph neural networks.

artificial intelligence, machine learning, transferability measure, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-26387-3_42

2110.06893

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback