AITopics | Jitkrittum, Wittawat

Jitkrittum, Wittawat

I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning

Rabanser, Stephan, Rauschmayr, Nathalie, Kulshrestha, Achin, Poklukar, Petra, Jitkrittum, Wittawat, Augenstein, Sean, Wang, Congchao, Tombari, Federico

arXiv.org Artificial Intelligence, Feb-26-2025

Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, local smaller models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.

artificial intelligence, machine learning, natural language, (15 more...)

2502.19335

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Arizona (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Universal Model Routing for Efficient LLM Inference

Jitkrittum, Wittawat, Narasimhan, Harikrishna, Rawat, Ankit Singh, Juneja, Jeevesh, Wang, Zifeng, Lee, Chen-Yu, Shenoy, Pradeep, Panigrahy, Rina, Menon, Aditya Krishna, Kumar, Sanjiv

arXiv.org Artificial Intelligence, Feb-12-2025

Large language models' significant advances in capabilities are accompanied by significant increases in inference costs. Model routing is a simple technique for reducing inference cost, wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective strategies, relying on cluster-based routing and a learned cluster map respectively. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors. Experiments on a range of public benchmarks show the effectiveness of the proposed strategies in routing amongst more than 30 unseen LLMs.
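
As a rough illustration of the cluster-based idea described in this abstract, the sketch below represents each LLM by its error rate on each cluster of a set of representative prompts, then routes a new prompt to the LLM with the lowest cost-adjusted error in that prompt's cluster. The function names, the `cost_weight` parameter, and the additive error-plus-cost score are assumptions made for this example; they are not taken from the paper, and prompt clustering itself is assumed to have been done elsewhere.

```python
def llm_feature_vector(correct, cluster_ids, num_clusters):
    """Per-cluster error rate of one LLM on the representative prompts.

    correct[i] is True if the LLM answered prompt i correctly;
    cluster_ids[i] is the (precomputed, hypothetical) cluster of prompt i.
    """
    errors = [0] * num_clusters
    counts = [0] * num_clusters
    for ok, c in zip(correct, cluster_ids):
        counts[c] += 1
        errors[c] += 0 if ok else 1
    # Assumption: an LLM with no evidence in a cluster is treated as always wrong.
    return [e / n if n else 1.0 for e, n in zip(errors, counts)]


def route(cluster_id, features, costs, cost_weight=0.1):
    """Pick the LLM with the lowest (error + cost_weight * cost) in this cluster."""
    return min(
        features,
        key=lambda name: features[name][cluster_id] + cost_weight * costs[name],
    )


# Toy data: four representative prompts in two clusters.
cluster_ids = [0, 0, 1, 1]
features = {
    "small": llm_feature_vector([True, True, False, False], cluster_ids, 2),
    "large": llm_feature_vector([True, True, True, False], cluster_ids, 2),
}
costs = {"small": 1.0, "large": 5.0}

easy_choice = route(0, features, costs, cost_weight=0.05)
hard_choice = route(1, features, costs, cost_weight=0.05)
```

Because a new LLM only needs to be evaluated on the representative prompts to obtain its feature vector, it can be added to `features` without retraining anything else, which is the property the abstract emphasises.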

large language model, machine learning, natural language, (18 more...)

2502.08773

Country:

Asia (0.92)
North America > Mexico > Mexico City (0.14)
North America > United States > California (0.14)
North America > United States > Arizona (0.14)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Rawat, Ankit Singh, Sadhanala, Veeranjaneyulu, Rostamizadeh, Afshin, Chakrabarti, Ayan, Jitkrittum, Wittawat, Feinberg, Vladimir, Kim, Seungyeon, Harutyunyan, Hrayr, Saunshi, Nikunj, Nado, Zachary, Shivanna, Rakesh, Reddi, Sashank J., Menon, Aditya Krishna, Anil, Rohan, Kumar, Sanjiv

arXiv.org Artificial Intelligence, Oct-24-2024

A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
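
One common way to use soft labels as additional supervision is to mix the usual one-hot cross-entropy with a cross-entropy against the smaller model's predictive distribution. The sketch below shows that generic mixture for a single next-token prediction; the mixing weight `alpha` and the function names are assumptions for illustration, and the paper's actual objective and its adaptive weighting are not reproduced here.

```python
import math


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def soft_label_loss(label, slm_probs, llm_probs, alpha=0.5):
    """Mix ground-truth and SLM-provided supervision for one token.

    alpha = 0 recovers standard next-token training; alpha = 1 trains
    purely against the SLM's soft labels. (alpha is a hypothetical knob.)
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(llm_probs))]
    hard_term = cross_entropy(one_hot, llm_probs)
    soft_term = cross_entropy(slm_probs, llm_probs)
    return (1 - alpha) * hard_term + alpha * soft_term


# Toy three-token vocabulary.
llm_probs = [0.5, 0.3, 0.2]
slm_probs = [0.6, 0.3, 0.1]
loss = soft_label_loss(label=0, slm_probs=slm_probs, llm_probs=llm_probs, alpha=0.5)
```

The weight on the soft term is where a bias-variance balance of the kind the abstract mentions would enter: a larger `alpha` leans more heavily on the SLM's possibly biased distribution.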

large language model, machine learning, natural language, (17 more...)

2410.18779

Country:

Europe (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Faster Cascades via Speculative Decoding

Narasimhan, Harikrishna, Jitkrittum, Wittawat, Rawat, Ankit Singh, Kim, Seungyeon, Gupta, Neha, Menon, Aditya Krishna, Kumar, Sanjiv

arXiv.org Artificial Intelligence, May-29-2024

Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades are often capable of yielding better quality than even the larger model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Through experiments with T5 models on benchmark language tasks, we show that the proposed approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
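
For readers unfamiliar with the quality-neutrality guarantee mentioned above, the sketch below shows the standard token-level verification step of speculative sampling: a token drafted from the small model's distribution `q_small` is accepted with probability min(1, p/q), and otherwise a replacement is drawn from the normalised positive part of `p_large - q_small`. This is the generic speculative decoding rule, not the speculative cascade deferral rule proposed in the paper, and the function and variable names are chosen for this example only.

```python
import random


def verify_draft_token(token, p_large, q_small, rng=random):
    """Accept or replace one drafted token so the result is distributed as p_large.

    token must have been sampled from q_small, so q_small[token] > 0.
    """
    accept_prob = min(1.0, p_large[token] / q_small[token])
    if rng.random() < accept_prob:
        return token
    # Rejected: resample from the residual distribution max(0, p - q), normalised.
    residual = [max(0.0, p - q) for p, q in zip(p_large, q_small)]
    return rng.choices(range(len(residual)), weights=residual)[0]


# Toy check: draft from q, verify against p, and tally the emitted tokens.
rng = random.Random(0)
p_large = [0.7, 0.2, 0.1]
q_small = [0.2, 0.5, 0.3]
counts = [0, 0, 0]
num_samples = 20000
for _ in range(num_samples):
    draft = rng.choices(range(3), weights=q_small)[0]
    counts[verify_draft_token(draft, p_large, q_small, rng)] += 1
frequencies = [c / num_samples for c in counts]
```

The empirical `frequencies` should be close to `p_large`, which is the sense in which the combined procedure leaves the larger model's output distribution unchanged.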

large language model, machine learning, natural language, (19 more...)

2405.19261

Country: North America > United States > Maryland (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Cascade-Aware Training of Language Models

Wang, Congchao, Augenstein, Sean, Rush, Keith, Jitkrittum, Wittawat, Narasimhan, Harikrishna, Rawat, Ankit Singh, Menon, Aditya Krishna, Go, Alec

arXiv.org Artificial Intelligence, May-29-2024

Reducing serving cost and latency is a fundamental concern for the deployment of language models (LMs) in business applications. To address this, cascades of LMs offer an effective solution that conditionally employs smaller models for simpler queries. Cascaded systems are typically built with independently trained models, neglecting the advantages of considering inference-time interactions of the cascaded LMs during training. In this paper, we present cascade-aware training (CAT), an approach to optimizing the overall quality-cost performance tradeoff of a cascade of LMs. We achieve inference-time benefits by training the small LM with awareness of its place in a cascade and downstream capabilities. We demonstrate the value of the proposed method with over 60 LM tasks of the SuperGLUE, WMT22, and FLAN2021 datasets.

large language model, machine learning, natural language, (19 more...)

2406.0006

Country:

North America > United States > Texas (0.14)
North America > United States > California (0.14)
North America > United States > Arizona (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Language Model Cascades: Token-level uncertainty and beyond

Gupta, Neha, Narasimhan, Harikrishna, Jitkrittum, Wittawat, Rawat, Ankit Singh, Menon, Aditya Krishna, Kumar, Sanjiv

arXiv.org Artificial Intelligence, Apr-15-2024

Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs: here, a small model is invoked for most "easy" instances, while a few "hard" instances are deferred to the large model. While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across examples. To mitigate this issue, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.
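
The length bias described in this abstract can be seen with a small numeric example. The sketch below compares two simple aggregations of per-token probabilities: the summed log-probability of the whole sequence and the per-token mean. The probabilities are made up for illustration, and neither aggregation is the learned post-hoc deferral rule the paper proposes.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of the whole output: the sum of token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def mean_token_log_prob(token_probs):
    """Length-normalised alternative: the average token log-probability."""
    return sequence_log_prob(token_probs) / len(token_probs)


# A short output with uncertain tokens versus a long output with confident tokens.
short_uncertain = [0.6, 0.6]
long_confident = [0.9] * 10

scores = {
    "sum": (sequence_log_prob(short_uncertain), sequence_log_prob(long_confident)),
    "mean": (mean_token_log_prob(short_uncertain), mean_token_log_prob(long_confident)),
}
```

Under the summed score the long output looks less confident than the short one even though every one of its tokens is more probable, so a deferral threshold on that score would penalise long answers. The mean reverses the ranking here, but any single fixed aggregation discards information that the full vector of token-level uncertainties retains.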

large language model, machine learning, natural language, (20 more...)

2404.10136

Country:

Europe (1.00)
Asia (0.67)
North America > United States > California (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.50)

Industry:

Education (0.67)
Media (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

It's an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep Models

Chen, Lin, Lukasik, Michal, Jitkrittum, Wittawat, You, Chong, Kumar, Sanjiv

arXiv.org Machine Learning, Oct-13-2023

The concepts of bias and variance, obtained from decomposing the generalization error, are of fundamental importance in machine learning. Classical wisdom suggests that there is a trade-off between bias and variance: models of low capacity have high bias and low variance, while models of high capacity have low bias and high variance. This understanding served as an important guiding principle for developing generalizable machine learning models, suggesting that they should be neither too large nor too small [Bishop, 2006]. Recently, a line of research found that deep models defy this classical wisdom [Belkin et al., 2019]: their variance curves exhibit a unimodal shape that first increases with model size, then decreases beyond the point that the models can perfectly fit the training data [Neal et al., 2018, Yang et al., 2020]. While the unimodal variance curve explains why over-parameterized deep models generalize well, there is still a lack of understanding of why it occurs. This paper revisits the study of bias and variance to understand their behavior in deep models. We perform a per-sample measurement of bias and variance in popular deep classification models. Our study reveals a curious phenomenon, which is radically different from the classical tradeoff perspective on bias-variance, while being concordant with more recent works [Belkin et al., 2019, Hastie et al., 2022, Mei and Montanari, 2022].

artificial intelligence, machine learning, variance, (20 more...)

2310.0925

Country: North America > United States (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Plugin estimators for selective classification with out-of-distribution detection

Narasimhan, Harikrishna, Menon, Aditya Krishna, Jitkrittum, Wittawat, Kumar, Sanjiv

arXiv.org Artificial Intelligence, Jul-24-2023

Real-world classifiers can benefit from the option of abstaining from predicting on samples where they have low confidence. Such abstention is particularly useful on samples which are close to the learned decision boundary, or which are outliers with respect to the training sample. These settings have been the subject of extensive but disjoint study in the selective classification (SC) and out-of-distribution (OOD) detection literature. Recent work on selective classification with OOD detection (SCOD) has argued for the unified study of these problems; however, the formal underpinnings of this problem are still nascent, and existing techniques are heuristic in nature. In this paper, we propose new plugin estimators for SCOD that are theoretically grounded, effective, and generalise existing approaches from the SC and OOD detection literature. In the course of our analysis, we formally explicate how naïve use of existing SC and OOD detection baselines may be inadequate for SCOD. We empirically demonstrate that our approaches yield competitive SC and OOD detection performance compared to baselines from both literatures.

artificial intelligence, detection, machine learning, (19 more...)

2301.12386

Country:

North America > United States > California (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

When Does Confidence-Based Cascade Deferral Suffice?

Jitkrittum, Wittawat, Gupta, Neha, Menon, Aditya Krishna, Narasimhan, Harikrishna, Rawat, Ankit Singh, Kumar, Sanjiv

arXiv.org Artificial Intelligence, Jul-6-2023

Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
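
The confidence-based deferral rule described in this abstract can be sketched in a few lines for a two-model cascade: the small classifier's maximum softmax probability is compared with a threshold, and the large classifier is invoked only when the small one is not confident enough. The threshold value and the toy models below are assumptions for illustration; this is the baseline rule the paper analyses, not its post-hoc alternatives.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(x, small_model, large_model, threshold=0.8):
    """Return (predicted_class, model_used) for a two-model cascade.

    small_model and large_model are callables mapping an input to logits.
    """
    probs = softmax(small_model(x))
    confidence = max(probs)  # maximum predicted softmax probability
    if confidence >= threshold:
        return probs.index(confidence), "small"
    # Defer: the rule looks only at the small model's confidence and
    # ignores how accurate the large model would be on this input.
    large_probs = softmax(large_model(x))
    return large_probs.index(max(large_probs)), "large"


# Hypothetical stand-in models: the small one is confident only on "easy" inputs.
def toy_small(x):
    return [4.0, 0.0, 0.0] if x == "easy" else [0.1, 0.0, 0.0]


def toy_large(x):
    return [0.0, 5.0, 0.0]


easy_result = cascade_predict("easy", toy_small, toy_large)
hard_result = cascade_predict("hard", toy_small, toy_large)
```

The comment in the deferral branch marks the limitation the abstract points to: if the large model is a specialist that is also wrong on the deferred input, the cascade pays the extra cost without any gain in accuracy.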

confidence-based deferral, machine learning, natural language, (19 more...)

2307.02764

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Kim, Seungyeon, Rawat, Ankit Singh, Zaheer, Manzil, Jayasumana, Sadeep, Sadhanala, Veeranjaneyulu, Jitkrittum, Wittawat, Menon, Aditya Krishna, Fergus, Rob, Kumar, Sanjiv

arXiv.org Artificial Intelligence, Jul-3-2023

Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. Unlike existing teacher score-based distillation methods, our proposed approach employs embedding matching tasks to provide a stronger signal to align the representations of the teacher and student models. In addition, it utilizes query generation to explore the data manifold to reduce the discrepancies between the student and the teacher where training data is sparse. Furthermore, our analysis also motivates novel asymmetric architectures for student models which realize better embedding alignment without increasing online inference cost. On standard benchmarks like MSMARCO, we show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.

distillation, information retrieval, machine learning, (18 more...)

2301.12005

Country:

Europe (0.67)
Asia > Middle East > UAE (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.64)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
