AITopics | Kobyzev, Ivan

Collaborating Authors

Kobyzev, Ivan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ReGLA: Refining Gated Linear Attention

Lu, Peng, Kobyzev, Ivan, Rezagholizadeh, Mehdi, Chen, Boxing, Langlais, Philippe

arXiv.org Artificial IntelligenceFeb-5-2025

Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computation complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed our architecture outperforms previous Gated Linear Attention mechanisms in extensive tasks including training from scratch and post-linearization with continual pre-training.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.01578

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Hawaii (0.14)
Europe > Austria > Vienna (0.14)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity

Metel, Michael R., Lu, Peng, Chen, Boxing, Rezagholizadeh, Mehdi, Kobyzev, Ivan

arXiv.org Artificial IntelligenceOct-1-2024

We present a simple on the fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our light-weight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.

draft model, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2410.01028

Country: North America > Canada (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.43)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback

Resonance RoPE: Improving Context Length Generalization of Large Language Models

Wang, Suyuchen, Kobyzev, Ivan, Lu, Peng, Rezagholizadeh, Mehdi, Liu, Bang

arXiv.org Artificial IntelligenceJun-10-2024

This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving the model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD position better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2403.00071

Country:

Asia (0.28)
North America > Canada > Quebec (0.14)

Genre: Research Report > Promising Solution (0.34)

Industry: Energy (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

OTTAWA: Optimal TransporT Adaptive Word Aligner for Hallucination and Omission Translation Errors Detection

Huang, Chenyang, Ghaddar, Abbas, Kobyzev, Ivan, Rezagholizadeh, Mehdi, Zaiane, Osmar R., Chen, Boxing

arXiv.org Artificial IntelligenceJun-3-2024

Recently, there has been considerable attention on detecting hallucinations and omissions in Machine Translation (MT) systems. The two dominant approaches to tackle this task involve analyzing the MT system's internal states or relying on the output of external tools, such as sentence similarity or MT quality estimators. In this work, we introduce OTTAWA, a novel Optimal Transport (OT)-based word aligner specifically designed to enhance the detection of hallucinations and omissions in MT systems. Our approach explicitly models the missing alignments by introducing a "null" vector, for which we propose a novel one-side constrained OT setting to allow an adaptive null alignment. Our approach yields competitive results compared to state-of-the-art methods across 18 language pairs on the HalOmi benchmark. In addition, it shows promising features, such as the ability to distinguish between both error types and perform word-level detection without accessing the MT system's internal states.

artificial intelligence, computational linguistic, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.01919

Country: North America > Canada > Alberta (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Hyperparameter Optimization for Large Language Model Instruction-Tuning

Tribes, Christophe, Benarroch-Lelong, Sacha, Lu, Peng, Kobyzev, Ivan

arXiv.org Artificial IntelligenceJan-30-2024

The fine-tuning of Large Language Models (LLMs) has enabled them to recently achieve milestones in natural language processing applications. The emergence of ever larger LLMs has paved the way for more efficient fine-tuning methods. Among these, the Low-Rank Adaptation (LoRA) method keeps most of the weights of the pre-trained LLM frozen while introducing a low-rank decomposition of the weight matrix, enabling the tuning of only a very small proportion of the network. The performance on downstream tasks of models fine-tuned with LoRA heavily relies on a set of hyperparameters including the rank of the decomposition. In this work, we investigate the choice of these hyperparameters through two main blackbox optimization (BBO) techniques. We examine the whole pipeline of performing fine-tuning and validation on a pre-trained LLM as a blackbox and efficiently explore the space of hyperparameters with the \nomad algorithm, achieving a boost in performance and human alignment of the tuned model.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2312.00949

Country:

North America > Canada > Quebec (0.14)
North America > United States > Hawaii (0.14)
North America > United States > California (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Attribute Controlled Dialogue Prompting

Liu, Runcheng, Rashid, Ahmad, Kobyzev, Ivan, Rezagholizadeh, Mehdi, Poupart, Pascal

arXiv.org Artificial IntelligenceJul-11-2023

Prompt-tuning has become an increasingly popular parameter-efficient method for adapting large pretrained language models to downstream tasks. However, both discrete prompting and continuous prompting assume fixed prompts for all data samples within a task, neglecting the fact that inputs vary greatly in some tasks such as open-domain dialogue generation. In this paper, we present a novel, instance-specific prompt-tuning algorithm for dialogue generation. Specifically, we generate prompts based on instance-level control code, rather than the conversation history, to explore their impact on controlled dialogue generation. Experiments on popular open-domain dialogue datasets, evaluated on both automated metrics and human evaluation, demonstrate that our method is superior to prompting baselines and comparable to fine-tuning with only 5%-6% of total parameters.

controlled dp, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2307.05228

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization

Lu, Peng, Rashid, Ahmad, Kobyzev, Ivan, Rezagholizadeh, Mehdi, Langlais, Philippe

arXiv.org Artificial IntelligenceMay-8-2023

Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, batch/layer normalization to converge faster and generalize. Label Smoothing (LS) is another simple, versatile and efficient regularization which can be applied to various supervised classification tasks. Conventional LS, however, regardless of the training instance assumes that each non-target class is equally likely. In this work, we present a general framework for training with label regularization, which includes conventional LS but can also model instance-specific variants. Based on this formulation, we propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem. We derive a deterministic and interpretable solution of the inner loop as the optimal label smoothing without the need to store the parameters or the output of a trained model. Finally, we conduct extensive experiments and demonstrate our LABO consistently yields improvement over conventional label regularization on various fields, including seven machine translation and three image classification tasks across various

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.04971

Country:

North America > United States (1.00)
Europe (1.00)
Asia (0.93)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation

Valipour, Mojtaba, Rezagholizadeh, Mehdi, Kobyzev, Ivan, Ghodsi, Ali

arXiv.org Artificial IntelligenceApr-19-2023

With the ever-growing size of pretrained models (PMs), fine-tuning them has become more expensive and resource-hungry. As a remedy, low-rank adapters (LoRA) keep the main pretrained weights of the model frozen and just introduce some learnable truncated SVD modules (so-called LoRA blocks) to the model. While LoRA blocks are parameter-efficient, they suffer from two major problems: first, the size of these blocks is fixed and cannot be modified after training (for example, if we need to change the rank of LoRA blocks, then we need to re-train them from scratch); second, optimizing their rank requires an exhaustive search and effort. In this work, we introduce a dynamic low-rank adaptation (DyLoRA) technique to address these two problems together. Our DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank by sorting the representation learned by the adapter module at different ranks during training. We evaluate our solution on different natural language understanding (GLUE benchmark) and language generation tasks (E2E, DART and WebNLG) using different pretrained models such as RoBERTa and GPT with different sizes. Our results show that we can train dynamic search-free models with DyLoRA at least 4 to 7 times (depending to the task) faster than LoRA without significantly compromising performance. Moreover, our models can perform consistently well on a much larger range of ranks compared to LoRA.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2210.07558

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

Do we need Label Regularization to Fine-tune Pre-trained Language Models?

Kobyzev, Ivan, Jafari, Aref, Rezagholizadeh, Mehdi, Li, Tianda, Do-Omri, Alan, Lu, Peng, Poupart, Pascal, Ghodsi, Ali

arXiv.org Artificial IntelligenceApr-12-2023

Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work concerns studying different label regularization techniques and whether we actually need them to improve the fine-tuning of smaller PLM networks on downstream tasks. In this regard, we did a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials and ran each configuration five times. This investigation led to a surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, and additional label regularization is unnecessary.

experiment, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2205.12428

Country:

North America > Canada (0.46)
Europe (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Education > Educational Technology > Educational Software (0.54)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Mathematical Challenges in Deep Learning

Nia, Vahid Partovi, Zhang, Guojun, Kobyzev, Ivan, Metel, Michael R., Li, Xinlin, Sun, Ke, Hemati, Sobhan, Asgharian, Masoud, Kong, Linglong, Liu, Wulong, Chen, Boxing

arXiv.org Artificial IntelligenceMar-24-2023

Deep models are dominating the artificial intelligence (AI) industry since the ImageNet challenge in 2012. The size of deep models is increasing ever since, which brings new challenges to this field with applications in cell phones, personal computers, autonomous cars, and wireless base stations. Here we list a set of problems, ranging from training, inference, generalization bound, and optimization with some formalism to communicate these challenges with mathematicians, statisticians, and theoretical computer scientists. This is a subjective view of the research questions in deep learning that benefits the tech industry in long run.

artificial intelligence, machine learning, mathematical challenge, (16 more...)

arXiv.org Artificial Intelligence

2303.15464

Country:

North America > United States (0.46)
North America > Canada > Quebec > Montreal (0.15)

Genre:

Research Report > New Finding (0.34)
Research Report > Experimental Study (0.34)

Industry:

Transportation > Ground > Road (0.34)
Information Technology > Robotics & Automation (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback