Park, Jongho
Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
Kim, Junhyuck, Park, Jongho, Cho, Jaewoong, Papailiopoulos, Dimitris
We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated by sparse linear combinations over a small, input-agnostic dictionary of 4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low-memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
Figure 1: Memory usage vs. performance of Lexico compared to other key-value (KV) cache compression methods on GSM8K. The figure illustrates the relationship between KV cache size and the performance of Lexico with Llama models under GSM8K 5-shot evaluation. Lexico consistently outperforms both eviction-based methods (SnapKV, PyramidKV) and quantization-based methods (per-token quantization, KIVI, ZipCache).
Transformers (Vaswani et al., 2017) have become the backbone of frontier Large Language Models (LLMs), driving progress in domains beyond natural language processing. However, Transformers are typically limited by their significant memory requirements. This stems not only from the large number of model parameters, but also from having to maintain the KV cache, which grows in proportion to the model size (i.e., the number of layers, heads, and embedding dimension) and the token length of the input.
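To make the sparse-coding primitive concrete, here is a minimal NumPy sketch of orthogonal matching pursuit against a fixed dictionary; the dimensions, dictionary size, and sparsity level are illustrative assumptions, not the paper's settings.

import numpy as np

def omp(dictionary, target, sparsity):
    """Greedy orthogonal matching pursuit: approximate `target` as a sparse
    linear combination of columns (atoms) of `dictionary` (shape d x n_atoms)."""
    residual = target.copy()
    support = []
    for _ in range(sparsity):
        # Pick the atom most correlated with the current residual.
        correlations = dictionary.T @ residual
        support.append(int(np.argmax(np.abs(correlations))))
        # Re-fit coefficients on the selected atoms via least squares.
        sub = dictionary[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, target, rcond=None)
        residual = target - sub @ coeffs
    return support, coeffs

# Toy usage: approximate a single "key" vector with 8 of 4096 unit-norm atoms.
rng = np.random.default_rng(0)
d, n_atoms = 128, 4096
dictionary = rng.standard_normal((d, n_atoms))
dictionary /= np.linalg.norm(dictionary, axis=0)
key = rng.standard_normal(d)
support, coeffs = omp(dictionary, key, sparsity=8)
approx = dictionary[:, support] @ coeffs
print(np.linalg.norm(key - approx) / np.linalg.norm(key))  # relative error

Storing only the 8 indices and coefficients per vector, rather than all 128 entries, is what gives the direct control over the compression ratio mentioned above.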
Task Diversity Shortens the ICL Plateau
Kim, Jaeyeon, Kwon, Sehyun, Choi, Joo Young, Park, Jongho, Cho, Jaewoong, Lee, Jason D., Ryu, Ernest K.
In-context learning (ICL) describes a language model's ability to generate outputs based on a set of input demonstrations and a subsequent query. To understand this remarkable capability, researchers have studied simplified, stylized models. These studies have consistently observed long loss plateaus, during which models exhibit minimal improvement, followed by a sudden, rapid surge of learning. In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data.
Figure 1: We train a transformer from scratch on in-context learning tasks. Single-task ICL: training loss and test error/accuracy when each task is trained individually; the Parity task cannot be learned even after 1000k training steps. Multi-task ICL: training loss and test error/accuracy when all six tasks are trained simultaneously; green lines mark the plateau escape points.
In-context learning (ICL), first reported by Brown et al. (2020) with GPT-3, describes a language model's ability to generate outputs based on a set of input demonstrations and a subsequent query.
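To make the single-task vs. multi-task setup concrete, below is a minimal NumPy sketch of how multi-task ICL training sequences could be generated; the two task families, sequence length, and dimensions are illustrative assumptions, not the paper's exact configuration.

import numpy as np

def linear_regression_task(rng, n_examples, dim):
    # Each sequence uses a fresh weight vector w.
    w = rng.standard_normal(dim)
    X = rng.standard_normal((n_examples, dim))
    return X, X @ w

def sparse_parity_task(rng, n_examples, dim, k=2):
    # Labels are the parity of k secret coordinates.
    idx = rng.choice(dim, size=k, replace=False)
    X = rng.integers(0, 2, size=(n_examples, dim)).astype(float)
    return X, X[:, idx].sum(axis=1) % 2

TASKS = [linear_regression_task, sparse_parity_task]  # add other task families here

def sample_icl_sequence(rng, n_examples=32, dim=16):
    """Multi-task ICL: each training sequence is drawn from a randomly chosen
    task family, so one model is trained on all tasks simultaneously."""
    task = TASKS[rng.integers(len(TASKS))]
    X, y = task(rng, n_examples, dim)
    # Append each label to its example; a real setup would interleave x/y tokens.
    return np.concatenate([X, y[:, None]], axis=1)

rng = np.random.default_rng(0)
batch = np.stack([sample_icl_sequence(rng) for _ in range(8)])
print(batch.shape)  # (8, 32, 17)

Single-task training corresponds to restricting TASKS to one family; the finding above is that the mixed sampler escapes the loss plateau sooner than any single-family sampler.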
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Park, Jongho, Park, Jaeseung, Xiong, Zheyang, Lee, Nayoung, Cho, Jaewoong, Oymak, Samet, Lee, Kangwook, Papailiopoulos, Dimitris
State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing the individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.
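The hybrid design can be sketched structurally in PyTorch. This is only an illustrative layer that pairs a state-space block with self-attention; it is not the paper's exact architecture, and ssm_block is a placeholder for any Mamba implementation.

import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One hybrid layer: a state-space (Mamba-style) block followed by
    multi-head self-attention, each with a residual connection.
    `ssm_block` can be any module mapping (batch, seq, dim) -> (batch, seq, dim),
    e.g. a Mamba layer from an external implementation."""
    def __init__(self, dim, n_heads, ssm_block):
        super().__init__()
        self.ssm = ssm_block
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.ssm(self.norm1(x))                 # sequence mixing via SSM
        h = self.norm2(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out                             # retrieval via attention

# Shape check with a linear layer standing in for the SSM block.
block = HybridBlock(dim=64, n_heads=4, ssm_block=nn.Linear(64, 64))
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])

The intent of such a combination is that the SSM path handles tasks like parity while the attention path supplies the retrieval behavior SSMs alone lack.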
DualFL: A Duality-based Federated Learning Algorithm with Communication Acceleration in the General Convex Regime
Park, Jongho, Xu, Jinchao
We propose a new training algorithm, named DualFL (Dualized Federated Learning), for solving distributed optimization problems in federated learning. DualFL achieves communication acceleration for very general convex cost functions, thereby providing a solution to an open theoretical problem in federated learning concerning cost functions that may be neither smooth nor strongly convex. We provide a detailed analysis of the local iteration complexity of DualFL to ensure its overall computational efficiency. Furthermore, we introduce a completely new approach to the convergence analysis of federated learning based on a dual formulation. This new technique enables a concise and elegant analysis, in contrast to the complex calculations used in the existing literature on the convergence of federated learning algorithms.
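For reference, the primal problem that such federated training algorithms target can be written as follows; this is the standard formulation assumed here, and the abstract does not reproduce DualFL's dual reformulation.

\[
  \min_{w \in \mathbb{R}^d} \; F(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w),
\]

where client $i$ holds its own local cost $f_i$, assumed convex but not necessarily smooth or strongly convex, and communication rounds exchange only model updates rather than local data.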
Balanced Group Convolution: An Improved Group Convolution Based on Approximability Estimates
Lee, Youngkyu, Park, Jongho, Lee, Chang-Ock
The performance of neural networks has been significantly improved by increasing the number of channels in convolutional layers. However, this increase in performance comes at a higher computational cost, which has motivated numerous studies on reducing it. One promising approach is group convolution, which effectively reduces the computational cost by grouping channels. However, to the best of our knowledge, there has been no theoretical analysis of how well group convolution approximates standard convolution. In this paper, we mathematically analyze the approximation of standard convolution by group convolution with respect to the number of groups. Furthermore, we propose a novel variant of group convolution, called balanced group convolution, which achieves a tighter approximation at a small additional computational cost. We provide experimental results that validate our theoretical findings and demonstrate the superior performance of the balanced group convolution over other variants of group convolution.
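As a rough illustration of why grouping reduces cost, the PyTorch snippet below compares parameter counts of a standard and a grouped convolution, and sketches one hypothetical way (a cheap pointwise mixing layer, not the paper's balancing construction) to restore information flow across groups.

import torch
import torch.nn as nn

c_in, c_out, k, groups = 256, 256, 3, 8
standard = nn.Conv2d(c_in, c_out, k, padding=1)
grouped = nn.Conv2d(c_in, c_out, k, padding=1, groups=groups)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Grouping cuts the conv weights roughly by the number of groups.
print(n_params(standard), n_params(grouped))  # 590080 vs. 73984

# Hypothetical cross-group mixing: a 1x1 convolution after the grouped conv,
# so every output channel sees information from all groups.
mixed = nn.Sequential(grouped, nn.Conv2d(c_out, c_out, kernel_size=1))
x = torch.randn(1, c_in, 32, 32)
print(mixed(x).shape)  # torch.Size([1, 256, 32, 32])

The paper's balanced group convolution pursues the same goal, reintroducing inter-group information at small extra cost, with an accompanying approximation guarantee.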
Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions
Diakonikolas, Ilias, Karmalkar, Sushrut, Park, Jongho, Tzamos, Christos
We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, the noisy labels are of the form $y = g(w^* \cdot x) + \xi + \epsilon$, where $\xi$ is the oblivious noise drawn independently of $x$ and satisfies $\Pr[\xi = 0] \geq o(1)$, and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Our goal is to accurately recover a parameter vector $w$ such that the function $g(w \cdot x)$ has arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles this problem in its most general distribution-independent setting, where the solution may not even be identifiable. Our algorithm returns an accurate estimate of the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we provide a necessary and sufficient condition for identifiability, which holds in broad settings. Specifically, the problem is identifiable when the quantile at which $\xi + \epsilon = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first algorithmic result for GLM regression with oblivious noise which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions.
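The observation model above can be simulated directly; the link function, corruption distribution, and constants in this NumPy sketch are illustrative assumptions rather than anything specified by the paper.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 10_000, 5, 0.1
alpha = 0.2                     # fraction of uncorrupted labels: Pr[xi = 0] >= alpha

g = np.tanh                     # illustrative monotone link function g
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))

# Oblivious noise: drawn independently of X, zero with probability alpha,
# otherwise an arbitrary (here heavy-tailed) corruption.
xi = np.where(rng.random(n) < alpha, 0.0, 10.0 * rng.standard_cauchy(n))
eps = rng.normal(0.0, sigma, n)
y = g(X @ w_star) + xi + eps    # observed labels: most samples may be corrupted

The point of the result is that even when the majority of labels carry such arbitrary corruptions, an estimate of $w^*$ (or a short candidate list) can still be recovered.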
Prompted LLMs as Chatbot Modules for Long Open-domain Conversation
Lee, Gibbeum, Hartmann, Volker, Park, Jongho, Papailiopoulos, Dimitris, Lee, Kangwook
In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning. Our method utilizes pre-trained large language models (LLMs) as individual modules for long-term consistency and flexibility, employing techniques such as few-shot prompting, chain-of-thought (CoT) prompting, and external memory. Our human evaluation results show that MPC is on par with fine-tuned chatbot models in open-domain conversations, making it an effective solution for creating consistent and engaging chatbots.
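A minimal sketch of such a modular pipeline is shown below; call_llm is a hypothetical placeholder for any chat-completion API, and the three modules (memory summarization, chain-of-thought reasoning, response generation) are an illustrative decomposition, not the exact MPC module set.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM API call; plug in a real client here.
    raise NotImplementedError("connect to an LLM API of your choice")

class ModularChatbot:
    def __init__(self, persona: str):
        self.persona = persona
        self.memory: list[str] = []   # external long-term memory of the dialogue

    def respond(self, user_message: str) -> str:
        # Module 1: compress the running conversation into a memory summary.
        summary = call_llm("Summarize this conversation:\n" + "\n".join(self.memory)) if self.memory else ""
        # Module 2: chain-of-thought reasoning about what to say next.
        thought = call_llm(
            f"{self.persona}\nMemory: {summary}\nUser: {user_message}\n"
            "Think step by step about an appropriate reply."
        )
        # Module 3: generate the final utterance conditioned on the reasoning.
        reply = call_llm(
            f"{self.persona}\nMemory: {summary}\nUser: {user_message}\n"
            f"Reasoning: {thought}\nAssistant:"
        )
        self.memory += [f"User: {user_message}", f"Assistant: {reply}"]
        return reply

Keeping each responsibility in its own prompted module is what allows a fixed, non-fine-tuned LLM to stay consistent over long open-domain conversations.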
ReLU Regression with Massart Noise
Diakonikolas, Ilias, Park, Jongho, Tzamos, Christos
We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is at most $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.
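The Massart noise model described above can be instantiated in a few lines of NumPy; the data distribution, corruption strategy, and constants are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 5_000, 10, 0.4            # eta < 1/2: per-point corruption probability

w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))        # a distribution with mild anti-concentration
clean = np.maximum(X @ w_star, 0.0)    # realizable ReLU labels

# Massart adversary: each label may be replaced arbitrarily, but only with
# probability at most eta per point (here, one crude adversarial choice).
corrupt = rng.random(n) < eta
y = np.where(corrupt, -clean - 1.0, clean)

Exact recovery of $w^*$ from such $(X, y)$ is the guarantee the algorithm provides, whereas plain $\ell_1$ or $\ell_2$ regression on the corrupted labels is biased by the adversarial points.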
Parareal Neural Networks Emulating a Parallel-in-time Algorithm
Lee, Chang-Ock, Lee, Youngkyu, Park, Jongho
As deep neural networks (DNNs) become deeper, the training time increases. From this perspective, multi-GPU parallel computing has become a key tool for accelerating the training of DNNs. In this paper, we introduce a novel methodology to construct, from a given DNN, a parallel neural network that can utilize multiple GPUs simultaneously. We observe that the layers of a DNN can be interpreted as the time steps of a time-dependent problem and can be parallelized by emulating a parallel-in-time algorithm called parareal. The parareal algorithm consists of fine structures, which can be implemented in parallel, and a coarse structure, which gives suitable approximations to the fine structures. By emulating it, the layers of the DNN are split apart to form a parallel structure, which is connected using a suitable coarse network. We report results showing that the proposed methodology accelerates training while preserving accuracy when applied to VGG-16 and ResNet-1001 on several datasets.
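A structural PyTorch sketch of the idea is given below, under the simplifying assumption that the coarse network only predicts each sub-network's input; the precise parareal coupling and correction terms from the paper are not reproduced.

import torch
import torch.nn as nn

class ParallelStack(nn.Module):
    """Structural sketch: the layers of a deep network are split into `fine`
    sub-networks, and a cheap sequential `coarse` network predicts the input of
    each sub-network so the fine parts can be evaluated concurrently
    (e.g., each on its own GPU)."""
    def __init__(self, fine_blocks, coarse_blocks):
        super().__init__()
        assert len(coarse_blocks) == len(fine_blocks) - 1  # one coarse link per gap
        self.fine = nn.ModuleList(fine_blocks)     # expensive, parallelizable
        self.coarse = nn.ModuleList(coarse_blocks) # cheap, sequential

    def forward(self, x):
        # Sequential coarse sweep: cheap approximations of intermediate states.
        approx_inputs, h = [x], x
        for c in self.coarse:
            h = c(h)
            approx_inputs.append(h)
        # Fine sweep: each call is independent given its coarse approximation,
        # so these could be dispatched to different devices.
        outputs = [f(z) for f, z in zip(self.fine, approx_inputs)]
        return outputs[-1]

# Toy usage with 3 fine sub-networks and 2 coarse links on width-64 vectors.
fine = [nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)) for _ in range(3)]
coarse = [nn.Linear(64, 64) for _ in range(2)]
model = ParallelStack(fine, coarse)
print(model(torch.randn(8, 64)).shape)  # torch.Size([8, 64])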