AITopics | Onak, Krzysztof

Collaborating Authors

Onak, Krzysztof

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Compression Barriers for Autoregressive Transformers

Haris, Themistoklis, Onak, Krzysztof

arXiv.org Artificial IntelligenceFeb-21-2025

A key limitation of autoregressive Transformers is the large memory needed at inference-time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d = \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Linderstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(d\cdot e^d)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation's time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.15955

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Communication-Efficient Distributed Learning of Discrete Distributions

Diakonikolas, Ilias, Grigorescu, Elena, Li, Jerry, Natarajan, Abhiram, Onak, Krzysztof, Schmidt, Ludwig

Neural Information Processing SystemsFeb-14-2020, 18:56:25 GMT

artificial intelligence, discrete distribution, machine learning, (4 more...)

Neural Information Processing Systems

Genre: Research Report (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.69)

Add feedback

Communication-Efficient Distributed Learning of Discrete Distributions

Diakonikolas, Ilias, Grigorescu, Elena, Li, Jerry, Natarajan, Abhiram, Onak, Krzysztof, Schmidt, Ludwig

Neural Information Processing SystemsDec-31-2017

We initiate a systematic investigation of distribution learning (density estimation) when the data is distributed across multiple servers. The servers must communicate with a referee and the goal is to estimate the underlying distribution with as few bits of communication as possible. We focus on non-parametric density estimation of discrete distributions with respect to the l1 and l2 norms. We provide the first non-trivial upper and lower bounds on the communication complexity of this basic estimation task in various settings of interest. Specifically, our results include the following: 1. When the unknown discrete distribution is unstructured and each server has only one sample, we show that any blackboard protocol (i.e., any protocol in which servers interact arbitrarily using public messages) that learns the distribution must essentially communicate the entire sample. 2. For the case of structured distributions, such as k-histograms and monotone distributions, we design distributed learning algorithms that achieve significantly better communication guarantees than the naive ones, and obtain tight upper and lower bounds in several regimes. Our distributed learning algorithms run in near-linear time and are robust to model misspecification. Our results provide insights on the interplay between structure and communication efficiency for a range of fundamental distribution estimation tasks.

artificial intelligence, communication, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback