AITopics | unigram distribution

Collaborating Authors

unigram distribution

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Two Counterexamples to Tokenization and the Noiseless Channel

Cognetta, Marco, Zouhar, Vilém, Moon, Sangwhan, Okazaki, Naoaki

arXiv.org Artificial IntelligenceFeb-29-2024

In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), R\'enyi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest R\'enyi efficiency of the unigram distribution should be chosen. The R\'enyi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task), without the expensive step of training multiple models with different tokenizers. Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that R\'enyi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase R\'enyi efficiency while decreasing the downstream model performance. These counterexamples expose cases where R\'enyi efficiency fails as an intrinsic tokenization metric and thus give insight for building more accurate predictors.

efficiency, nyi efficiency, tokenizer, (15 more...)

arXiv.org Artificial Intelligence

2402.14614

Country:

North America > Canada > Ontario > Toronto (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
(3 more...)

Genre:

Research Report (0.64)
Summary/Review (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

Add feedback

A Geometric Notion of Causal Probing

Guerner, Clément, Svete, Anej, Liu, Tianyu, Warstadt, Alexander, Cotterell, Ryan

arXiv.org Artificial IntelligenceJul-30-2023

Large language models rely on real-valued representations of text to make their predictions. These representations contain information learned from the data that the model has trained on, including knowledge of linguistic properties and forms of demographic bias, e.g., based on gender. A growing body of work has considered removing information about concepts such as these using orthogonal projections onto subspaces of the representation space. We contribute to this body of work by proposing a formal definition of $\textit{intrinsic}$ information in a subspace of a language model's representation space. We propose a counterfactual approach that avoids the failure mode of spurious correlations (Kumar et al., 2022) by treating components in the subspace and its orthogonal complement independently. We show that our counterfactual notion of information in a subspace is optimized by a $\textit{causal}$ concept subspace. Furthermore, this intervention allows us to attempt concept controlled generation by manipulating the value of the conceptual component of a representation. Empirically, we find that R-LACE (Ravfogel et al., 2022) returns a one-dimensional subspace containing roughly half of total concept information under our framework. Our causal controlled intervention shows that, for at least one model, the subspace returned by R-LACE can be used to manipulate the concept value of the generated word with precision.

information, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2307.15054

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Add feedback

Tokenization and the Noiseless Channel

Zouhar, Vilém, Meister, Clara, Gastaldi, Juan Luis, Du, Li, Sachan, Mrinmaya, Cotterell, Ryan

arXiv.org Artificial IntelligenceJun-29-2023

Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to \emph{efficient} channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of R\'enyi entropy, on the other hand, penalizes distributions with either very high or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for compressed length.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2306.16842

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.90)

Add feedback

A Natural Bias for Language Generation Models

Meister, Clara, Stokowiec, Wojciech, Pimentel, Tiago, Yu, Lei, Rimell, Laura, Kuncoro, Adhiguna

arXiv.org Artificial IntelligenceJun-23-2023

After just a few hundred training updates, a standard probabilistic model for language generation has likely not yet learnt many semantic or syntactic rules of natural language, making it difficult to estimate the probability distribution over next tokens. Yet around this point, these models have identified a simple, loss-minimising behaviour: to output the unigram distribution of the target training corpus. The use of such a heuristic raises the question: Can we initialise our models with this behaviour and save precious compute resources and model capacity? Here we show that we can effectively endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge, simply by initialising the bias term in a model's final linear layer with the log-unigram distribution. We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and perhaps most importantly (iii) appears to disentangle strong frequency effects by encouraging the model to specialise in non-frequency-related aspects of language.

computational linguistic, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2212.09686

Country:

North America > Dominican Republic (0.04)
Europe > Italy > Sardinia (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(13 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.90)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

SAS: Self-Augmented Strategy for Language Model Pre-training

Xu, Yifei, Zhang, Jingqiao, He, Ru, Ge, Liangzhu, Yang, Chao, Yang, Cheng, Wu, Ying Nian

arXiv.org Artificial IntelligenceJun-14-2021

The core of a self-supervised learning method for pre-training language models includes the design of appropriate data augmentation and corresponding pre-training task(s). Most data augmentations in language model pre-training are context-independent. The seminal contextualized augmentation recently proposed by the ELECTRA requires a separate generator, which leads to extra computation cost as well as the challenge in adjusting the capability of its generator relative to that of the other model component(s). We propose a self-augmented strategy (SAS) that uses a single forward pass through the model to augment the input data for model training in the next epoch. Essentially our strategy eliminates a separate generator network and uses only one network to generate the data augmentation and undertake two pre-training tasks (the MLM task and the RTD task) jointly, which naturally avoids the challenge in adjusting the generator's capability as well as reduces the computation cost. Additionally, our SAS is a general strategy such that it can seamlessly incorporate many new techniques emerging recently or in the future, such as the disentangled attention mechanism recently proposed by the DeBERTa model. Our experiments show that our SAS is able to outperform the ELECTRA and other state-of-the-art models in the GLUE tasks with the same or less computation cost.

electra-small, sas, score 8, (12 more...)

arXiv.org Artificial Intelligence

2106.07176

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.54)

Add feedback

Federated Marginal Personalization for ASR Rescoring

Liu, Zhe, Peng, Fuchun

arXiv.org Machine LearningDec-1-2020

We introduce federated marginal personalization (FMP), a novel method for continuously updating personalized neural network language models (NNLMs) on private devices using federated learning (FL). Instead of fine-tuning the parameters of NNLMs on personal data, FMP regularly estimates global and personalized marginal distributions of words, and adjusts the probabilities from NNLMs by an adaptation factor that is specific to each word. Our presented approach can overcome the limitations of federated fine-tuning and efficiently learn personalized NNLMs on devices. We study the application of FMP on second-pass ASR rescoring tasks. Experiments on two speech evaluation datasets show modest word error rate (WER) reductions. We also demonstrate that FMP could offer reasonable privacy with only a negligible cost in speech recognition accuracy.

language model, nnlm, unigram distribution, (15 more...)

arXiv.org Machine Learning

2012.00898

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > San Mateo County > Menlo Park (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

KL Sum algorithm for text summarization

#artificialintelligenceOct-17-2019, 17:59:18 GMT

This quantity represents the divergence between true distribution P (here document set unigram) and the approximating distribution Q (the summary S distribution). This summarization method finds a set of summary sentences which closely match the document set unigram distribution.

artificial intelligence, divergence, natural language, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Natural Language (0.67)

Add feedback

word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method

Goldberg, Yoav, Levy, Omer

arXiv.org Machine LearningFeb-15-2014

The word2vec software of Tomas Mikolov and colleagues (https://code.google.com/p/word2vec/ ) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations. This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

1402.3722

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback