
 Stollenwerk, Felix


The Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions

arXiv.org Artificial Intelligence

A recent paper proposes Dynamic Tanh (DyT) as a drop-in replacement for layer normalization (LN). Although the method is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we shed light on the mathematical relationship between layer normalization and dynamic activation functions. In particular, we derive DyT from LN and show that a well-defined approximation is needed to do so. By dropping said approximation, an alternative activation function is obtained, which we call Dynamic Inverse Square Root Unit (DyISRU). DyISRU is the exact counterpart of layer normalization, and we demonstrate numerically that it indeed resembles LN more accurately than DyT does.
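For intuition, here is a minimal numerical sketch of the functions being compared: standard layer normalization, Dynamic Tanh, and an ISRU-style dynamic activation shown as a stand-in for DyISRU. The scale parameters and the exact DyISRU parameterization are illustrative assumptions, not the paper's definitions.

    # Illustrative sketch (not the paper's code): the three element-wise maps
    # compared above, applied to one random activation vector.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=8)

    def layer_norm(x, eps=1e-5):
        # Standard layer normalization, without the learnable affine transform.
        return (x - x.mean()) / np.sqrt(x.var() + eps)

    def dyt(x, alpha=1.0):
        # Dynamic Tanh: element-wise tanh with a scalar scale alpha.
        return np.tanh(alpha * x)

    def dyisru(x, alpha=1.0):
        # ISRU-style dynamic activation, used here as a stand-in for DyISRU;
        # the exact form derived in the paper may be parameterized differently.
        return x / np.sqrt(alpha + x ** 2)

    print(layer_norm(x))
    print(dyt(x))
    print(dyisru(x))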


Better Embeddings with Coupled Adam

arXiv.org Artificial Intelligence

Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.
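A rough sketch of the idea follows, assuming that "coupling" amounts to sharing one second-moment estimate across all embedding vectors instead of keeping a per-parameter estimate; the paper's exact update rule may differ in its details.

    # Sketch only, not the authors' implementation: an Adam step for an
    # embedding matrix with an optional "coupled" second moment.
    import numpy as np

    def adam_embedding_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
                            eps=1e-8, coupled=False):
        # p, g, m, v: embedding matrix, its gradient, and the Adam moment
        # estimates, all of shape (vocab_size, embedding_dim).
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        v_eff = v
        if coupled:
            # Share (couple) the second-moment estimate across the vocabulary,
            # so every embedding vector sees the same effective learning rate.
            v_eff = np.broadcast_to(v.mean(axis=0, keepdims=True), v.shape)
        m_hat = m / (1 - b1 ** t)
        v_hat = v_eff / (1 - b2 ** t)
        p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
        return p, m, v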


nerblackbox: A High-level Library for Named Entity Recognition in Python

arXiv.org Artificial Intelligence

We present nerblackbox, a Python library that facilitates the use of state-of-the-art transformer-based models for named entity recognition. It provides simple-to-use yet powerful methods to access data and models from a wide range of sources, for fully automated model training and evaluation as well as versatile model inference. While many technical challenges are solved and hidden from the user by default, nerblackbox also offers fine-grained control and a rich set of customizable features. It is thus targeted both at application-oriented developers and at machine learning experts and researchers.
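For context, the snippet below shows roughly the kind of low-level workflow such a library hides. It uses plain Hugging Face transformers and an arbitrary public NER checkpoint; it is not nerblackbox's own API.

    # Baseline NER inference with Hugging Face transformers, shown only to
    # illustrate what a high-level library can abstract away; this is not
    # nerblackbox's API.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model="dslim/bert-base-NER",      # arbitrary public NER checkpoint
        aggregation_strategy="simple",    # merge word pieces into entity spans
    )
    print(ner("Felix Stollenwerk works on NLP in Stockholm."))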


Text Annotation Handbook: A Practical Guide for Machine Learning Projects

arXiv.org Artificial Intelligence

This handbook is a hands-on guide on how to approach text annotation tasks. It provides a gentle introduction to the topic, an overview of theoretical concepts as well as practical advice. The topics covered are mostly technical, but business, ethical and regulatory issues are also touched upon. The focus lies on readability and conciseness rather than completeness and scientific rigor. Experience with annotation and knowledge of machine learning are useful but not required. The document may serve as a primer or reference book for a wide range of professions such as team leaders, project managers, IT architects, software developers and machine learning engineers.


Annotated Job Ads with Named Entity Recognition

arXiv.org Artificial Intelligence

We have trained a named entity recognition (NER) model that screens Swedish job ads for different kinds of useful information (e.g. skills required of a job seeker). It was obtained by fine-tuning KB-BERT. The biggest challenge we faced was the creation of a labelled dataset, which required manual annotation. This paper gives an overview of the methods we employed to make the annotation process more efficient and to ensure high-quality data. We also report on the performance of the resulting model.
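A minimal sketch of the underlying setup, assuming standard Hugging Face token classification with the public KB-BERT checkpoint and a hypothetical label set; this is not the authors' training code, and fine-tuning on the annotated job ads is omitted.

    # Sketch: loading KB-BERT for token classification. The labels are
    # hypothetical, and the model is not yet fine-tuned for this task.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    labels = ["O", "B-SKILL", "I-SKILL"]  # hypothetical skill labels
    tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "KB/bert-base-swedish-cased", num_labels=len(labels)
    )

    # One forward pass over a Swedish job-ad snippet; fine-tuning on the
    # manually annotated dataset would follow the usual recipe (omitted here).
    enc = tokenizer("Vi söker en utvecklare med erfarenhet av Python.",
                    return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    print(logits.argmax(dim=-1))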


GPT-SW3: An Autoregressive Language Model for the Nordic Languages

arXiv.org Artificial Intelligence

There is a growing interest in building and applying Large Language Models (LLMs) for languages other than English. This interest has been fuelled partly by the unprecedented popularity of ChatGPT. ... We have faced all of these challenges in our work on developing the first native LLM for the Nordic (or, more accurately, North Germanic) languages. The LLM, which we call GPT-SW3, is a continuation of our previous Swedish-only model (Ekgren et al., 2022), and is a collection of ...


Training and Evaluation of a Multilingual Tokenizer for GPT-SW3

arXiv.org Artificial Intelligence

Generative language models are pre-trained on large amounts of raw text data. Virtually all language model architectures require the text data to be tokenized, which means that a text string is split into a sequence of tokens and subsequently mapped to a sequence of integers (Figure 1: text preprocessing for language models, simplified). The first step is referred to as tokenization, although sometimes both steps are embraced by the same term. Note that a special character is used to represent whitespace in the tokenized example (more on this in Sec. 3). Modern subword tokenizers are designed such that frequently used words are not decomposed, while rare words are split into meaningful tokens.
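To make the two preprocessing steps concrete, here is a small example using the GPT-2 tokenizer as a readily available stand-in (not the GPT-SW3 tokenizer trained in the paper):

    # Sketch of the two steps described above: string -> tokens -> integers.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")

    text = "Tokenization of uncommon words"
    tokens = tok.tokenize(text)              # step 1: tokenization
    ids = tok.convert_tokens_to_ids(tokens)  # step 2: mapping to integers
    print(tokens)
    print(ids)
    # A frequent word such as "of" typically stays intact, while a rarer word
    # like "Tokenization" is split into several subword tokens.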