 character encoder


The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

arXiv.org Artificial Intelligence

Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
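
To make the tokenization limitation concrete, here is a minimal Python sketch contrasting what a character-level view and a subword-level view expose for the letter-counting task. The segmentation `toy_tokens` and the IDs in `toy_vocab` are hypothetical and do not come from any real tokenizer or from the paper.

```python
# Toy illustration of why letter counting is hard for a subword model.
# The subword split and token IDs are made up for this example.
word = "strawberry"
toy_tokens = ["str", "aw", "berry"]                 # assumed subword segmentation
toy_vocab = {"str": 101, "aw": 102, "berry": 103}   # assumed token IDs


def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a single letter in a string."""
    return text.count(letter)


# A character-level view can count directly.
char_count = count_letter(word, "r")

# A subword model only receives opaque IDs; the spelling of each token
# is not part of its input and has to be learned indirectly.
token_ids = [toy_vocab[t] for t in toy_tokens]

# Counting correctly from tokens requires knowing each token's spelling.
per_token = {t: count_letter(t, "r") for t in toy_tokens}

print(char_count, token_ids, per_token)
```

The point of the sketch is only that `token_ids` carries no explicit spelling information, so the per-token counts must be recovered from training rather than read off the input.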


A Method for Training-free Person Image Picture Generation

arXiv.org Artificial Intelligence

Current state-of-the-art diffusion models produce excellent images. However, the generated people are monotonous and largely reflect the distribution of people in the training set, which makes it challenging to generate multiple images of a fixed set of individuals. Usually this problem can be solved only by fine-tuning the model. Every individual or animated character must therefore be trained on before it can be drawn, and the hardware and cost of that training are often beyond the reach of average users, who make up the largest group. To solve this problem, this paper proposes the Character Image Feature Encoder, which lets the user simply provide a picture of a character so that the character in the generated image matches expectations. Various details can also be adjusted during generation using prompts. Unlike traditional Image-to-Image models, the Character Image Feature Encoder extracts only the relevant image features, rather than information about the model's composition or movements. Once trained, the Character Image Feature Encoder can be adapted to different models. The proposed model can be conveniently incorporated into the Stable Diffusion generation process without modifying the model's ontology, or used in combination with Stable Diffusion as a joint model.


Improving Diffusion Models for Scene Text Editing with Dual Encoders

arXiv.org Artificial Intelligence

Scene text editing is a challenging task that involves modifying or inserting specified texts in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and are unable to insert texts into images. Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction encoder for better style control. An instruction tuning framework is introduced to train our model to learn the mapping from the text instruction to the corresponding image with either the specified style or the style of the surrounding texts in the background. This training method further gives our method zero-shot generalization to the following three scenarios: generating text with unseen font variation, e.g., italic and bold, mixing different fonts to construct a new font, and using more relaxed forms of natural language as the instructions to guide the generation task. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE


ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders

arXiv.org Artificial Intelligence

Pre-trained text encoders have drawn sustained attention in natural language processing (NLP) and have shown promising results on different tasks. Recent studies have shown that external self-supervised signals (or knowledge extracted by unsupervised learning, such as n-grams) provide useful semantic evidence for understanding languages such as Chinese, and thereby improve performance on various downstream tasks. To further enhance the encoders, in this paper, we propose to pre-train n-gram-enhanced encoders with a large volume of data and advanced techniques for training. Moreover, we try to extend the encoder to different languages as well as different domains, where it is confirmed that the same architecture is applicable to these varying circumstances, and new state-of-the-art performance is observed on a long list of NLP tasks across languages and domains.


Improved Multi-Stage Training of Online Attention-based Encoder-Decoder Models

arXiv.org Machine Learning

Abhinav Garg, Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Mehul Kumar, Chanwoo Kim
Speech Processing Lab, AI Center, Samsung Research, Korea

Abstract: In this paper, we propose a refined multistage multi-task training strategy to improve the performance of online attention-based encoder-decoder (AED) models. A three-stage training based on three levels of architectural granularity, namely character encoder, byte pair encoding (BPE) based encoder, and attention decoder, is proposed. Also, multi-task learning based on two levels of linguistic granularity, namely character and BPE, is used. We explore different pre-training strategies for the encoders including transfer learning from a bidirectional encoder. Our models achieve a word error rate (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models respectively after fusion with long short-term memory (LSTM) based external language model (LM).

Index Terms: Attention based encoder-decoder models, online attention, multistage training, multi-task learning

1. Introduction: Recently, attention-based encoder-decoder (AED) models have gained popularity for developing end-to-end neural network based automatic speech recognition (ASR) systems [1, 2, 3]. One of the primary advantages of AED models is that the language information is tightly coupled into the decoder, obviating the need for an external language model (LM). AED models have been shown to perform better than other end-to-end models, namely, connectionist temporal classification (CTC) and recurrent neural network transducer (RNN-T) models [4].
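
For readers unfamiliar with the word error rate (WER) metric reported above, the following Python sketch shows the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example sentences are illustrative assumptions; this is not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% for very poor hypotheses.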


Training on Synthetic Noise Improves Robustness to Natural Noise in Machine Translation

arXiv.org Machine Learning

We consider the problem of making machine translation more robust to character-level variation at the source side, such as typos. Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and character-level encoders, but these methods are highly sensitive to spelling mistakes. We show how training on a mild amount of random synthetic noise can dramatically improve robustness to these variations, without diminishing performance on clean text. We focus on translation performance on natural noise, as captured by frequent corrections in Wikipedia edit logs, and show that robustness to such noise can be achieved using a balanced diet of simple synthetic noises at training time, without access to the natural noise data or distribution.
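
As a rough illustration of what training on simple synthetic noise could look like, the Python sketch below perturbs source-side words with random character edits. The four operations (delete, insert, substitute, swap), the lowercase alphabet, and the `rate` parameter are assumptions chosen for the example; the abstract does not specify the exact noise types or rates used.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed character inventory


def add_noise(token: str, rng: random.Random) -> str:
    """Apply one random character-level edit to a token."""
    if len(token) < 2:
        return token  # too short to edit safely
    op = rng.choice(["delete", "insert", "substitute", "swap"])
    i = rng.randrange(len(token) - 1)  # guarantees i + 1 is a valid index
    if op == "delete":
        return token[:i] + token[i + 1:]
    if op == "insert":
        return token[:i] + rng.choice(ALPHABET) + token[i:]
    if op == "substitute":
        return token[:i] + rng.choice(ALPHABET) + token[i + 1:]
    # swap two adjacent characters
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]


def noisy_sentence(sentence: str, rate: float, seed: int = 0) -> str:
    """Perturb each whitespace-separated word with probability `rate`."""
    rng = random.Random(seed)
    words = sentence.split()
    return " ".join(add_noise(w, rng) if rng.random() < rate else w for w in words)


print(noisy_sentence("robust translation needs noisy training data", rate=0.3))
```

In a training pipeline, a function like `noisy_sentence` would be applied to source sentences only, leaving the target side clean, so the model learns to map noisy inputs to correct translations.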


Resource Mention Extraction for MOOC Discussion Forums

arXiv.org Artificial Intelligence

In discussions hosted on discussion forums for Massive Online Open Courses (MOOCs), references to online learning resources are often of central importance. However, they are usually mentioned in free text, without appropriate hyperlinking to their associated resource. Automated learning resource mention hyperlinking and categorization will facilitate discussion and searching within MOOC forums, and also benefit the contextualization of such resources across disparate views. We propose the novel problem of learning resource mention identification in MOOC forums; i.e., to identify resource mentions in discussions, and classify them into predefined resource types. As this is a novel task with no publicly available data, we first contribute a large-scale labeled dataset, dubbed the Forum Resource Mention (FoRM) dataset, to facilitate our current research and future research on this task. FoRM contains over 10,000 real-world forum threads collected in collaboration with Coursera, with more than 23,000 manually labeled resource mentions. We then formulate this task as a sequence tagging problem and investigate solution architectures to address the problem. We identify two major challenges that hinder the application of sequence tagging models to the task: (1) the diversity of resource mention expression, and (2) long-range contextual dependencies. We address these challenges by incorporating character-level and thread context information into an LSTM-CRF model. First, we incorporate a character encoder to address the out-of-vocabulary problem caused by the diversity of mention expressions. Second, to address the context dependency challenge, we encode thread contexts using an RNN-based context encoder, and apply the attention mechanism to selectively leverage useful context information during sequence tagging. Experiments on FoRM show that the proposed method improves the baseline deep sequence tagging models notably, significantly bettering performance on instances that exemplify the two challenges.
Corresponding author: Liangming Pan (peterpan10211020@gmail.com). Preprint submitted to Elsevier, November 22, 2018.
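
The sequence tagging formulation mentioned in this abstract can be illustrated with a small Python sketch that decodes per-token tags into typed mention spans. It assumes the common BIO tagging scheme, and the resource types `VIDEO` and `QUIZ` are hypothetical; the abstract does not state which scheme or type inventory the FoRM dataset uses.

```python
def bio_to_spans(tokens, tags):
    """Convert BIO tags into (mention text, resource type) pairs."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((" ".join(tokens[start:i]), label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical forum sentence with hypothetical resource types.
tokens = ["See", "lecture", "video", "2", "before", "the", "quiz"]
tags = ["O", "B-VIDEO", "I-VIDEO", "I-VIDEO", "O", "O", "B-QUIZ"]
print(bio_to_spans(tokens, tags))
```

In this framing, a tagger such as the LSTM-CRF model described above would predict `tags`, and a decoder like `bio_to_spans` would recover the mentions and their types.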