spelling


Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses

Nguyen, Dong, Rosseel, Laura

arXiv.org Artificial Intelligence

Spelling variation (e.g. funnnn vs. fun) can influence the social perception of texts and their writers: we often have various associations with different forms of writing (is the text informal? does the writer seem young?). In this study, we focus on the social perception of spelling variation in online writing in English and study to what extent this perception is aligned between humans and large language models (LLMs). Building on sociolinguistic methodology, we compare LLM and human ratings on three key social attributes of spelling variation (formality, carefulness, age). We find generally strong correlations in the ratings between humans and LLMs. However, notable differences emerge when we analyze the distribution of ratings and when comparing between different types of spelling variation.


Post-OCR Text Correction for Bulgarian Historical Documents

Beshirov, Angel, Dobreva, Milena, Dimitrov, Dimitar, Hardalov, Momchil, Koychev, Ivan, Nakov, Preslav

arXiv.org Artificial Intelligence

The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem, as standard OCR tools are not tailored to deal with historical orthography or with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms, to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25\%, which is an increase of 16\% compared to the state of the art on the ICDAR 2019 Bulgarian dataset. We release our data and code at \url{https://github.com/angelbeshirov/post-ocr-text-correction}.


A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Neural Information Processing Systems

The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show that, due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session.


IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

Javed, Tahir, Nawale, Janki Atul, George, Eldho Ittan, Joshi, Sakshi, Bhogale, Kaushal Santosh, Mehendale, Deovrat, Sethi, Ishvinder Virender, Ananthanarayanan, Aparna, Faquih, Hafsah, Palit, Pratiti, Ravishankar, Sneha, Sukumaran, Saranya, Panchagnula, Tripura, Murali, Sunjay, Gandhi, Kunal Sharad, R, Ambujavalli, M, Manickam K, Vaijayanthi, C Venkata, Karunganni, Krishnan Srinivasa Raghavan, Kumar, Pratyush, Khapra, Mitesh M

arXiv.org Artificial Intelligence

We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.


Two Approaches to Diachronic Normalization of Polish Texts

Dudzic, Kacper, Graliński, Filip, Jassem, Krzysztof, Kubis, Marek, Wierzchoń, Piotr

arXiv.org Artificial Intelligence

This paper discusses two approaches to the diachronic normalization of Polish texts: a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the text-to-text transfer transformer architecture. The training and evaluation data prepared for the task are discussed in detail, along with experiments conducted to compare the proposed normalization solutions. A quantitative and qualitative analysis is presented. It is shown that at the current stage of inquiry into the problem, the rule-based solution outperforms the neural one on 3 out of 4 variants of the prepared dataset, although in practice both approaches have distinct advantages and disadvantages.
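A rule-based normalizer of the kind described above is essentially an ordered list of handcrafted (pattern, replacement) rules applied to archaic text. The two rules below are simplified illustrations of well-known Polish spelling changes (archaic "x" for "ks", "ya" endings for "ja"), not the actual rule set from the paper:

```python
# Sketch of a rule-based diachronic normalizer: ordered handcrafted rules
# applied in sequence. The rules are illustrative, not the paper's patterns.
import re

RULES = [
    (re.compile(r"x(?=[aeiouyąęó])"), "ks"),  # e.g. archaic "xiądz" -> "ksiądz"
    (re.compile(r"y(?=a\b)"), "j"),           # e.g. archaic "historya" -> "historja"
]

def normalize(text: str) -> str:
    # Apply each rule in order; later rules see earlier rules' output.
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

print(normalize("xiądz"))     # -> ksiądz
print(normalize("historya"))  # -> historja
```

Because the rules are applied in a fixed order, adding a new rule requires checking that it does not feed or undo an earlier one, which is one of the maintenance costs of the rule-based approach.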


Evaluating GPT-3.5 and GPT-4 on Grammatical Error Correction for Brazilian Portuguese

Penteado, Maria Carolina, Perez, Fábio

arXiv.org Artificial Intelligence

Although large language models (LLMs) have gained widespread attention for their performance in English language applications, recent studies have shown that they can produce good results for other languages. While the amount of data available for training LLMs in languages other than English is often more limited, the success of these models in tasks such as translation, language modeling, and sentiment analysis demonstrates their potential for improving language processing across a range of different languages.

We investigate the effectiveness of GPT-3.5 and GPT-4, two large language models, as Grammatical Error Correction (GEC) tools for Brazilian Portuguese and compare their performance against Microsoft Word and Google Docs. We introduce a GEC dataset for Brazilian Portuguese with four categories: Grammar, Spelling, Internet, and Fast typing. Our results show that ...


Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency

Karita, Shigeki, Sproat, Richard, Ishikawa, Haruko

arXiv.org Artificial Intelligence

Word error rate (WER) and character error rate (CER) are standard metrics in Automatic Speech Recognition (ASR), but one problem has always been alternative spellings: if one's system transcribes adviser whereas the ground truth has advisor, this will count as an error even though the two spellings really represent the same word. Japanese is notorious for "lacking orthography": most words can be spelled in multiple ways, presenting a problem for accurate ASR evaluation. In this paper we propose a new lenient evaluation metric as a more defensible CER measure for Japanese ASR. We create a lattice of plausible respellings of the reference transcription, using a combination of lexical resources, a Japanese text-processing system, and a neural machine translation model for reconstructing kanji from hiragana or katakana. In a manual evaluation, raters rated 95.4% of the proposed spelling variants as plausible. ASR results show that our method, which does not penalize the system for choosing a valid alternate spelling of a word, affords a 2.4%-3.1% absolute reduction in CER depending on the task.
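The core idea of a lenient metric like the one above can be sketched as scoring the hypothesis against every plausible respelling of the reference and keeping the minimum. The toy respelling list below stands in for the paper's lattice:

```python
# Sketch of a "lenient" character error rate: take the minimum CER over a set
# of plausible respellings of the reference, so a valid alternate spelling is
# not penalized. The respelling sets here are toy examples.
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance, one-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution/match
    return dp[len(b)]

def lenient_cer(hypothesis: str, respellings: list[str]) -> float:
    # Score against every accepted respelling and keep the best (lowest) CER.
    return min(edit_distance(hypothesis, r) / len(r) for r in respellings)

# Plain CER penalizes the valid variant; the lenient score does not.
print(lenient_cer("adviser", ["advisor"]))             # 1/7
print(lenient_cer("adviser", ["advisor", "adviser"]))  # 0.0
```

In the paper's setting the accepted respellings come from a lattice built with lexical resources and a kana-to-kanji reconstruction model, rather than from a hand-written list as here.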


A primer on getting neologisms from foreign languages to under-resourced languages

Camacho, Luis

arXiv.org Artificial Intelligence

Neologisms are certain uses, expressions, and words that did not traditionally exist in a language, but are incorporated into it due to the need of speakers to adapt to a new reality [1]. That is, neologisms are those new words and expressions that speakers incorporate into a language as new things, and new ways of doing things, arise and need to be named. They are the exact opposite of archaisms. The appearance of neologisms is a common and ordinary process in all languages, forced as they are to adapt and update or die. However, a word can be considered a neologism only for a certain time, since once it has been incorporated and normalized as part of the language, it simply ceases to be a novelty. The simplest way to classify neologisms would be by the method used to create them; thus we have: 1. morphological neologisms: they are built using words that already exist in the language, through the processes of composition or derivation. For example, the word "aircraft" was once a neologism, formed by composition from the existing words "air" and "craft". This also happens with "teleoperators" or with "biosecurity".


Spelling convention sensitivity in neural language models

Nielsen, Elizabeth, Kirov, Christo, Roark, Brian

arXiv.org Artificial Intelligence

We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistently one or the other within model-generated strings. In contrast to long-distance dependencies in non-surface underlying structure (e.g., syntax), spelling consistency is easier to measure both in LMs and the text corpora used to train them, which can provide additional insight into certain observed model behaviors. Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency. A large T5 language model does appear to internalize this consistency, though only with respect to observed lexical items (not nonce words with British/American spelling patterns). We further experiment with correcting for biases in the training data by fine-tuning T5 on synthetic data that has been debiased, and find that fine-tuned T5 remains only somewhat sensitive to spelling consistency. Further experiments show GPT2 to be similarly limited.
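The probe-word measurement described above can be sketched as counting British- and American-specific spellings in a generated string and checking how strongly one convention dominates. The probe lists here are tiny illustrative samples, not the paper's word sets:

```python
# Sketch of a probe-word consistency score: the fraction of observed probe
# words that belong to the majority spelling convention. Probe lists are
# small illustrative samples.
BRITISH = {"colour", "honour", "analyse", "centre"}
AMERICAN = {"color", "honor", "analyze", "center"}

def spelling_consistency(text: str) -> float:
    tokens = text.lower().split()
    b = sum(t in BRITISH for t in tokens)
    a = sum(t in AMERICAN for t in tokens)
    if a + b == 0:
        return 1.0  # no probe words observed; vacuously consistent
    return max(a, b) / (a + b)

# Two British probes and one American probe -> majority share 2/3.
print(spelling_consistency("the colour of the centre honor"))
```

A score of 1.0 means a string is fully consistent with one convention; applying this over many model-generated strings gives the kind of corpus-level consistency estimate the paper reports.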


AI image generator Midjourney blocks porn by banning words about the human reproductive system

MIT Technology Review

Midjourney's founder, David Holz, says it's banning these words as a stopgap measure to prevent people from generating shocking or gory content while the company "improves things on the AI side." Holz says moderators watch how words are being used and what kinds of images are being generated, and adjust the bans periodically. The firm has a community guidelines page that lists the type of content it blocks in this way, including sexual imagery, gore, and even an emoji that is often used as a symbol for the buttocks. AI models such as Midjourney, DALL-E 2, and Stable Diffusion are trained on billions of images that have been scraped from the internet. Research by a team at the University of Washington has found that such models learn biases that sexually objectify women, which are then reflected in the images they produce.