Vocabulary


The Role of Vocabularies in Learning Sparse Representations for Ranking

Kim, Hiun, Lee, Tae Kwan, Won, Taeryun

arXiv.org Artificial Intelligence

Learned Sparse Retrieval (LSR) models such as SPLADE have attracted growing interest for effective semantic first-stage matching while enjoying the efficiency of inverted indices. A recent work on learning SPLADE models with expanded vocabularies (ESPLADE) proposed representing queries and documents in a sparse space over a custom vocabulary with a different level of vocabulary granularity. Within this effort, however, there have been few studies on the role of vocabulary in SPLADE models and its relationship to retrieval efficiency and effectiveness. To study this, we construct BERT models with 100K-sized output vocabularies, one initialized with the ESPLADE pretraining method and one initialized randomly. After fine-tuning on real-world search click logs, we apply logit score-based pruning of queries and documents to a maximum size to further balance efficiency. The experimental results on our evaluation set show that, when pruning is applied, the two models are effective compared to the 32K-sized normal SPLADE model at a computational budget below that of BM25. The ESPLADE model is also more effective than the random-vocabulary model while having a similar retrieval cost. These results indicate that the size and pretrained weights of output vocabularies configure the representational specification for queries, documents, and their interactions in the retrieval engine, beyond their original meaning and purpose in NLP. These findings open new room for improving LSR by identifying the importance of the representational specification derived from vocabulary configuration for efficient and effective retrieval.
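The logit score-based pruning mentioned above can be sketched in a few lines: keep only the highest-scoring vocabulary terms of a sparse representation so postings lists stay small. This is a minimal illustration, not the paper's code; the example terms, scores, and the cutoff value are assumptions.

```python
# Hedged sketch of logit score-based pruning of a sparse representation:
# keep only the max_terms entries with the largest logit scores, which
# shrinks the inverted-index postings and the scoring cost.

def prune_sparse_rep(rep: dict[str, float], max_terms: int) -> dict[str, float]:
    """Keep the max_terms entries with the largest logit scores."""
    top = sorted(rep.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    return dict(top)

# Toy document representation over a term vocabulary (scores are made up).
doc = {"retrieval": 2.1, "sparse": 1.7, "index": 0.9, "the": 0.2, "a": 0.1}
pruned = prune_sparse_rep(doc, max_terms=3)
```

The same cap would be applied to query representations, trading a little effectiveness for a bounded retrieval cost.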


Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Neural Information Processing Systems

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K.
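The IsoFLOPs approach named above can be illustrated schematically: at each fixed compute budget, pick the vocabulary size with the lowest loss, then fit a power law through those optima. The run data and losses below are synthetic placeholders, not the paper's measurements.

```python
import math

# Illustrative IsoFLOPs-style analysis: for each compute budget, choose
# the vocabulary size that minimizes loss, then fit log V = a + b log C
# by least squares. All numbers here are synthetic.

def optimal_vocab_per_budget(runs):
    """runs: {flops: [(vocab_size, loss), ...]} -> {flops: best vocab size}"""
    return {c: min(pts, key=lambda p: p[1])[0] for c, pts in runs.items()}

def fit_power_law(points):
    """Least-squares fit of log V = a + b * log C over (C, V) pairs."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(v) for _, v in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

runs = {
    1e19: [(16_000, 3.10), (32_000, 3.05), (64_000, 3.08)],
    1e20: [(32_000, 2.80), (64_000, 2.74), (128_000, 2.78)],
    1e21: [(64_000, 2.52), (128_000, 2.47), (256_000, 2.50)],
}
best = optimal_vocab_per_budget(runs)
a, b = fit_power_law(sorted(best.items()))
# A positive exponent b means larger budgets favour larger vocabularies.
```

Extrapolating such a fit to a target budget is what yields predictions like the 216K figure for Llama2-70B.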


Opening the Vocabulary of Egocentric Actions

Neural Information Processing Systems

Human actions in egocentric videos often feature hand-object interactions composed of a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations -- sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder.


Leveraging Ontologies to Document Bias in Data

Russo, Mayra, Vidal, Maria-Esther

arXiv.org Artificial Intelligence

The breakthroughs and benefits attributed to big data and, consequently, to machine learning (ML) or AI systems [1, 2], have also made prevalent how these systems are capable of producing unexpected, biased, and in some cases, undesirable output [3, 4, 5]. Seminal work on bias (i.e., prejudice for, or against one person, or group, especially in a way considered to be unfair) in the context of ML systems demonstrates how facial recognition tools and popular search engines can exacerbate demographic disparities, worsening the marginalization of minorities at the individual and group level [6, 7]. Further, biases in news recommenders and social media feeds actively play a role in conditioning and manipulating people's behavior and amplifying individual and public opinion polarization [8, 9]. In this context, the last few years have seen the consolidation of the Trustworthy AI framework, led in large part by regulatory bodies [10], with the objective of guiding commercial AI development to proactively account for ethical, legal, and technical dimensions [11]. Furthermore, this framework is also accompanied by the call to establish standards across the field in order to ensure AI systems are safe, secure, and fair upon deployment [11]. In terms of AI bias, many efforts have been concentrated on devising methods that can improve its identification, understanding, measurement, and mitigation [12]. For example, the special publication prepared by the National Institute of Standards and Technology (NIST) proposes a thorough, though not exhaustive, categorization of different types of bias in AI beyond common computational definitions (see Figure 1 for core hierarchy) [13].
In this same direction, some scholars advocate for practices that account for the characteristics of ML pipelines (i.e., datasets, ML algorithms, and user interaction loop) [14] to enable actors concerned with its research, development, regulation, and use, to inspect all the actions performed across the engineering process, with the objective to increase trust placed not only on the development processes, but on the systems themselves [15, 16, 17, 18].


Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5

Wang, Qiao, Rose, Ralph, Orita, Naho, Sugawara, Ayaka

arXiv.org Artificial Intelligence

A common way of assessing language learners' mastery of vocabulary is via multiple-choice cloze (i.e., fill-in-the-blank) questions. But the creation of test items can be laborious for individual teachers or in large-scale language programs. In this paper, we evaluate a new method for automatically generating these types of questions using large language models (LLMs). The VocaTT (vocabulary teaching and training) engine is written in Python and comprises three basic steps: pre-processing target word lists, generating sentences and candidate word options using GPT, and finally selecting suitable word options. To test the efficiency of this system, 60 questions were generated targeting academic words. The generated items were reviewed by expert reviewers who judged the well-formedness of the sentences and word options, adding comments to items judged not well-formed. Results showed a 75% rate of well-formedness for sentences and a 66.85% rate for suitable word options. This is a marked improvement over the generator used earlier in our research, which did not take advantage of GPT's capabilities. Post-hoc qualitative analysis reveals several points for improvement in future work, including cross-referencing part-of-speech tagging, better sentence validation, and improving GPT prompts.
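The item-assembly stage of such a pipeline can be sketched as follows: blank out the target word in a GPT-generated carrier sentence and filter candidate options with simple heuristics. The function names and the filtering rules are illustrative assumptions; the real engine relies on GPT for generation and on expert review for judging items.

```python
# Hedged sketch of a cloze-item assembly step: create the blanked stem
# and drop candidate options that are identical to the target word or
# trivially malformed (e.g., contain non-letter characters).

def make_cloze(sentence: str, target: str) -> str:
    """Blank out the first occurrence of the target word."""
    return sentence.replace(target, "_____", 1)

def filter_options(target: str, candidates: list[str]) -> list[str]:
    """Keep distractors that differ from the target and are purely alphabetic."""
    return [c for c in candidates
            if c.lower() != target.lower() and c.isalpha()]

sentence = "The committee will allocate funds to each department."
stem = make_cloze(sentence, "allocate")
options = filter_options("allocate", ["assign", "allocate", "distribute", "dev8te"])
item = {"stem": stem, "options": sorted(options + ["allocate"])}
```

Heuristics like these would complement, not replace, the human well-formedness review described in the abstract.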


On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation

Chen, Liang, Ma, Shuming, Zhang, Dongdong, Wei, Furu, Chang, Baobao

arXiv.org Artificial Intelligence

While multilingual neural machine translation has achieved great success, it suffers from the off-target issue, where the translation is in the wrong language. This problem is more pronounced on zero-shot translation tasks. In this work, we find that failing to encode a discriminative target-language signal leads to off-target translation, and that a closer lexical distance (i.e., KL-divergence) between two languages' vocabularies is associated with a higher off-target rate. We also find that simply isolating the vocabularies of different languages in the decoder can alleviate the problem. Motivated by these findings, we propose Language Aware Vocabulary Sharing (LAVS), a simple and effective algorithm for constructing the multilingual vocabulary that greatly alleviates the off-target problem by increasing the KL-divergence between languages. We conduct experiments on a multilingual machine translation benchmark in 11 languages. Experiments show that the off-target rate for 90 translation tasks is reduced from 29% to 8%, while the overall BLEU score is improved by an average of 1.9 points without extra training cost or sacrificing performance on the supervised directions. We release the code at https://github.com/PKUnlp-icler/Off-Target-MNMT for reproduction.
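The lexical-distance measure used above can be sketched directly: compute the KL-divergence between two languages' token frequency distributions over a shared vocabulary. The toy corpora and the smoothing constant are illustrative assumptions, not the paper's setup.

```python
import math
from collections import Counter

# Hedged sketch of KL-divergence between two languages' token frequency
# distributions over a shared vocabulary, with additive smoothing so no
# probability is exactly zero.

def token_dist(tokens, vocab, eps=1e-6):
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab) + eps * len(vocab)
    return {t: (counts[t] + eps) / total for t in vocab}

def kl_divergence(p, q):
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

vocab = ["the", "house", "das", "haus", "is", "ist"]
en = token_dist("the house is the house".split(), vocab)
de = token_dist("das haus ist das haus".split(), vocab)
d = kl_divergence(en, de)
# LAVS aims to *increase* this divergence so the decoder can tell the
# target languages apart; identical distributions give divergence 0.
```

Under this framing, vocabulary construction becomes a knob for pushing languages apart in lexical space rather than purely a compression choice.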


Converting Your Audio to Text with Amazon Transcribe

#artificialintelligence

Amazon Transcribe is one of Amazon Web Services' (AWS) machine learning offerings. You input audio or video; Transcribe converts it to text, allowing you to identify the languages used and the number of speakers in the process. You can then take this transcription and do multiple things with it, including search, analytics, subtitles, translations, or even feeding it back into Amazon Polly to read your transcription back to you. When you start a Transcribe job, you're asked to pick out the language that's being spoken -- or have Transcribe automatically detect it for you. Also, there was really no rhyme or reason to the words and phrases I picked, other than they were the first that came to mind!
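The workflow described above can be sketched with boto3, AWS's Python SDK. The bucket, job name, and file below are placeholders; the parameter names follow the boto3 `start_transcription_job` API, and setting `IdentifyLanguage` instead of `LanguageCode` asks Transcribe to detect the spoken language for you.

```python
# Hedged sketch of starting an Amazon Transcribe job. We build the
# request parameters first; the actual AWS call is shown commented out
# since it requires credentials and a real S3 object.

def build_transcribe_params(job_name: str, media_uri: str,
                            identify_language: bool = False) -> dict:
    params = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp3",
    }
    if identify_language:
        # Let Transcribe detect the language instead of declaring it.
        params["IdentifyLanguage"] = True
    else:
        params["LanguageCode"] = "en-US"
    return params

params = build_transcribe_params("demo-job", "s3://my-bucket/talk.mp3",
                                 identify_language=True)
# import boto3
# boto3.client("transcribe").start_transcription_job(**params)
```

The resulting transcript JSON can then be fetched and fed into downstream steps such as search, subtitling, or Amazon Polly.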


WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain

Shah, Raj Sanjay, Chawla, Kunal, Eidnani, Dheeraj, Shah, Agam, Du, Wendi, Chava, Sudheer, Raman, Natraj, Smiley, Charese, Chen, Jiaao, Yang, Diyi

arXiv.org Artificial Intelligence

Pre-trained language models have shown impressive performance on a variety of tasks and domains. Previous research on financial language models usually employs a generic training scheme to train standard model architectures, without completely leveraging the richness of the financial data. We propose a novel domain-specific Financial LANGuage model (FLANG) which uses financial keywords and phrases for better masking, together with a span boundary objective and an in-filling objective. Additionally, the evaluation benchmarks in the field have been limited. To this end, we contribute the Financial Language Understanding Evaluation (FLUE), an open-source comprehensive suite of benchmarks for the financial domain. These include new benchmarks across 5 NLP tasks in the financial domain as well as common benchmarks used in previous research. Experiments on these benchmarks suggest that our model outperforms those in the prior literature on a variety of NLP tasks. Our models, code, and benchmark data are publicly available on Github and Huggingface.
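The keyword-aware masking idea can be illustrated with a small sketch: when a financial keyword phrase appears in the token sequence, mask the whole span rather than isolated tokens. The keyword list and matching logic are assumptions for illustration, not FLANG's actual implementation.

```python
# Hedged sketch of keyword-span masking in the spirit of FLANG: whole
# financial keyword phrases are replaced by [MASK] tokens so the model
# must reconstruct the full domain term from context.

KEYWORDS = {("interest", "rate"), ("net", "income")}

def mask_keywords(tokens: list[str]) -> list[str]:
    out, i = [], 0
    while i < len(tokens):
        matched = False
        for phrase in KEYWORDS:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.extend(["[MASK]"] * n)  # mask the whole keyword span
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out

masked = mask_keywords("the interest rate rose sharply".split())
```

Masking coherent domain phrases, rather than random subwords, is what forces the model to learn financial terminology instead of generic token statistics.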


How Masterly Are People at Playing with Their Vocabulary? Analysis of the Wordle Game for Latvian

Rikters, Matīss, Reinsone, Sanita

arXiv.org Artificial Intelligence

In this paper, we describe the adaptation of a simple word guessing game that has occupied the hearts and minds of people around the world. There are versions for all three Baltic countries, and even several versions of each. We specifically pay attention to the Latvian version and look into how people form their guesses given any already uncovered hints. The paper analyses guess patterns, characteristics of easy and difficult words, and player behaviour and response.
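The hint mechanism the analysis builds on can be sketched as the standard Wordle feedback rule: green for a letter in the right spot, yellow for a letter present elsewhere, gray for absent, with the usual handling of repeated letters. This is a generic illustration of the game logic, not the authors' analysis code.

```python
from collections import Counter

# Hedged sketch of Wordle-style feedback: 'G' = right letter, right
# spot; 'Y' = right letter, wrong spot; '-' = absent. Repeated letters
# only earn yellows while unmatched copies remain in the answer.

def score_guess(guess: str, answer: str) -> str:
    result = ["-"] * len(guess)
    # Count answer letters not consumed by exact (green) matches.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)
```

Given a sequence of such feedback strings, one can reconstruct exactly which hints a player had seen before each guess, which is the basis for analysing guess patterns.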


Theory Behind the Basics of NLP - Analytics Vidhya

#artificialintelligence

This article was published as a part of the Data Science Blogathon. Natural Language Processing (NLP) can help you understand the sentiment of any text. This is helpful for understanding the emotions in, and the type of, the text being read; negative and positive comments can be easily differentiated. NLP aims to make machines understand text or comments the same way humans can.