Vocabulary


The Role of Vocabularies in Learning Sparse Representations for Ranking

Kim, Hiun, Lee, Tae Kwan, Won, Taeryun

arXiv.org Artificial Intelligence

Learned Sparse Retrieval (LSR) models such as SPLADE have attracted growing interest for effective semantic first-stage matching while enjoying the efficiency of inverted indices. A recent work on learning SPLADE models with expanded vocabularies (ESPLADE) proposed representing queries and documents in a sparse space over a custom vocabulary with a different level of vocabulary granularity. Within this effort, however, there have been few studies on the role of vocabulary in SPLADE models and its relationship to retrieval efficiency and effectiveness. To study this, we construct BERT models with 100K-sized output vocabularies, one initialized with the ESPLADE pretraining method and one initialized randomly. After fine-tuning on real-world search click logs, we apply logit score-based pruning of queries and documents to a maximum size to further balance efficiency. The experimental results on our evaluation set show that, when pruning is applied, the two models are effective compared to the 32K-sized normal SPLADE model at a computational budget below that of BM25. The ESPLADE model is also more effective than the random-vocabulary model while having a similar retrieval cost. These results indicate that the size and pretrained weights of output vocabularies configure the representational specification for queries, documents, and their interactions in the retrieval engine, beyond their original meaning and purpose in NLP. These findings open new room for improving LSR by identifying the importance of the representational specification derived from vocabulary configuration for efficient and effective retrieval.
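The logit score-based pruning mentioned above can be sketched in a few lines: keep only the highest-scoring vocabulary terms of a sparse representation so postings lists stay small. This is a minimal illustration, not the paper's code; the example terms, scores, and the cutoff value are assumptions.

```python
# Hedged sketch of logit score-based pruning of a sparse representation:
# keep only the max_terms entries with the largest logit scores, which
# shrinks the inverted-index postings and the scoring cost.

def prune_sparse_rep(rep: dict[str, float], max_terms: int) -> dict[str, float]:
    """Keep the max_terms entries with the largest logit scores."""
    top = sorted(rep.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    return dict(top)

# Toy document representation over a term vocabulary (scores are made up).
doc = {"retrieval": 2.1, "sparse": 1.7, "index": 0.9, "the": 0.2, "a": 0.1}
pruned = prune_sparse_rep(doc, max_terms=3)
```

The same cap would be applied to query representations, trading a little effectiveness for a bounded retrieval cost.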


Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Neural Information Processing Systems

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K.
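The IsoFLOPs approach named above can be illustrated schematically: at each fixed compute budget, pick the vocabulary size with the lowest loss, then fit a power law through those optima. The run data and losses below are synthetic placeholders, not the paper's measurements.

```python
import math

# Illustrative IsoFLOPs-style analysis: for each compute budget, choose
# the vocabulary size that minimizes loss, then fit log V = a + b log C
# by least squares. All numbers here are synthetic.

def optimal_vocab_per_budget(runs):
    """runs: {flops: [(vocab_size, loss), ...]} -> {flops: best vocab size}"""
    return {c: min(pts, key=lambda p: p[1])[0] for c, pts in runs.items()}

def fit_power_law(points):
    """Least-squares fit of log V = a + b * log C over (C, V) pairs."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(v) for _, v in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

runs = {
    1e19: [(16_000, 3.10), (32_000, 3.05), (64_000, 3.08)],
    1e20: [(32_000, 2.80), (64_000, 2.74), (128_000, 2.78)],
    1e21: [(64_000, 2.52), (128_000, 2.47), (256_000, 2.50)],
}
best = optimal_vocab_per_budget(runs)
a, b = fit_power_law(sorted(best.items()))
# A positive exponent b means larger budgets favour larger vocabularies.
```

Extrapolating such a fit to a target budget is what yields predictions like the 216K figure for Llama2-70B.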


Opening the Vocabulary of Egocentric Actions

Neural Information Processing Systems

Human actions in egocentric videos often feature hand-object interactions composed of a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations -- sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder.


Leveraging Ontologies to Document Bias in Data

Russo, Mayra, Vidal, Maria-Esther

arXiv.org Artificial Intelligence

The breakthroughs and benefits attributed to big data and, consequently, to machine learning (ML) or AI systems [1, 2], have also made prevalent how these systems are capable of producing unexpected, biased, and in some cases, undesirable output [3, 4, 5]. Seminal work on bias (i.e., prejudice for, or against one person, or group, especially in a way considered to be unfair) in the context of ML systems demonstrates how facial recognition tools and popular search engines can exacerbate demographic disparities, worsening the marginalization of minorities at the individual and group level [6, 7]. Further, biases in news recommenders and social media feeds actively play a role in conditioning and manipulating people's behavior and amplifying individual and public opinion polarization [8, 9]. In this context, the last few years have seen the consolidation of the Trustworthy AI framework, led in large part by regulatory bodies [10], with the objective of guiding commercial AI development to proactively account for ethical, legal, and technical dimensions [11]. Furthermore, this framework is also accompanied by the call to establish standards across the field in order to ensure AI systems are safe, secure, and fair upon deployment [11]. In terms of AI bias, many efforts have been concentrated on devising methods that can improve its identification, understanding, measurement, and mitigation [12]. For example, the special publication prepared by the National Institute of Standards and Technology (NIST) proposes a thorough, though not exhaustive, categorization of different types of bias in AI beyond common computational definitions (see Figure 1 for core hierarchy) [13].
In this same direction, some scholars advocate for practices that account for the characteristics of ML pipelines (i.e., datasets, ML algorithms, and user interaction loop) [14] to enable actors concerned with its research, development, regulation, and use, to inspect all the actions performed across the engineering process, with the objective to increase trust placed not only on the development processes, but on the systems themselves [15, 16, 17, 18].


Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5

Wang, Qiao, Rose, Ralph, Orita, Naho, Sugawara, Ayaka

arXiv.org Artificial Intelligence

A common way of assessing language learners' mastery of vocabulary is via multiple-choice cloze (i.e., fill-in-the-blank) questions. But the creation of test items can be laborious for individual teachers or in large-scale language programs. In this paper, we evaluate a new method for automatically generating these types of questions using large language models (LLMs). The VocaTT (vocabulary teaching and training) engine is written in Python and comprises three basic steps: pre-processing target word lists, generating sentences and candidate word options using GPT, and finally selecting suitable word options. To test the efficiency of this system, 60 questions were generated targeting academic words. The generated items were reviewed by expert reviewers who judged the well-formedness of the sentences and word options, adding comments to items judged not well-formed. Results showed a 75% rate of well-formedness for sentences and a 66.85% rate for suitable word options. This is a marked improvement over the generator used earlier in our research, which did not take advantage of GPT's capabilities. Post-hoc qualitative analysis reveals several points for improvement in future work, including cross-referencing part-of-speech tagging, better sentence validation, and improving GPT prompts.
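The item-assembly stage of such a pipeline can be sketched as follows: blank out the target word in a GPT-generated carrier sentence and filter candidate options with simple heuristics. The function names and the filtering rules are illustrative assumptions; the real engine relies on GPT for generation and on expert review for judging items.

```python
# Hedged sketch of a cloze-item assembly step: create the blanked stem
# and drop candidate options that are identical to the target word or
# trivially malformed (e.g., contain non-letter characters).

def make_cloze(sentence: str, target: str) -> str:
    """Blank out the first occurrence of the target word."""
    return sentence.replace(target, "_____", 1)

def filter_options(target: str, candidates: list[str]) -> list[str]:
    """Keep distractors that differ from the target and are purely alphabetic."""
    return [c for c in candidates
            if c.lower() != target.lower() and c.isalpha()]

sentence = "The committee will allocate funds to each department."
stem = make_cloze(sentence, "allocate")
options = filter_options("allocate", ["assign", "allocate", "distribute", "dev8te"])
item = {"stem": stem, "options": sorted(options + ["allocate"])}
```

Heuristics like these would complement, not replace, the human well-formedness review described in the abstract.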


On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation

Chen, Liang, Ma, Shuming, Zhang, Dongdong, Wei, Furu, Chang, Baobao

arXiv.org Artificial Intelligence

While multilingual neural machine translation has achieved great success, it suffers from the off-target issue, where the translation is in the wrong language. This problem is more pronounced on zero-shot translation tasks. In this work, we find that failing to encode a discriminative target-language signal leads to off-target translation, and that a closer lexical distance (i.e., KL-divergence) between two languages' vocabularies is associated with a higher off-target rate. We also find that simply isolating the vocabularies of different languages in the decoder can alleviate the problem. Motivated by these findings, we propose Language Aware Vocabulary Sharing (LAVS), a simple and effective algorithm for constructing the multilingual vocabulary that greatly alleviates the off-target problem by increasing the KL-divergence between languages. We conduct experiments on a multilingual machine translation benchmark in 11 languages. Experiments show that the off-target rate for 90 translation tasks is reduced from 29% to 8%, while the overall BLEU score is improved by an average of 1.9 points without extra training cost or sacrificing performance on the supervised directions. We release the code at https://github.com/PKUnlp-icler/Off-Target-MNMT for reproduction.
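The lexical-distance measure used above can be sketched directly: compute the KL-divergence between two languages' token frequency distributions over a shared vocabulary. The toy corpora and the smoothing constant are illustrative assumptions, not the paper's setup.

```python
import math
from collections import Counter

# Hedged sketch of KL-divergence between two languages' token frequency
# distributions over a shared vocabulary, with additive smoothing so no
# probability is exactly zero.

def token_dist(tokens, vocab, eps=1e-6):
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab) + eps * len(vocab)
    return {t: (counts[t] + eps) / total for t in vocab}

def kl_divergence(p, q):
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

vocab = ["the", "house", "das", "haus", "is", "ist"]
en = token_dist("the house is the house".split(), vocab)
de = token_dist("das haus ist das haus".split(), vocab)
d = kl_divergence(en, de)
# LAVS aims to *increase* this divergence so the decoder can tell the
# target languages apart; identical distributions give divergence 0.
```

Under this framing, vocabulary construction becomes a knob for pushing languages apart in lexical space rather than purely a compression choice.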


Converting Your Audio to Text with Amazon Transcribe

#artificialintelligence

Amazon Transcribe is one of Amazon Web Services' (AWS) machine learning offerings. You input audio or video; Transcribe converts it to text, allowing you to identify the languages used and the number of speakers in the process. You can then take this transcription and do multiple things with it, including search, analytics, subtitles, translations, or even feeding it back into Amazon Polly to read your transcription back to you. When you start a Transcribe job, you're asked to pick out the language that's being spoken -- or have Transcribe automatically detect it for you. Also, there was really no rhyme or reason to the words and phrases I picked, other than they were the first that came to mind!
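The workflow described above can be sketched with boto3, AWS's Python SDK. The bucket, job name, and file below are placeholders; the parameter names follow the boto3 `start_transcription_job` API, and setting `IdentifyLanguage` instead of `LanguageCode` asks Transcribe to detect the spoken language for you.

```python
# Hedged sketch of starting an Amazon Transcribe job. We build the
# request parameters first; the actual AWS call is shown commented out
# since it requires credentials and a real S3 object.

def build_transcribe_params(job_name: str, media_uri: str,
                            identify_language: bool = False) -> dict:
    params = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp3",
    }
    if identify_language:
        # Let Transcribe detect the language instead of declaring it.
        params["IdentifyLanguage"] = True
    else:
        params["LanguageCode"] = "en-US"
    return params

params = build_transcribe_params("demo-job", "s3://my-bucket/talk.mp3",
                                 identify_language=True)
# import boto3
# boto3.client("transcribe").start_transcription_job(**params)
```

The resulting transcript JSON can then be fetched and fed into downstream steps such as search, subtitling, or Amazon Polly.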


WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain

Shah, Raj Sanjay, Chawla, Kunal, Eidnani, Dheeraj, Shah, Agam, Du, Wendi, Chava, Sudheer, Raman, Natraj, Smiley, Charese, Chen, Jiaao, Yang, Diyi

arXiv.org Artificial Intelligence

Pre-trained language models have shown impressive performance on a variety of tasks and domains. Previous research on financial language models usually employs a generic training scheme to train standard model architectures, without completely leveraging the richness of the financial data. We propose a novel domain-specific Financial LANGuage model (FLANG) which uses financial keywords and phrases for better masking, together with a span boundary objective and an in-filling objective. Additionally, the evaluation benchmarks in the field have been limited. To this end, we contribute the Financial Language Understanding Evaluation (FLUE), an open-source comprehensive suite of benchmarks for the financial domain. These include new benchmarks across 5 NLP tasks in the financial domain as well as common benchmarks used in previous research. Experiments on these benchmarks suggest that our model outperforms those in the prior literature on a variety of NLP tasks. Our models, code, and benchmark data are publicly available on Github and Huggingface.
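The keyword-aware masking idea can be illustrated with a small sketch: when a financial keyword phrase appears in the token sequence, mask the whole span rather than isolated tokens. The keyword list and matching logic are assumptions for illustration, not FLANG's actual implementation.

```python
# Hedged sketch of keyword-span masking in the spirit of FLANG: whole
# financial keyword phrases are replaced by [MASK] tokens so the model
# must reconstruct the full domain term from context.

KEYWORDS = {("interest", "rate"), ("net", "income")}

def mask_keywords(tokens: list[str]) -> list[str]:
    out, i = [], 0
    while i < len(tokens):
        matched = False
        for phrase in KEYWORDS:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.extend(["[MASK]"] * n)  # mask the whole keyword span
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out

masked = mask_keywords("the interest rate rose sharply".split())
```

Masking coherent domain phrases, rather than random subwords, is what forces the model to learn financial terminology instead of generic token statistics.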


How Masterly Are People at Playing with Their Vocabulary? Analysis of the Wordle Game for Latvian

Rikters, Matīss, Reinsone, Sanita

arXiv.org Artificial Intelligence

In this paper, we describe the adaptation of a simple word guessing game that has occupied the hearts and minds of people around the world. There are versions for all three Baltic countries, and even several versions of each. We specifically pay attention to the Latvian version and look into how people form their guesses given any already uncovered hints. The paper analyses guess patterns, characteristics of easy and difficult words, and player behaviour and response.
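The hint mechanism the analysis builds on can be sketched as the standard Wordle feedback rule: green for a letter in the right spot, yellow for a letter present elsewhere, gray for absent, with the usual handling of repeated letters. This is a generic illustration of the game logic, not the authors' analysis code.

```python
from collections import Counter

# Hedged sketch of Wordle-style feedback: 'G' = right letter, right
# spot; 'Y' = right letter, wrong spot; '-' = absent. Repeated letters
# only earn yellows while unmatched copies remain in the answer.

def score_guess(guess: str, answer: str) -> str:
    result = ["-"] * len(guess)
    # Count answer letters not consumed by exact (green) matches.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)
```

Given a sequence of such feedback strings, one can reconstruct exactly which hints a player had seen before each guess, which is the basis for analysing guess patterns.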


Theory Behind the Basics of NLP - Analytics Vidhya

#artificialintelligence

This article was published as a part of the Data Science Blogathon. Natural Language Processing (NLP) can help you understand the sentiment of any text. This is helpful for understanding the emotions in, and the type of, the text being read; negative and positive comments can be easily differentiated. NLP aims to make machines understand text or comments the same way humans can.