AITopics

Country: Europe > Luxembourg (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.34)

Neural Information Processing SystemsFeb-17-2026, 17:11:55 GMT

A Appendix

Usage Application The model is intended for open-vocabulary object detection.Known Caveats

artificial intelligence, machine learning, social media, (17 more...)

Country: North America (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.69)

Neural Information Processing SystemsDec-25-2025, 07:33:17 GMT

Autoregressive Search Engines: Generating Substrings as Document Identifiers

Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive language models are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to the retrieval problem with minimal intervention to the models' architecture. Previous work has explored ways to partition the search space into hierarchical structures and retrieve documents by autoregressively generating their unique identifier. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers. This setup allows us to use an autoregressive model to generate and score distinctive ngrams, that are then mapped to full passages through an efficient data structure. Empirically, we show this not only outperforms prior autoregressive approaches but also leads to an average improvement of at least 10 points over more established retrieval solutions for passage-level retrieval on the KILT benchmark, establishing new state-of-the-art downstream performance on some datasets, while using a considerably lighter memory footprint than competing systems.

autoregressive search engine, generating substring, name change, (4 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.82)

Neural Information Processing SystemsSep-26-2025, 14:12:15 GMT

cd88d62a2063fdaf7ce6f9068fb15dcd-Paper-Conference.pdf

computational linguistic, information retrieval, large language model, (18 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Energy > Oil & Gas (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Information Management (0.93)
(3 more...)

Neural Information Processing SystemsAug-19-2025, 00:22:00 GMT

Autoregressive Search Engines: Generating Substrings as Document Identifiers Michele Bevilacqua 1,2 Giuseppe Ottaviano 2 Patrick Lewis 2 Wen-tau Yih 2

Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus.

computational linguistic, large language model, machine learning, (19 more...)

Country:

North America > Canada (0.04)
North America > Dominican Republic (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

arXiv.org Artificial IntelligenceAug-1-2025

Holistic Evaluations of Topic Models

Compton, Thomas

Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a 'black box', where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models

machine learning, natural language, topic model, (21 more...)

2507.23364

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)

Khachaturov, David, Mullins, Robert, Shumailov, Ilia, Dathathri, Sumanth

Watermarking Needs Input Repetition Masking

arXiv.org Artificial IntelligenceApr-17-2025

Recent advancements in Large Language Models (LLMs) raised concerns over potential misuse, such as for spreading misinformation. In response two counter measures emerged: machine learning-based detectors that predict if text is synthetic, and LLM watermarking, which subtly marks generated text for identification and attribution. Meanwhile, humans are known to adjust language to their conversational partners both syntactically and lexically. By implication, it is possible that humans or unwatermarked LLMs could unintentionally mimic properties of LLM generated text, making counter measures unreliable. In this work we investigate the extent to which such conversational adaptation happens. We call the concept $\textit{mimicry}$ and demonstrate that both humans and LLMs end up mimicking, including the watermarking signal even in seemingly improbable settings. This challenges current academic assumptions and suggests that for long-term watermarking to be reliable, the likelihood of false positives needs to be significantly lower, while longer word sequences should be used for seeding watermarking mechanisms.

large language model, machine learning, natural language, (20 more...)

2504.12229

Country: Asia (0.46)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsJan-18-2025, 22:10:42 GMT

Autoregressive Search Engines: Generating Substrings as Document Identifiers

autoregressive search engine, document identifier, generating substring, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.86)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)

arXiv.org Artificial IntelligenceNov-7-2024

Analyzing The Language of Visual Tokens

Chan, David M., Corona, Rodolfo, Park, Joonyong, Cho, Cheol Jun, Bai, Yutong, Darrell, Trevor

With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages - whether they follow similar frequency distributions, grammatical structures, or topologies as natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities and fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts, indicating intermediate granularity. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages. Finally, we demonstrate that, while vision models align more closely with natural languages than other models, this alignment remains significantly weaker than the cohesion found within natural languages. Through these experiments, we demonstrate how understanding the statistical properties of discrete visual languages can inform the design of more effective computer vision models.

frequency, natural language, vq-vae, (15 more...)

2411.05001

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(6 more...)

Genre: Research Report > New Finding (0.67)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Levy, Avivit, Shalom, B. Riva, Chalamish, Michal

A Guide to Similarity Measures

arXiv.org Artificial IntelligenceAug-7-2024

Similarity measures play a central role in various data science application domains for a wide assortment of tasks. This guide describes a comprehensive set of prevalent similarity measures to serve both non-experts and professional. Non-experts that wish to understand the motivation for a measure as well as how to use it may find a friendly and detailed exposition of the formulas of the measures, whereas experts may find a glance to the principles of designing similarity measures and ideas for a better way to measure similarity for their desired task in a given application domain.

distance measure, similarity, vector, (17 more...)

2408.07706

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Florida > Hillsborough County > Tampa (0.04)
(2 more...)

Genre:

Overview (0.67)
Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.92)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)