Goto

Collaborating Authors

 dieng


Vendi Information Gain for Active Learning and its Application to Ecology

Nguyen, Quan, Dieng, Adji Bousso

arXiv.org Artificial Intelligence

While monitoring biodiversity through camera traps has become an important endeavor for ecological research, identifying species in the captured image data remains a major bottleneck due to limited labeling resources. Active learning -- a machine learning paradigm that selects the most informative data to label and train a predictive model -- offers a promising solution, but typically focuses on uncertainty in the individual predictions without considering uncertainty across the entire dataset. We introduce a new active learning policy, Vendi information gain (VIG), that selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. We applied VIG to the Snapshot Serengeti dataset and compared it against common active learning methods. VIG needs only 3% of the available data to reach 75% accuracy, a level that baselines require more than 10% of the data to achieve. With 10% of the data, VIG attains 88% predictive accuracy, 12% higher than the best of the baselines. This improvement in performance is consistent across metrics and batch sizes, and we show that VIG also collects more diverse data in the feature space. VIG has broad applicability beyond ecology, and our results highlight its value for biodiversity monitoring in data-limited environments.


Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs

Rezaei, Mohammad Reza, Dieng, Adji Bousso

arXiv.org Artificial Intelligence

Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires connecting information from multiple sources. This paper introduces Vendi-RAG, a framework based on an iterative process that jointly optimizes retrieval diversity and answer quality. This joint optimization leads to significantly higher accuracy for multi-hop QA tasks. Vendi-RAG leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to promote semantic diversity in document retrieval. It then uses an LLM judge that evaluates candidate answers, generated after a reasoning step, and outputs a score that the retriever uses to balance relevance and diversity among the retrieved documents during each iteration. Experiments on three challenging datasets -- HotpotQA, MuSiQue, and 2WikiMultiHopQA -- demonstrate Vendi-RAG's effectiveness in multi-hop reasoning tasks. The framework achieves significant accuracy improvements over traditional single-step and multi-step RAG approaches, with accuracy increases reaching up to +4.2% on HotpotQA, +4.1% on 2WikiMultiHopQA, and +1.3% on MuSiQue compared to Adaptive-RAG, the current best baseline. The benefits of Vendi-RAG are even more pronounced as the number of retrieved documents increases. Finally, we evaluated Vendi-RAG across different LLM backbones, including GPT-3.5, GPT-4, and GPT-4o-mini, and observed consistent improvements, demonstrating that the framework's advantages are model-agnostic.


The Vendiscope: An Algorithmic Microscope For Data Collections

Pasarkar, Amey P., Dieng, Adji Bousso

arXiv.org Artificial Intelligence

The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores -- a family of differentiable diversity metrics rooted in ecology and quantum mechanics -- and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the $250$ million protein sequences in the protein universe, discovering that over $200$ million are near-duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than $85\%$ of the crystals with formation energy data are near-duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from $13$ different generative models and found that the best-performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data-driven science.


The $\alpha$-Alternator: Dynamic Adaptation To Varying Noise Levels In Sequences Using The Vendi Score For Improved Robustness and Performance

Rezaei, Mohammad Reza, Dieng, Adji Bousso

arXiv.org Machine Learning

Current state-of-the-art dynamical models, such as Mamba, assume the same level of noisiness for all elements of a given sequence, which limits their performance on noisy temporal data. In this paper, we introduce the $\alpha$-Alternator, a novel generative model for time-dependent data that dynamically adapts to the complexity introduced by varying noise levels in sequences. The $\alpha$-Alternator leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to adjust, at each time step $t$, the influence of the sequence element at time $t$ and the latent representation of the dynamics up to that time step on the predicted future dynamics. This influence is captured by a parameter that is learned and shared across all sequences in a given dataset. The sign of this parameter determines the direction of influence. A negative value indicates a noisy dataset, where a sequence element that increases the VS is considered noisy, and the model relies more on the latent history when processing that element. Conversely, when the parameter is positive, a sequence element that increases the VS is considered informative, and the $\alpha$-Alternator relies more on this new input than on the latent history when updating its predicted latent dynamics. The $\alpha$-Alternator is trained using a combination of observation masking and Alternator loss minimization. Masking simulates varying noise levels in sequences, enabling the model to be more robust to these fluctuations and improving its performance in trajectory prediction, imputation, and forecasting. Our experimental results demonstrate that the $\alpha$-Alternator outperforms both Alternators and state-of-the-art state-space models across neural decoding and time-series forecasting benchmarks.


Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation

Yoo, YoungJoon, Choi, Jongwon

arXiv.org Artificial Intelligence

This paper introduces a novel approach for topic modeling utilizing latent codebooks from Vector-Quantized Variational Auto-Encoder~(VQ-VAE), discretely encapsulating the rich information of the pre-trained embeddings such as the pre-trained language model. From the novel interpretation of the latent codebooks and embeddings as conceptual bag-of-words, we propose a new generative topic model called Topic-VQ-VAE~(TVQ-VAE) which inversely generates the original documents related to the respective latent codebook. The TVQ-VAE can visualize the topics with various generative distributions including the traditional BoW distribution and the autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures the topic context which reveals the underlying structures of the dataset and supports flexible forms of document generation. Official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.


Neural Dynamic Focused Topic Model

Cvejoski, Kostadin, Sánchez, Ramsés J., Ojeda, César

arXiv.org Artificial Intelligence

Topic models and all their variants analyse text by learning meaningful representations through word co-occurrences. As pointed out by Williamson et al. (2010), such models implicitly assume that the probability of a topic to be active and its proportion within each document are positively correlated. This correlation can be strongly detrimental in the case of documents created over time, simply because recent documents are likely better described by new and hence rare topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the dynamic Focused Topic Model. Indeed, we develop a neural model for topic evolution which exploits sequences of Bernoulli random variables in order to track the appearances of topics, thereby decoupling their activities from their proportions. We evaluate our model on three different datasets (the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset) and show that it (i) outperforms state-of-the-art topic models in generalization tasks and (ii) performs comparably to them on prediction tasks, while employing roughly the same number of parameters, and converging about two times faster. Source code to reproduce our experiments is available online.


The AI Outlook

#artificialintelligence

Five global AI experts weigh in on the challenges they anticipate the technology will face, and potentially overcome, this year. Advances in artificial intelligence (AI) continue to happen at an extraordinary pace. News of breakthroughs in machine learning, computer vision, data science, and machine-human interaction come almost daily. Growth is massive, and it is impacting all sectors. According to a new forecast from the technology research company Gartner, worldwide artificial intelligence (AI) software revenue is forecast to total $62.5 billion in 2022, an increase of 21.3% from 2021.


This AI Expert From Senegal Is Helping Showcase Africans In STEM

#artificialintelligence

Not only has Adji Bousso Dieng, an AI researcher from Senegal, contributed to the field of generative modeling and about to become one of the first black female faculty in Computer Science in the Ivy League, she is also helping Africans in STEM tell their own success stories. Dieng, who is currently a researcher at Google and an incoming computer science faculty at Princeton, works in an area of Artificial Intelligence called generative modeling. "It allows you to learn from data without needing any supervision," she said, "Generative models have many real-world applications with regard to natural language processing, computer vision, healthcare, robotics, and in a range of sciences." In addition to this, Dieng started The Africa I Know (TAIK), a platform that showcases Africans who've had successful careers; highlight how Africans are leveraging technology to solve developmental problems –in agriculture, health and education– and narrate African history as told by Africans. "I founded TAIK to unearth the success stories of Africa and its people and to foster an economic and social consciousness in Africa," she said, adding that TAIK's volunteers are a group of eager and young Africans coming from every region of the continent and that the content is in both English and French.