Collaborating Authors: Lin, Jimmy


GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

arXiv.org Artificial Intelligence

Noticing the urgent need for tools that support fast and user-friendly qualitative analysis of the large-scale textual corpora used in modern NLP, we propose to turn to mature and well-tested methods from the domain of Information Retrieval (IR) - a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini - a widely used toolkit for reproducible IR research - can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We leverage the existing functionalities of both platforms while proposing novel features that further facilitate their integration. Our goal is to give NLP researchers tools that allow them to develop retrieval-based instrumentation for their data analytics needs with ease and agility. We include a Jupyter Notebook-based walkthrough of the core interoperability features, available on GitHub at https://github.com/huggingface/gaia. We then demonstrate how the ideas we present can be operationalized to create a powerful tool for qualitative data analysis in NLP. We present GAIA Search - a search engine built following the principles laid out above, giving access to four popular large-scale text collections. GAIA serves a dual purpose: it illustrates the potential of the methodologies we discuss, and it works as a standalone qualitative analysis tool that NLP researchers can leverage to understand datasets prior to using them in training. GAIA is hosted live on Hugging Face Spaces - https://huggingface.co/spaces/spacerini/gaia.
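
As context for how such an integration looks in code, here is a minimal sketch of the Pyserini search API that the paper builds on; the prebuilt index name and query are illustrative examples, not one of the GAIA collections.

```python
from pyserini.search.lucene import LuceneSearcher

# Minimal sketch: search a prebuilt Lucene index with Pyserini.
# 'robust04' is an example from Pyserini's prebuilt-index catalog.
searcher = LuceneSearcher.from_prebuilt_index('robust04')
hits = searcher.search('training data quality')

for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid:15} {hit.score:.4f}')
```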


Regex-augmented Domain Transfer Topic Classification based on a Pre-trained Language Model: An application in Financial Domain

arXiv.org Artificial Intelligence

A common way to use large pre-trained language models for downstream tasks is to fine-tune them with additional layers. This may not work well when the downstream domain is specialized while the large language model has been pre-trained on a generic corpus. In this paper, we discuss the use of regular expression patterns as features that inject domain knowledge during fine-tuning, in addition to domain-specific text. Our experiments on real production data show that this method of fine-tuning improves downstream text classification compared to fine-tuning on domain-specific text alone. We also show that using an attention network for fine-tuning improves results compared to simple linear layers.
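
A minimal sketch of what "regex patterns as features" can look like, assuming the simplest variant where binary match indicators are concatenated to the encoder's [CLS] embedding; the patterns, model name, and label count are placeholders, and the paper's attention-based variant is not shown.

```python
import re
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Hypothetical financial-domain patterns; the paper's actual regexes differ.
PATTERNS = [re.compile(p) for p in (r'\b(?:USD|EUR)\s?\d', r'\bQ[1-4]\s?20\d{2}\b')]

def regex_features(texts):
    # One binary feature per pattern: does the pattern match the text?
    return torch.tensor([[float(bool(p.search(t))) for p in PATTERNS]
                         for t in texts])

class RegexAugmentedClassifier(nn.Module):
    def __init__(self, model_name='bert-base-uncased', num_labels=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # [CLS] embedding concatenated with the regex match indicators.
        self.head = nn.Linear(hidden + len(PATTERNS), num_labels)

    def forward(self, input_ids, attention_mask, regex_feats):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(torch.cat([cls, regex_feats], dim=-1))

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
texts = ['Revenue rose to USD 3.2B in Q2 2023.']
batch = tok(texts, return_tensors='pt', padding=True, truncation=True)
logits = RegexAugmentedClassifier()(batch.input_ids, batch.attention_mask,
                                    regex_features(texts))
```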


How Does Generative Retrieval Scale to Millions of Passages?

arXiv.org Artificial Intelligence

Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.
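
The core mechanism - decoding only strings that are valid document identifiers - can be sketched with Hugging Face's constrained-generation hook; the model, docids, and query below are toy examples, and a real system would first be trained on (synthetic query, docid) pairs as the paper describes.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

docids = ['doc-00017', 'doc-00042', 'doc-93121']  # toy identifier space

# Trie over tokenized docids so generation can only follow valid prefixes.
trie = {}
for d in docids:
    node = trie
    for t in tok(d, add_special_tokens=False).input_ids + [tok.eos_token_id]:
        node = node.setdefault(t, {})

def allowed_tokens(batch_id, prefix):
    node = trie
    for t in prefix.tolist()[1:]:  # skip the decoder start token
        node = node.get(t, {})
    return list(node.keys()) or [tok.eos_token_id]

out = model.generate(**tok('what year was the treaty signed', return_tensors='pt'),
                     prefix_allowed_tokens_fn=allowed_tokens,
                     num_beams=3, num_return_sequences=3)
print(tok.batch_decode(out, skip_special_tokens=True))  # ranked docids
```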


SmartProbe: A Virtual Moderator for Market Research Surveys

arXiv.org Artificial Intelligence

Market research surveys are a powerful methodology for understanding consumer perspectives at scale, but they are limited in the depth of understanding and insight they provide. A virtual moderator can introduce elements of qualitative research into surveys, developing a rapport with survey participants and dynamically asking probing questions, ultimately eliciting more useful information for market researchers. In this work, we introduce SmartProbe, an API which leverages the adaptive capabilities of large language models (LLMs) and incorporates domain knowledge from market research in order to generate effective probing questions in any market research survey. We outline the modular processing flow of SmartProbe and evaluate the quality and effectiveness of its generated probing questions. We believe our efforts will inspire industry practitioners to build real-world applications based on the latest advances in LLMs. Our demo is publicly available at https://nexxt.in/smartprobe-demo.
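
SmartProbe's internals are not spelled out in the abstract, but the basic pattern - prompting an LLM with the survey question and a participant's answer to get one follow-up - can be sketched as follows; the client, model name, and prompt wording are assumptions for illustration.

```python
from openai import OpenAI  # any chat-completion client would work here

client = OpenAI()

def probing_question(survey_question: str, answer: str) -> str:
    """Ask an LLM, acting as a survey moderator, for one follow-up question."""
    prompt = (
        'You are a market-research moderator. Given the survey question and '
        'the participant answer below, ask one short, neutral probing question '
        'that elicits concrete details.\n\n'
        f'Survey question: {survey_question}\n'
        f'Answer: {answer}\n'
        'Probing question:'
    )
    resp = client.chat.completions.create(
        model='gpt-4o-mini',  # placeholder model name
        messages=[{'role': 'user', 'content': prompt}],
    )
    return resp.choices[0].message.content.strip()

print(probing_question('How do you feel about brand X?', "It's fine, I guess."))
```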


Zero-Shot Listwise Document Reranking with a Large Language Model

arXiv.org Artificial Intelligence

Supervised ranking methods based on bi-encoder or cross-encoder architectures have shown success in multi-stage text ranking tasks, but they require large amounts of relevance judgments as training data. In this work, we propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data. Different from existing pointwise ranking methods, where documents are scored independently and ranked according to those scores, LRL directly generates a reordered list of document identifiers given the candidate documents. Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as an efficient final-stage reranker that improves the top-ranked results of a pointwise method. Additionally, we apply our approach to subsets of MIRACL, a recent multilingual retrieval dataset, with results showing its potential to generalize across different languages.
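
The listwise scheme can be sketched as a prompt builder plus a parser for the generated identifier ordering; the exact prompt template and identifier format in LRL may differ from the assumptions below.

```python
import re

def listwise_prompt(query, passages):
    """Ask the LLM for passage identifiers in decreasing order of relevance."""
    numbered = '\n'.join(f'[{i + 1}] {p}' for i, p in enumerate(passages))
    return ('Rank the following passages by relevance to the query.\n'
            f'Query: {query}\n\n{numbered}\n\n'
            'Output the identifiers in decreasing order of relevance, '
            'e.g. [2] > [3] > [1]:')

def parse_ranking(generation, num_passages):
    """Turn '[2] > [3] > [1]' into 0-based indices, repairing omissions."""
    seen, ranking = set(), []
    for m in re.findall(r'\[(\d+)\]', generation):
        i = int(m) - 1
        if 0 <= i < num_passages and i not in seen:
            seen.add(i)
            ranking.append(i)
    # Append any identifiers the model skipped, keeping their original order.
    ranking += [i for i in range(num_passages) if i not in seen]
    return ranking
```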


AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation

arXiv.org Artificial Intelligence

This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval. While vision-language pretrained transformers have led to significant improvements in retrieval effectiveness, existing research has relied on image-caption datasets that feature only simplistic image-text relationships and underspecified user models of retrieval tasks. To address the gap between these oversimplified settings and real-world applications for multimedia content creation, we introduce a new approach for building retrieval test collections. We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia. We formulate two tasks based on a realistic user model and validate our dataset through retrieval experiments using baseline models. AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research. Finally, the dataset provides the basis for a dedicated track at the 2023 Text Retrieval Conference (TREC), and is publicly available at https://github.com/TREC-AToMiC/AToMiC.


Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

arXiv.org Artificial Intelligence

The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), the task of searching documents in one language with queries from another. However, the rapid pace of progress has led to a confusing panoply of methods, and reproducibility has lagged behind the state of the art. In this context, our work makes two important contributions: first, we provide a conceptual framework for organizing different approaches to cross-lingual retrieval, using multi-stage architectures for monolingual retrieval as a scaffold; second, we implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese. Our efforts build on a collaboration of the two teams that submitted the most effective runs to the TREC evaluation. These contributions provide a firm foundation for future advances.
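
As a flavor of what such baselines look like in practice, here is a sketch of a BM25 document-translation run with Pyserini; the prebuilt index name below is hypothetical, so consult Pyserini's prebuilt-index listing for the actual NeuCLIR identifiers.

```python
from pyserini.search.lucene import LuceneSearcher

# Document-translation CLIR baseline: query in English against an index of
# documents machine-translated into English. The index name is a placeholder.
searcher = LuceneSearcher.from_prebuilt_index('neuclir22-fa-en')
searcher.set_bm25(k1=0.9, b=0.4)

hits = searcher.search('space debris removal missions', k=100)
for hit in hits[:5]:
    print(hit.docid, round(hit.score, 3))
```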


Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face

arXiv.org Artificial Intelligence

We present Spacerini, a modular framework for the seamless building and deployment of interactive search applications, designed to facilitate the qualitative analysis of large-scale research datasets. Spacerini integrates features from both the Pyserini toolkit and the Hugging Face ecosystem to ease the indexing of text collections and their deployment as search engines for ad hoc exploration, making the retrieval of relevant data points quick and efficient. The user-friendly interface enables searching through massive datasets in a no-code fashion, making Spacerini broadly accessible to anyone looking to qualitatively audit their text collections. This is useful both to IR researchers aiming to demonstrate the capabilities of their indexes in a simple and interactive way, and to NLP researchers looking to better understand and audit the failure modes of large language models. The framework is open source and available on GitHub at https://github.com/castorini/hf-spacerini, and includes utilities to load, pre-process, index, and deploy local and web search applications. A portfolio of applications created with Spacerini for a multitude of use cases can be found at https://hf.co/spacerini.
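
Written out by hand, the flow Spacerini automates looks roughly like the sketch below: pull a Hugging Face dataset, serialize it to Pyserini's JsonCollection format, and build a Lucene index. The dataset and field names are example choices, and Spacerini's own helper functions are not shown.

```python
import json
import os
import subprocess
from datasets import load_dataset

# Example dataset; any Hugging Face dataset with a text field would do.
ds = load_dataset('ag_news', split='train')

os.makedirs('corpus', exist_ok=True)
with open('corpus/docs.jsonl', 'w') as f:
    for i, row in enumerate(ds):
        f.write(json.dumps({'id': str(i), 'contents': row['text']}) + '\n')

# Standard Pyserini indexing invocation for a JsonCollection.
subprocess.run([
    'python', '-m', 'pyserini.index.lucene',
    '--collection', 'JsonCollection',
    '--input', 'corpus',
    '--index', 'indexes/ag_news',
    '--generator', 'DefaultLuceneDocumentGenerator',
    '--threads', '4',
    '--storeRaw',
], check=True)
```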


How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

arXiv.org Artificial Intelligence

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue is due to limited model capacity. We counter this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs under the framework of data augmentation (DA). Our study shows that common DA practices, such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations, and it even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).
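
The backbone of such contrastive training is an in-batch-negatives InfoNCE loss over query and document embeddings; a minimal sketch follows, with the progressive multi-source schedule indicated only in comments, since the paper's exact recipe is not reproduced here.

```python
import torch
import torch.nn.functional as F

def infonce_loss(q_emb, d_emb, temperature=0.05):
    """In-batch negatives: each query's positive is the same-index document."""
    scores = q_emb @ d_emb.T / temperature                # (B, B) similarities
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, labels)

# Illustrative progressive schedule over diverse augmentation sources
# (e.g., human queries, cropped sentences, generated pseudo-queries):
# for stage, sources in enumerate(accumulating_sources):
#     for q_emb, d_emb in encode_batches(sources):
#         loss = infonce_loss(q_emb, d_emb)
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```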


Improving Out-of-Distribution Generalization of Neural Rerankers with Contextualized Late Interaction

arXiv.org Artificial Intelligence

Recent progress in information retrieval finds that encoding queries and documents as multiple vectors yields bi-encoder retrievers that are robust on out-of-distribution datasets. In this paper, we explore whether late interaction, the simplest form of multi-vector interaction, is also helpful to neural rerankers that only use the [CLS] vector to compute the similarity score. Although, intuitively, the attention mechanism in the reranker's earlier layers already gathers token-level information, we find that adding late interaction still brings an extra 5% improvement on average on out-of-distribution datasets, with little increase in latency and no degradation in in-domain effectiveness. Through extensive experiments and analysis, we show that the finding is consistent across different model sizes and first-stage retrievers of diverse natures, and that the improvement is more prominent on longer queries.
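
The late-interaction term itself is a ColBERT-style MaxSim over token embeddings; below is a sketch of how it could be added on top of a [CLS] reranker score, where the combination weight and tensor shapes are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def maxsim(q_tok, d_tok, q_mask, d_mask):
    """For each query token, take its best-matching document token, then sum.
    q_tok: (B, Lq, H), d_tok: (B, Ld, H); masks are 1 for real tokens."""
    sim = q_tok @ d_tok.transpose(-1, -2)                 # (B, Lq, Ld)
    sim = sim.masked_fill(d_mask[:, None, :] == 0, -1e4)  # ignore padding
    best = sim.max(dim=-1).values                         # (B, Lq)
    return (best * q_mask).sum(dim=-1)                    # (B,)

def rerank_score(cls_score, q_tok, d_tok, q_mask, d_mask, alpha=0.5):
    # Combine the usual [CLS] head with the late-interaction term;
    # alpha is an illustrative weight, not the paper's.
    return cls_score + alpha * maxsim(q_tok, d_tok, q_mask, d_mask)
```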