Mayfield, James
On the Evaluation of Machine-Generated Reports
Mayfield, James, Yang, Eugene, Lawrie, Dawn, MacAvaney, Sean, McNamee, Paul, Oard, Douglas W., Soldaini, Luca, Soboroff, Ian, Weller, Orion, Kayi, Efsun, Sanders, Kate, Mason, Marc, Hibbler, Noah
Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision for automatic report generation, and -- critically -- a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of an information need, stating the necessary background, requirements, and scope of the report. Further, the generated reports should be complete, accurate, and verifiable. These qualities, which are desirable -- if not required -- in many analytic report-writing settings, demand a rethinking of how such systems are built and evaluated. To foster new efforts in building these systems, we present an evaluation framework that draws on ideas from a range of existing evaluations. To test completeness and accuracy, the framework uses nuggets of information, expressed as questions and answers, that need to be part of any high-quality generated report. Additionally, evaluation of the citations that map claims made in the report to their source documents ensures verifiability.
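As a rough illustration of the nugget-and-citation evaluation the abstract describes, the sketch below scores a report for completeness and verifiability. The data structures, the string-containment answer check, and the token-overlap supports() function are illustrative assumptions standing in for the real answer-matching and attribution components such a framework would use.

from dataclasses import dataclass, field

@dataclass
class Nugget:
    question: str        # the information a good report must contain
    answers: set[str]    # acceptable answer strings, pre-normalized

@dataclass
class Sentence:
    text: str
    citations: list[str] = field(default_factory=list)  # cited document ids

def supports(doc_text: str, claim: str) -> bool:
    """Placeholder for an entailment/attribution model: token-overlap heuristic."""
    doc_tokens, claim_tokens = set(doc_text.lower().split()), set(claim.lower().split())
    return len(claim_tokens & doc_tokens) >= 0.5 * len(claim_tokens)

def evaluate(report: list[Sentence], nuggets: list[Nugget],
             corpus: dict[str, str]) -> dict[str, float]:
    # Completeness/accuracy: share of nuggets answered somewhere in the report.
    hit = lambda n: any(a.lower() in s.text.lower() for s in report for a in n.answers)
    completeness = sum(map(hit, nuggets)) / max(len(nuggets), 1)

    # Verifiability: share of sentences whose cited documents support them.
    verifiable = sum(
        1 for s in report
        if s.citations and any(supports(corpus.get(c, ""), s.text) for c in s.citations)
    ) / max(len(report), 1)
    return {"completeness": completeness, "verifiability": verifiable}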
PLAID SHIRTTT for Large-Scale Streaming Dense Retrieval
Lawrie, Dawn, Kayi, Efsun, Yang, Eugene, Mayfield, James, Oard, Douglas W.
PLAID, an efficient implementation of the ColBERT late interaction bi-encoder using pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval. PLAID differs from ColBERT by assigning terms to clusters and representing those terms as cluster centroids plus compressed residual vectors. While PLAID is effective in batch experiments, its performance degrades in streaming settings where documents arrive over time, because representations of new tokens may be poorly modeled by the earlier tokens used to select cluster centroids. PLAID Streaming Hierarchical Indexing that Runs on Terabytes of Temporal Text (PLAID SHIRTTT) addresses this concern using multi-phase incremental indexing based on hierarchical sharding. Experiments on ClueWeb09, the largest collection indexed to date by the ColBERT architecture, and on the multilingual NeuCLIR collection demonstrate the effectiveness of this approach in both large-scale and multilingual settings.
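The centroid-plus-compressed-residual representation can be sketched roughly as follows. The centroids are assumed to have been fit elsewhere (e.g., by k-means over earlier token vectors), and the uniform 8-bit residual quantization is an illustrative assumption rather than PLAID's exact compression scheme.

import numpy as np

def assign_and_compress(token_vecs: np.ndarray, centroids: np.ndarray):
    """Map each token vector to its nearest centroid and store a quantized residual."""
    # Nearest centroid per token (squared Euclidean distance).
    d2 = ((token_vecs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)
    residuals = token_vecs - centroids[ids]
    # Uniform 8-bit quantization of residuals (a stand-in for PLAID's compression).
    scale = float(np.abs(residuals).max()) or 1.0
    codes = np.round(residuals / scale * 127).astype(np.int8)
    return ids, codes, scale

def decompress(ids, codes, scale, centroids):
    return centroids[ids] + codes.astype(np.float32) / 127 * scale

# Streaming concern: if the centroids were fit on old shards, token vectors from
# newly arriving documents may sit far from every centroid, inflating residuals;
# this is the motivation for PLAID SHIRTTT's hierarchical incremental indexing.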
HLTCOE at TREC 2023 NeuCLIR Track
Yang, Eugene, Lawrie, Dawn, Mayfield, James
The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track. For PLAID we included a variety of models and training techniques: the English model released with ColBERT v2, translate-train (TT), Translate Distill (TD), and multilingual translate-train (MTT). TT trains a ColBERT model with English queries and passages from the MS MARCO v1 collection automatically translated into the document language, yielding three cross-language models for the track, one per language. MTT creates a single model for all three document languages by combining the translations of MS MARCO passages in all three languages into mixed-language batches, so the model learns to match queries to passages in all languages simultaneously. Distillation uses scores from the mT5 model over non-English translated document pairs to teach the student model how to score query-document pairs. The team submitted runs to all NeuCLIR tasks: the CLIR and MLIR news tasks as well as the technical documents task.
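The mixed-language batching behind MTT can be sketched as follows; the file format, language codes, and training step are illustrative assumptions, not the team's actual pipeline.

import random
from itertools import chain

def load_triples(path: str, lang: str):
    """Yield (query, positive_passage, negative_passage, lang) tuples from a TSV file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            q, pos, neg = line.rstrip("\n").split("\t")
            yield q, pos, neg, lang

def mixed_language_batches(paths_by_lang: dict[str, str], batch_size: int = 32):
    """Interleave translated MS MARCO triples from all document languages."""
    triples = list(chain.from_iterable(
        load_triples(p, lang) for lang, p in paths_by_lang.items()))
    random.shuffle(triples)                      # so each batch mixes languages
    for i in range(0, len(triples), batch_size):
        yield triples[i:i + batch_size]

# Example: batches drawn jointly from Chinese, Persian, and Russian translations.
# for batch in mixed_language_batches({"zho": "msmarco.zho.tsv",
#                                      "fas": "msmarco.fas.tsv",
#                                      "rus": "msmarco.rus.tsv"}):
#     train_step(batch)   # hypothetical training step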
Extending Translate-Train for ColBERT-X to African Language CLIR
Yang, Eugene, Lawrie, Dawn J., McNamee, Paul, Mayfield, James
This paper describes the submission runs from the HLTCOE team at the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure with a similar training setting.
Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation
Yang, Eugene, Lawrie, Dawn, Mayfield, James, Oard, Douglas W., Miller, Scott
Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging due to the lack of a sufficiently large training collection when the query and document languages differ. The state of the art for CLIR thus relies on translating queries, documents, or both from the large English MS MARCO training set, an approach called Translate-Train. This paper proposes an alternative, Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model. This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR. Trained models and artifacts are publicly available on Huggingface.
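A minimal PyTorch sketch of the distillation step might look like the following: a cross-encoder teacher scores (query, passage) pairs and a dual-encoder student is trained to reproduce those scores. The KL-divergence-over-softmax objective is one common choice and is an assumption here, not necessarily the exact loss used in Translate-Distill.

import torch
import torch.nn.functional as F

def distill_loss(student_q: torch.Tensor,      # [B, D] query embeddings
                 student_p: torch.Tensor,      # [B, N, D] candidate passage embeddings
                 teacher_scores: torch.Tensor  # [B, N] cross-encoder scores
                 ) -> torch.Tensor:
    # Student scores: dot product between each query and its candidate passages.
    student_scores = torch.einsum("bd,bnd->bn", student_q, student_p)
    # Match the student's score distribution to the teacher's.
    return F.kl_div(F.log_softmax(student_scores, dim=-1),
                    F.softmax(teacher_scores, dim=-1),
                    reduction="batchmean")

# The teacher can score English (query, passage) text while the student sees the
# query in one language and the passage translated into another, which is what
# lets the teacher perform inference in an optimized setting.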
Neural Approaches to Multilingual Information Retrieval
Lawrie, Dawn, Yang, Eugene, Oard, Douglas W., Mayfield, James
Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross Language IR (CLIR) where queries are expressed in one language and documents in another, the multilingual (MLIR) task to create a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether advances in neural document translation and pretrained multilingual neural language models enable improvements in the state of the art over earlier MLIR techniques. The results show that although combining neural document translation with neural ranking yields the best Mean Average Precision (MAP), 98% of that MAP score can be achieved with an 84% reduction in indexing time by using a pretrained XLM-R multilingual language model to index documents in their native language, and that 2% difference in effectiveness is not statistically significant. Key to achieving these results for MLIR is to fine-tune XLM-R using mixed-language batches from neural translations of MS MARCO passages.
Parameter-efficient Zero-shot Transfer for Cross-Language Dense Retrieval with Adapters
Yang, Eugene, Nair, Suraj, Lawrie, Dawn, Mayfield, James, Oard, Douglas W.
A popular approach to creating a zero-shot cross-language retrieval model is to substitute a multilingual pretrained language model, such as Multilingual BERT, for the monolingual pretrained language model in the retrieval model. This multilingual model is fine-tuned for the retrieval task with monolingual data such as English MS MARCO, using the same training recipe as the monolingual retrieval model. However, such transferred models suffer from mismatches between the languages of the input text seen during training and at inference. In this work, we propose transferring monolingual retrieval models using adapters, a parameter-efficient component for a transformer network. By combining language adapters pretrained on a specific language with task-specific adapters, prior work has shown that adapter-enhanced models perform better than fine-tuning the entire model when transferring across languages in various NLP tasks. By constructing dense retrieval models with adapters, we show that models trained with monolingual data are more effective than fine-tuning the entire model when transferring to a Cross-Language Information Retrieval (CLIR) setting. However, we found that the prior suggestion of replacing the language adapter to match the target language at inference time is suboptimal for dense retrieval models. We provide an in-depth analysis of this discrepancy between CLIR and other cross-language NLP tasks.
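A bottleneck adapter of the kind referenced above can be sketched in PyTorch as follows; the hidden and bottleneck sizes and the placement after a frozen sub-layer are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted after a frozen transformer sub-layer."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained representation intact.
        return hidden + self.up(self.act(self.down(hidden)))

# In adapter-based transfer, only these small modules (and, in the language-adapter
# setup, per-language adapters) are trained; the multilingual encoder's original
# weights stay frozen, which is what makes the transfer parameter-efficient.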
Gazetteer Generation for Neural Named Entity Recognition
Song, Chan Hee (University of Notre Dame) | Lawrie, Dawn (Johns Hopkins University) | Finin, Tim (University of Maryland Baltimore County) | Mayfield, James (Johns Hopkins University)
We present a way to generate gazetteers from the Wikidata knowledge graph and use the lists to improve a neural NER system by adding an input feature indicating that a word is part of a name in the gazetteer. We empirically show that the approach yields performance gains in two distinct languages: English, a high-resource, word-based language, and Chinese, a high-resource, character-based language. We apply the approach to a low-resource language, Russian, using a new annotated Russian NER corpus from Reddit tagged with four core and eleven extended types, and show a baseline score.
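The gazetteer input feature can be sketched as follows; the one-name-per-line file format and the simple span-matching scheme are illustrative assumptions.

def load_gazetteer(path: str) -> set[str]:
    """One name per line, e.g. exported from Wikidata labels and aliases."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def gazetteer_features(tokens: list[str], gazetteer: set[str],
                       max_len: int = 4) -> list[int]:
    """Mark a token 1 if it falls inside any gazetteer name span (up to max_len tokens)."""
    feats = [0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            if " ".join(lowered[i:j]) in gazetteer:
                for k in range(i, j):
                    feats[k] = 1
    return feats

# The resulting 0/1 vector is concatenated to each token's embedding (or added as
# an extra embedding) before the NER tagger's encoder.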
High Recall Text Classification for Public Health Systematic Review
McNamee, Paul (Johns Hopkins University) | Mayfield, James (Johns Hopkins University) | Rowe, Samantha Y. (U.S. Centers for Disease Control and Prevention) | Rowe, Alexander K. (U.S. Centers for Disease Control and Prevention) | Jackson, Hannah L. (U.S. Centers for Disease Control and Prevention) | Baker, Megan (Johns Hopkins University)
Some information retrieval applications demand manageable levels of precision at high levels of recall. Examples include e-discovery, patent search, and systematic review. In this paper we present a real-world case study supporting a broad topic systematic review in the public health domain. We provide experimental results that demonstrate how retrieval performance on bibliographic citations can be materially improved. We attained an average precision of 0.57 and recall approaching 80% at a very reasonable screening depth. These results represent 18% and 23% relative gains over a baseline classifier. We also address pragmatic issues that arise when working on “noisy” real-world data, such as coping with citation records that often have empty fields.
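The screening-depth framing can be made concrete with a small sketch that ranks citations by classifier score and reports recall and precision at a fixed review depth; the function and variable names are illustrative assumptions, not the study's actual tooling.

def recall_precision_at_depth(scored: list[tuple[str, float]],
                              relevant: set[str],
                              depth: int) -> tuple[float, float]:
    """scored: (citation_id, classifier_score) pairs; depth: how many records are screened."""
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)[:depth]
    found = sum(1 for cid, _ in ranked if cid in relevant)
    recall = found / len(relevant) if relevant else 0.0
    precision = found / depth if depth else 0.0
    return recall, precision

# Choosing the smallest depth whose recall reaches the review's target (e.g. 0.8)
# tells you how many bibliographic records reviewers must screen.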
KELVIN: Extracting Knowledge from Large Text Collections
Mayfield, James (Johns Hopkins Applied Physics Laboratory) | McNamee, Paul (Johns Hopkins Applied Physics Laboratory) | Harman, Craig (Johns Hopkins University) | Finin, Tim (University of Maryland, Baltimore County) | Lawrie, Dawn (Loyola University Maryland)
We describe the KELVIN system for extracting entities and relations from large text collections and its use in the TAC Knowledge Base Population Cold Start task run by the U.S. National Institute of Standards and Technology. The Cold Start task starts with an empty knowledge base defined by an ontology of entity types, properties, and relations. Evaluations in 2012 and 2013 were done using collections of local web and news text to de-emphasize linking entities to a background knowledge base such as Wikipedia. Interesting features of KELVIN include a cross-document entity coreference module based on entity mentions, removal of suspect intra-document coreference chains, a slot value consolidator for entities, the application of inference rules to expand the number of asserted facts, and a set of analysis and browsing tools supporting development.
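The inference-rule expansion step can be sketched as follows; the rules and slot names, written in TAC KBP style, are illustrative assumptions rather than KELVIN's actual rule set.

INVERSE = {
    "per:spouse": "per:spouse",                 # symmetric relation
    "per:children": "per:parents",
    "org:subsidiaries": "org:parents",
}

def expand_facts(facts: set[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    """facts are (subject, relation, object) triples; add inverse facts until fixpoint."""
    expanded = set(facts)
    changed = True
    while changed:
        changed = False
        for subj, rel, obj in list(expanded):
            inv = INVERSE.get(rel)
            if inv and (obj, inv, subj) not in expanded:
                expanded.add((obj, inv, subj))
                changed = True
    return expanded

# expand_facts({("Alice", "per:spouse", "Bob")}) also asserts
# ("Bob", "per:spouse", "Alice"), increasing the number of asserted facts.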