Goto

Collaborating Authors

 tydi qa


Poolingformer: Long Document Modeling with Pooling Attention

arXiv.org Artificial Intelligence

In this paper, we introduce a two-level attention schema, Poolingformer, for long document modeling. Its first level uses a smaller sliding window pattern to aggregate information from neighbors. Its second level employs a larger window to increase receptive fields with pooling attention to reduce both computational cost and memory consumption. We first evaluate Poolingformer on two long sequence QA tasks: the monolingual NQ and the multilingual TyDi QA. Experimental results show that Poolingformer sits atop three official leaderboards measured by F1, outperforming previous state-of-the-art models by 1.9 points (79.8 vs. 77.9) on NQ long answer, 1.9 points (79.5 vs. 77.6) on TyDi QA passage answer, and 1.6 points (67.6 vs. 66.0) on TyDi QA minimal answer. We further evaluate Poolingformer on a long sequence summarization task. Experimental results on the arXiv benchmark continue to demonstrate its superior performance.


Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

arXiv.org Artificial Intelligence

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources -- including what researchers typically characterize as high-resource as well as low-resource languages. Our dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language. In total, we have gathered over 700k high-quality relevance judgments for around 77k queries over Wikipedia in these 18 languages, where all assessments have been performed by native speakers hired by our team. Our goal is to spur research that will improve retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved. This overview paper describes the dataset and baselines that we share with the community. The MIRACL website is live at http://miracl.ai/.


Google releases TyDi QA, a data set that aims to capture the uniqueness of languages

#artificialintelligence

Google hopes to spur the development of AI capable of understanding the ways in which languages express different meanings. To this end, company researchers today detailed a data set -- TyDi QA, a question-answering data set covering 11 languages -- inspired by typological diversity, or the notion that different languages express meaning in structurally unique ways. TyDi QA is something of a complement to the English-language Natural Questions corpus Google released last year, and it attempts to capture t he idiosyncrasies and features of tongues like Japanese and Arabic. The researchers point out, for instance, that English changes words to indicate one object ("book") versus many ("books"), and that Arabic has a third form to indicate if there are two of something ("كتابان", kitaban) beyond just singular ("كتاب", kitab) or plural ("كتب", kutub). "Because we selected a set of languages that are typologically distant from each other for this corpus, we expect models performing well on this dataset to generalize across a large number of the languages in the world," wrote Google Research scientist Jonathan Clark in a blog post.