Information Retrieval
DyREx: Dynamic Query Representation for Extractive Question Answering
Zaratiana, Urchade, Khbir, Niama El, Nรบรฑez, Dennis, Holat, Pierre, Tomeh, Nadi, Charnois, Thierry
Extractive question answering (ExQA) is an essential task for Natural Language Processing. The dominant approach to ExQA is one that represents the input sequence tokens (question and passage) with a pre-trained transformer, then uses two learned query vectors to compute distributions over the start and end answer span positions. These query vectors lack the context of the inputs, which can be a bottleneck for the model performance. To address this problem, we propose \textit{DyREx}, a generalization of the \textit{vanilla} approach where we dynamically compute query vectors given the input, using an attention mechanism through transformer layers. Empirical observations demonstrate that our approach consistently improves the performance over the standard one. The code and accompanying files for running the experiments are available at \url{https://github.com/urchade/DyReX}.
NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost
Almeida, Thales Sales, Laitz, Thiago, Serรณdio, Joรฃo, Bonifacio, Luiz Henrique, Lotufo, Roberto, Nogueira, Rodrigo
The widespread availability of search API's (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this work, we describe NeuralSearchX, a metasearch engine based on a multi-purpose large reranking model to merge results and highlight sentences. Due to the homogeneity of our architecture, we could focus our optimization efforts on a single component. We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much cost-effective system with competitive QPS while having close to state-of-the-art results on a wide range of public benchmarks. Human evaluation on two domain-specific tasks shows that our retrieval system outperformed Google API by a large margin in terms of nDCG@10 scores. By describing our architecture and implementation in detail, we hope that the community will build on our design choices.
A search engine for shapes
Born and raised in Shanghai, Tan came to MIT to study high-energy astrophysics and wrote his dissertation on computational modeling of neutron stars. "Coming from China at that time, I had very little experience with computers," he says. "I was fortunate to find many helpful students during my time there." Tan also met his wife, Hong (Zhang) Tan, SM '88, PhD '96, at MIT. The pair were married in the MIT Chapel and today have two sons.
Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport
Marchisio, Kelly, Saad-Eldin, Ali, Duh, Kevin, Priebe, Carey, Koehn, Philipp
Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semisupervised machine translation and crosslingual information retrieval. We improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.
CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization
Faghihi, Hossein Rajaby, Alhafni, Bashar, Zhang, Ke, Ran, Shihao, Tetreault, Joel, Jaimes, Alejandro
Social media has increasingly played a key role in emergency response: first responders can use public posts to better react to ongoing crisis events and deploy the necessary resources where they are most needed. Timeline extraction and abstractive summarization are critical technical tasks to leverage large numbers of social media posts about events. Unfortunately, there are few datasets for benchmarking technical approaches for those tasks. This paper presents CrisisLTLSum, the largest dataset of local crisis event timelines available to date. CrisisLTLSum contains 1,000 crisis event timelines across four domains: wildfires, local fires, traffic, and storms. We built CrisisLTLSum using a semi-automated cluster-then-refine approach to collect data from the public Twitter stream. Our initial experiments indicate a significant gap between the performance of strong baselines compared to the human performance on both tasks. Our dataset, code, and models are publicly available.
SciFact-Open: Towards open-domain scientific claim verification
Wadden, David, Lo, Kyle, Kuehl, Bailey, Cohan, Arman, Beltagy, Iz, Wang, Lucy Lu, Hajishirzi, Hannaneh
While research on scientific claim verification has led to the development of powerful systems that appear to approach human performance, these approaches have yet to be tested in a realistic setting against large corpora of scientific literature. Moving to this open-domain evaluation setting, however, poses unique challenges; in particular, it is infeasible to exhaustively annotate all evidence documents. In this work, we present SciFact-Open, a new test collection designed to evaluate the performance of scientific claim verification systems on a corpus of 500K research abstracts. Drawing upon pooling techniques from information retrieval, we collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models. We find that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1. In addition, analysis of the evidence in SciFact-Open reveals interesting phenomena likely to appear when claim verification systems are deployed in practice, e.g., cases where the evidence supports only a special case of the claim. Our dataset is available at https://github.com/dwadden/scifact-open.
Enhancing Label Consistency on Document-level Named Entity Recognition
Named entity recognition (NER) is a fundamental part of extracting information from documents in biomedical applications. A notable advantage of NER is its consistency in extracting biomedical entities in a document context. Although existing document NER models show consistent predictions, they still do not meet our expectations. We investigated whether the adjectives and prepositions within an entity cause a low label consistency, which results in inconsistent predictions. In this paper, we present our method, ConNER, which enhances the label dependency of modifiers (e.g., adjectives and prepositions) to achieve higher label agreement. ConNER refines the draft labels of the modifiers to improve the output representations of biomedical entities. The effectiveness of our method is demonstrated on four popular biomedical NER datasets; in particular, its efficacy is proved on two datasets with 7.5-8.6% absolute improvements in the F1 score. We interpret that our ConNER method is effective on datasets that have intrinsically low label consistency. In the qualitative analysis, we demonstrate how our approach makes the NER model generate consistent predictions. Our code and resources are available at https://github.com/dmis-lab/ConNER/.
Research on Cross-media Science and Technology Information Data Retrieval
Jiang, Yang, Xue, Zhe, Li, Ang
Since the era of big data, the Internet has been flooded with all kinds of information. Browsing information through the Internet has become an integral part of people's daily life. Unlike the news data and social data in the Internet, the cross-media technology information data has different characteristics. This data has become an important basis for researchers and scholars to track the current hot spots and explore the future direction of technology development. As the volume of science and technology information data becomes richer, the traditional science and technology information retrieval system, which only supports unimodal data retrieval and uses outdated data keyword matching model, can no longer meet the daily retrieval needs of science and technology scholars. Therefore, in view of the above research background, it is of profound practical significance to study the cross-media science and technology information data retrieval system based on deep semantic features, which is in line with the development trend of domestic and international technologies.
Improving Chinese Named Entity Recognition by Search Engine Augmentation
Mao, Qinghua, Li, Jiatong, Meng, Kui
Compared with English, Chinese suffers from more grammatical ambiguities, like fuzzy word boundaries and polysemous words. In this case, contextual information is not sufficient to support Chinese named entity recognition (NER), especially for rare and emerging named entities. Semantic augmentation using external knowledge is a potential way to alleviate this problem, while how to obtain and leverage external knowledge for the NER task remains a challenge. In this paper, we propose a neural-based approach to perform semantic augmentation using external knowledge from search engine for Chinese NER. In particular, a multi-channel semantic fusion model is adopted to generate the augmented input representations, which aggregates external related texts retrieved from the search engine. Experiments have shown the superiority of our model across 4 NER datasets, including formal and social media language contexts, which further prove the effectiveness of our approach.
[100%OFF] Build A Search Engine With Python: Computer Science & Python
Many of the online courses teach you to code but not the theory/way of thinking behind it why would we choose a while but not a for loop, why should we pass 2 parameters to a function but not only one? We provide a platform for thousands of people to expand the understanding of programming and computer science. Founded in 2013 our mission is to spread the love for programming. To achieve this, we're working hard on providing content that will help people build a solid foundation in those subjects. This course will help you to master the foundation and know-how to solve problems with Python code.