Information Retrieval
Generative User-Experience Research for Developing Domain-specific Natural Language Processing Applications
Zhukova, Anastasia, von Sperl, Lukas, Matt, Christian E., Gipp, Bela
Natural Language Processing (NLP) has recently been incorporated extensively into industrial and domain applications. For example, NLP is used to speed up processes (e.g., automatically classifying types of customer feedback or filtering out spam emails), to extract information (e.g., named entity recognition to extract symptoms, diagnoses, and treatments from medical records), or to auto-complete input forms with language models. Despite this broad integration, domain-specific NLP applications may require more user-driven methodologies to address user needs. Often, the data-driven approach falls short in exploring the needs of the domain users (Yang, 2018). On the one hand, domain users are often integrated into development at the late test phase to evaluate the usability of machine learning (ML)/NLP applications (Carney, 2019). Unlike user-driven software development, the development of NLP applications depends mainly on data availability or on experimentation with ML/NLP trends, which thus become the major drivers of application development. On the other hand, the user-driven development of a domain-specific ML/NLP application in medicine showed that close collaboration with the domain users in the earlier stages increases the effectiveness of the final product (Yang, 2017). Therefore, integrating user experience (UX) and human-computer interaction (HCI) research into ML/NLP research addresses users' needs, fuses their expertise, and increases intuitiveness, transparency, simplicity, and trust for the system users (Boukhelifa et al., 2018; Paleyes et al., 2022).
Generative Dense Retrieval: Memory Can Be a Burden
Yuan, Peiwen, Wang, Xinglin, Feng, Shaoxiong, Pan, Boyuan, Li, Yiwei, Wang, Heda, Miao, Xupeng, Li, Kan
Generative Retrieval (GR), which autoregressively decodes relevant document identifiers given a query, has been shown to perform well on small-scale corpora. By memorizing the document corpus with model parameters, GR implicitly achieves deep interaction between query and document. However, such a memorizing mechanism faces three drawbacks: (1) poor memory accuracy for fine-grained features of documents; (2) memory confusion that gets worse as the corpus size increases; (3) huge memory update costs for new documents. To alleviate these problems, we propose the Generative Dense Retrieval (GDR) paradigm. Specifically, GDR first uses the limited memory volume to achieve inter-cluster matching from query to relevant document clusters. The memorizing-free matching mechanism from Dense Retrieval (DR) is then introduced to conduct fine-grained intra-cluster matching from clusters to relevant documents. The coarse-to-fine process maximizes the advantages of GR's deep interaction and DR's scalability. In addition, we design a cluster identifier constructing strategy to facilitate corpus memory and a cluster-adaptive negative sampling strategy to enhance the intra-cluster mapping ability. Empirical results show that GDR obtains an average R@100 improvement of 3.0 on the NQ dataset under multiple settings and has better scalability.
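The coarse-to-fine idea in this abstract can be illustrated with a toy sketch. It is not the GDR model: the generative inter-cluster stage is replaced here by a simple centroid-similarity ranking (an assumption made only to keep the example self-contained), and all function names, vectors, and parameters are hypothetical. The second function shows how a recall-at-k metric such as R@100 is commonly computed.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def centroid(doc_vectors):
    """Mean vector of a cluster's document vectors."""
    vecs = list(doc_vectors.values())
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def coarse_to_fine(query, clusters, top_clusters=1, top_docs=2):
    """clusters maps cluster_id -> {doc_id: vector}.

    Stage 1 (coarse): pick the clusters most similar to the query.
    ASSUMPTION: centroid similarity stands in for GDR's generative
    decoding of cluster identifiers.
    Stage 2 (fine): score only the documents inside those clusters
    with a dense inner product, as a dense retriever would.
    """
    ranked = sorted(
        clusters,
        key=lambda c: dot(query, centroid(clusters[c])),
        reverse=True,
    )
    candidates = []
    for cluster_id in ranked[:top_clusters]:
        for doc_id, vec in clusters[cluster_id].items():
            candidates.append((dot(query, vec), doc_id))
    candidates.sort(reverse=True)
    return [doc_id for _, doc_id in candidates[:top_docs]]


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


# Hypothetical toy corpus: two clusters of two documents each.
toy_clusters = {
    "c0": {"d0": [1.0, 0.0], "d1": [0.9, 0.1]},
    "c1": {"d2": [0.0, 1.0], "d3": [0.1, 0.9]},
}
toy_result = coarse_to_fine([1.0, 0.2], toy_clusters)
```

Restricting the fine-grained scoring to a few clusters is what keeps the second stage cheap; the trade-off is that a wrong cluster choice in the first stage cannot be recovered later.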
Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
Lu, Yao, Bian, Song, Chen, Lequn, He, Yongjun, Hui, Yulong, Lentz, Matthew, Li, Beibin, Liu, Fei, Li, Jialin, Liu, Qi, Liu, Rui, Liu, Xiaoxuan, Ma, Lin, Rong, Kexin, Wang, Jianguo, Wu, Yingjun, Wu, Yongji, Zhang, Huanchen, Zhang, Minjia, Zhang, Qizhen, Zhou, Tianyi, Zhuo, Danyang
In this paper, we investigate the intersection of large generative AI models and cloud-native computing architectures. Recent large models such as ChatGPT, while revolutionary in their capabilities, face challenges like escalating costs and demand for high-end GPUs. Drawing analogies between large-model-as-a-service (LMaaS) and cloud database-as-a-service (DBaaS), we describe an AI-native computing paradigm that harnesses the power of both cloud-native technologies (e.g., multi-tenancy and serverless computing) and advanced machine learning runtime (e.g., batched LoRA inference). These joint efforts aim to optimize costs-of-goods-sold (COGS) and improve resource accessibility. The journey of merging these two domains is just at the beginning and we hope to stimulate future research and development in this area.
BERTologyNavigator: Advanced Question Answering with BERT-based Semantics
Rajpal, Shreya, Usbeck, Ricardo
The development and integration of knowledge graphs and language models is significant for artificial intelligence and natural language processing. In this study, we introduce the BERTologyNavigator -- a two-phased system that combines relation extraction techniques and BERT embeddings to navigate the relationships within the DBLP Knowledge Graph (KG). Our approach focuses on extracting one-hop relations and labelled candidate pairs in the first phase. This is followed by employing BERT's CLS embeddings and additional heuristics for relation selection in the second phase. Our system reaches an F1 score of 0.2175 on the DBLP QuAD Final test dataset for Scholarly QALD and an F1 score of 0.98 on the subset of the DBLP QuAD test dataset during the QA phase.
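For readers unfamiliar with the metric reported here, the following is a minimal sketch of a set-based F1 score for a single question. It assumes answers are compared as exact-match sets; the abstract does not state the challenge's actual evaluation protocol, and the function name is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one question's answers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A benchmark-level score is then typically an average of such per-question values, though the averaging scheme is also an assumption here.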
QAnswer: Towards Question Answering Search over Websites
Guo, Kunpeng, Defretiere, Clement, Diefenbach, Dennis, Gravier, Christophe, Gourru, Antoine
Question Answering (QA) is increasingly used by search engines to provide results to their end-users, yet very few websites currently use QA technologies for their search functionality. To illustrate the potential of QA technologies for the website search practitioner, we demonstrate web searches that combine QA over knowledge graphs and QA over free text -- two tasks that are usually tackled separately. We also discuss the benefits and drawbacks of both approaches for website searches. Our case studies are websites hosted by the Wikimedia Foundation (namely Wikipedia and Wikidata). Unlike a search engine (e.g., Google or Bing), the data are indexed integrally, i.e., we do not index only a subset, and exclusively, i.e., we index only data available on the corresponding website.
Spatial Entity Resolution between Restaurant Locations and Transportation Destinations in Southeast Asia
Location matters in many businesses and services today, particularly for transportation and delivery, scenarios in which it is important to find the correct pickup and drop-off locations very quickly. User experience can be negatively affected if the location information is inaccurate or insufficient. Inaccuracies can originate from imprecise GPS data, manual error happening in the process of data entry, or the lack of effective data quality control. Insufficiencies can also take many forms, including lack of coverage, and lack of detail -- for example, we may know the latitude and longitude of a restaurant location in a mall, but this might not include information about where passengers should be dropped off, or where a delivery courier should park to collect food for delivery. Or the location of a business may be known, but not its contact details or opening hours. Solving this problem can improve precision by removing duplicates, and can enrich detail by (for example) merging a phone number from one record with the hours of operation from another, once these records are known to refer to the same thing. This problem is referred to as entity resolution (see (Talburt, 2011)), and it occurs with various datasets, including those representing people, products, works of literature, etc. For Grab, one entity resolution problem that arises for spatial data is the alignment of transportation destinations and restaurants. Currently Grab maintains two tables separately for transportation and food delivery, because each use case requires some specific features, i.e., food delivery needs information about the estimated delivery time, cuisine types, and opening hours which are absent in the POI table. However, it is highly likely that some entities from both tables
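As a rough illustration of spatial entity resolution between two such tables, the sketch below pairs records that are both geographically close and similarly named. It is not the method used in the paper: the distance threshold, the name-similarity threshold, the record fields, and all function names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_records(pois, restaurants, max_dist_m=100.0, min_name_sim=0.8):
    """Return (poi_id, restaurant_id) pairs judged to be the same place.

    ASSUMPTION: a pair matches when it is within max_dist_m metres and
    its names are at least min_name_sim similar; both thresholds are
    illustrative, not taken from the paper.
    """
    matches = []
    for poi in pois:
        for rest in restaurants:
            close = haversine_m(poi["lat"], poi["lon"],
                                rest["lat"], rest["lon"]) <= max_dist_m
            if close and name_similarity(poi["name"], rest["name"]) >= min_name_sim:
                matches.append((poi["id"], rest["id"]))
    return matches
```

The nested loop compares every pair and so scales quadratically; a real system would usually restrict comparisons to nearby candidates first, for example with a spatial index.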
Wikidata as a seed for Web Extraction
Guo, Kunpeng, Diefenbach, Dennis, Gourru, Antoine, Gravier, Christophe
Wikidata has grown into a knowledge graph of impressive size. To date, it contains more than 17 billion triples collecting information about people, places, films, stars, publications, proteins, and many more. On the other hand, most of the information on the Web is not published in highly structured data repositories like Wikidata, but rather as unstructured and semi-structured content, more concretely in HTML pages containing text and tables. Finding, monitoring, and organizing this data in a knowledge graph requires considerable work from human editors. The volume and complexity of the data make this task difficult and time-consuming. In this work, we present a framework that is able to identify and extract new facts that are published under multiple Web domains so that they can be proposed for validation by Wikidata editors. The framework relies on question-answering technologies. We take inspiration from ideas that are used to extract facts from textual collections and adapt them to extract facts from Web pages. To achieve this, we demonstrate that language models can be adapted to extract facts not only from textual collections but also from Web pages. By exploiting the information already contained in Wikidata, the proposed framework can be trained without the need for any additional learning signals and can extract new facts for a wide range of properties and domains. Following this path, Wikidata can be used as a seed to extract facts on the Web. Our experiments show that we can achieve a mean F1-score of 84.07. Moreover, our estimations show that we can potentially extract millions of facts that can be proposed for human validation. The goal is to help editors in their daily tasks and contribute to the completion of the Wikidata knowledge graph.
On Image Search in Histopathology
Tizhoosh, H. R., Pantanowitz, Liron
Histopathology images can be acquired from camera-mounted microscopes or whole slide scanners. Utilizing similarity calculations to match patients based on these images holds significant potential in research and clinical contexts. Recent advancements in search technologies allow for nuanced quantification of cellular structures across diverse tissue types, facilitating comparisons and enabling inferences about diagnosis, prognosis, and predictions for new patients when compared against a curated database of diagnosed and treated cases. In this paper, we comprehensively review the latest developments in image search technologies for histopathology, offering a concise overview tailored for computational pathology researchers seeking effective, fast and efficient image search methods in their work.
Mapping Transformer Leveraged Embeddings for Cross-Lingual Document Representation
Tashu, Tsegaye Misikir, Kontos, Eduard-Raul, Sabatelli, Matthia, Valdenegro-Toro, Matias
The rapid expansion of online information from diverse sources and the growing multilingual nature of the web underscore the escalating significance of information retrieval (IR) and recommender systems (RS). Today's web is no longer limited to a single language, but is increasingly rich in multiple languages, mirroring the multilingual capacities of its global users Steichen et al. [2014], Tashu et al. [2023]. This diversity highlights the urgent need for cross-lingual recommender systems. Traditional recommender systems often prioritize content in a single language, sidelining a wealth of multilingual documents that may hold valuable insights. This gap leads to the emergence of cross-language information access, where recommender systems suggest items in different languages based on user queries Lops et al. [2010], Narducci et al. [2016], Salamon et al. [2021]. Machine Learning and Deep Learning, which have significantly impacted language representation and processing, are pivotal to enhancing information retrieval and recommender systems, especially in the realm of document recom-
The result presented in this work is based on Eduard-Raul Kontos's bachelor project while he was at the University of Groningen.
A Large-Scale Analysis of Persian Tweets Regarding Covid-19 Vaccination
ShabaniMirzaei, Taha, Chamani, Houmaan, Abaskohi, Amirhossein, Zadeh, Zhivar Sourati Hassan, Bahrak, Behnam
The Covid-19 pandemic had an enormous effect on our lives, especially on people's interactions. With the introduction of Covid-19 vaccines, both positive and negative opinions were raised about whether or not to take them. In this paper, using data gathered from Twitter, including tweets and user profiles, we offer a comprehensive analysis of public opinion in Iran about the Coronavirus vaccines. For this purpose, we applied a search query technique combined with a topic modeling approach to extract vaccine-related tweets. We utilized transformer-based models to classify the content of the tweets and extract themes revolving around vaccination. We also conducted an emotion analysis to evaluate public happiness and anger around this topic. Our results demonstrate that Covid-19 vaccination has attracted considerable attention from different angles, such as governmental issues, safety or hesitancy, and side effects. Moreover, Coronavirus-relevant phenomena like public vaccination and the rate of infection deeply impacted the public's emotional status and users' interactions.