Goto

Collaborating Authors

 Information Retrieval


Analyzing the State of Computer Science Research with the DBLP Discovery Dataset

arXiv.org Artificial Intelligence

The number of scientific publications continues to rise exponentially, especially in Computer Science (CS). However, current solutions to analyze those publications restrict access behind a paywall, offer no features for visual analysis, limit access to their data, only focus on niches or sub-fields, and/or are not flexible and modular enough to be transferred to other datasets. In this thesis, we conduct a scientometric analysis to uncover the implicit patterns hidden in CS metadata and to determine the state of CS research. Specifically, we investigate trends of the quantity, impact, and topics for authors, venues, document types (conferences vs. journals), and fields of study (compared to, e.g., medicine). To achieve this we introduce the CS-Insights system, an interactive web application to analyze CS publications with various dashboards, filters, and visualizations. The data underlying this system is the DBLP Discovery Dataset (D3), which contains metadata from 5 million CS publications. Both D3 and CS-Insights are open-access, and CS-Insights can be easily adapted to other datasets in the future. The most interesting findings of our scientometric analysis include that i) there has been a stark increase in publications, authors, and venues in the last two decades, ii) many authors only recently joined the field, iii) the most cited authors and venues focus on computer vision and pattern recognition, while the most productive prefer engineering-related topics, iv) the preference of researchers to publish in conferences over journals dwindles, v) on average, journal articles receive twice as many citations compared to conference papers, but the contrast is much smaller for the most cited conferences and journals, and vi) journals also get more citations in all other investigated fields of study, while only CS and engineering publish more in conferences than journals.


OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries

arXiv.org Artificial Intelligence

Since solving State-of-the-art algorithms for Approximate Nearest Neighbor Search the problem exactly requires an expensive exhaustive scan of the (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data dependent database - which would be impractical for real-world indices that indices that offer substantially better accuracy and search span billions of objects - practical interactive search systems use efficiency over data-agnostic indices by overfitting to the index Approximate Nearest Neighbor Search (ANNS) algorithms with data distribution. When the query data is drawn from a different highly sub-linear query complexity [10, 18, 24, 30] to answer such distribution - e.g., when index represents image embeddings and queries. The quality of such ANN indices is often measured by query represents textual embeddings - such algorithms lose much k-recall@k which is the overlap between the top-results of the of this performance advantage. On a variety of datasets, for a fixed index search with the ground truth -nearest neighbors (-NNs) in recall target, latency is worse by an order of magnitude or more for the corpus for the query, averaged over a representative query set. Out-Of-Distribution (OOD) queries as compared to In-Distribution State-of-the-art algorithms for ANNS, such as graph-based indices (ID) queries. The question we address in this work is whether ANNS [16, 24, 30] which use data-dependent index construction, algorithms can be made efficient for OOD queries if the index construction achieve better query efficiency over prior data-agnostic methods is given access to a small sample set of these queries. We like LSH [6, 18] (see Section A.1 for more details). Such efficiency answer positively by presenting OOD-DiskANN, which uses a sparing enables these indices to serve queries with > 90% recall with a sample (1% of index set size) of OOD queries, and provides up to latency of a few milliseconds, required in interactive web scenarios.


Fair Ranking with Noisy Protected Attributes

arXiv.org Artificial Intelligence

The fair-ranking problem, which asks to rank a given set of items to maximize utility subject to group fairness constraints, has received attention in the fairness, information retrieval, and machine learning literature. Recent works, however, observe that errors in socially-salient (including protected) attributes of items can significantly undermine fairness guarantees of existing fair-ranking algorithms and raise the problem of mitigating the effect of such errors. We study the fair-ranking problem under a model where socially-salient attributes of items are randomly and independently perturbed. We present a fair-ranking framework that incorporates group fairness requirements along with probabilistic information about perturbations in socially-salient attributes. We provide provable guarantees on the fairness and utility attainable by our framework and show that it is information-theoretically impossible to significantly beat these guarantees. Our framework works for multiple non-disjoint attributes and a general class of fairness constraints that includes proportional and equal representation. Empirically, we observe that, compared to baselines, our algorithm outputs rankings with higher fairness, and has a similar or better fairness-utility trade-off compared to baselines.


A Survey on Conversational Search and Applications in Biomedicine

arXiv.org Artificial Intelligence

This paper aims to provide a radical rundown on Conversation Search (ConvSearch), an approach to enhance the information retrieval method where users engage in a dialogue for the information-seeking tasks. In this survey, we predominantly focused on the human interactive characteristics of the ConvSearch systems, highlighting the operations of the action modules, likely the Retrieval system, Question-Answering, and Recommender system. We labeled various ConvSearch research problems in knowledge bases, natural language processing, and dialogue management systems along with the action modules. We further categorized the framework to ConvSearch and the application is directed toward biomedical and healthcare fields for the utilization of clinical social technology. Finally, we conclude by talking through the challenges and issues of ConvSearch, particularly in Bio-Medicine. Our main aim is to provide an integrated and unified vision of the ConvSearch components from different fields, which benefit the information-seeking process in healthcare systems.


Two Is Better Than One: Dual Embeddings for Complementary Product Recommendations

arXiv.org Artificial Intelligence

Embedding based product recommendations have gained popularity in recent years due to its ability to easily integrate to large-scale systems and allowing nearest neighbor searches in real-time. The bulk of studies in this area has predominantly been focused on similar item recommendations. Research on complementary item recommendations, on the other hand, still remains considerably under-explored. We define similar items as items that are interchangeable in terms of their utility and complementary items as items that serve different purposes, yet are compatible when used with one another. In this paper, we apply a novel approach to finding complementary items by leveraging dual embedding representations for products. We demonstrate that the notion of relatedness discovered in NLP for skip-gram negative sampling (SGNS) models translates effectively to the concept of complementarity when training item representations using co-purchase data. Since sparsity of purchase data is a major challenge in real-world scenarios, we further augment the model using synthetic samples to extend coverage. This allows the model to provide complementary recommendations for items that do not share co-purchase data by leveraging other abundantly available data modalities such as images, text, clicks etc. We establish the effectiveness of our approach in improving both coverage and quality of recommendations on real world data for a major online retail company. We further show the importance of task specific hyperparameter tuning in training SGNS. Our model is effective yet simple to implement, making it a great candidate for generating complementary item recommendations at any e-commerce website.


MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

arXiv.org Artificial Intelligence

Multimodal named entity recognition (MNER) is a critical step in information extraction, which aims to detect entity spans and classify them to corresponding entity types given a sentence-image pair. Existing methods either (1) obtain named entities with coarse-grained visual clues from attention mechanisms, or (2) first detect fine-grained visual regions with toolkits and then recognize named entities. However, they suffer from improper alignment between entity types and visual regions or error propagation in the two-stage manner, which finally imports irrelevant visual information into texts. In this paper, we propose a novel end-to-end framework named MNER-QG that can simultaneously perform MRC-based multimodal named entity recognition and query grounding. Specifically, with the assistance of queries, MNER-QG can provide prior knowledge of entity types and visual regions, and further enhance representations of both texts and images. To conduct the query grounding task, we provide manual annotations and weak supervisions that are obtained via training a highly flexible visual grounding model with transfer learning. We conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MNER-QG outperforms the current state-of-the-art models on the MNER task, and also improves the query grounding performance.


Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts

arXiv.org Artificial Intelligence

Using code-mixed data in natural language processing (NLP) research currently gets a lot of attention. Language identification of social media code-mixed text has been an interesting problem of study in recent years due to the advancement and influences of social media in communication. This paper presents the Instituto Polit\'ecnico Nacional, Centro de Investigaci\'on en Computaci\'on (CIC) team's system description paper for the CoLI-Kanglish shared task at ICON2022. In this paper, we propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts. The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.


Detecting Entities in the Astrophysics Literature: A Comparison of Word-based and Span-based Entity Recognition Methods

arXiv.org Artificial Intelligence

NER refers to the task of identifying A large body of scientific literature is published mentions of different types of entities in in different domains, making it difficult for researchers free-text. Types of entities of interest depend on in their respective fields to find information the domain of the text; for example disease names or keep up-to-date. Automatic information in biomedical text (Islamaj DoฤŸan et al., 2014; extraction, in particular Named Entity Recognition Dai, 2021) or numbers in finance (Loukas et al., (NER), is one of the core methods from the 2022). Methods to recognise such entities should NLP community to assist researchers. It finds also handle different types of the text, including mentions of entities of interest in a given text, both formal and informal text, such as social media such as in medicine (Rybinski et al., 2021), astronomy posts (Karimi et al., 2015; Basaldella et al., 2020).


Probabilistic Rank and Reward: A Scalable Model for Slate Recommendation

arXiv.org Artificial Intelligence

We introduce Probabilistic Rank and Reward (PRR), a scalable probabilistic model for personalized slate recommendation. Our approach allows state-of-the-art estimation of the user interests in the ubiquitous scenario where the user interacts with at most one item from a slate of K items. We show that the probability of a slate being successful can be learned efficiently by combining the reward, whether the user successfully interacted with the slate, and the rank, the item that was selected within the slate. PRR outperforms competing approaches that use one signal or the other and is far more scalable to large action spaces. Moreover, PRR allows fast delivery of recommendations powered by maximum inner product search (MIPS), making it suitable in low latency domains such as computational advertising.


SAH: Shifting-aware Asymmetric Hashing for Reverse $k$-Maximum Inner Product Search

arXiv.org Artificial Intelligence

This paper investigates a new yet challenging problem called Reverse $k$-Maximum Inner Product Search (R$k$MIPS). Given a query (item) vector, a set of item vectors, and a set of user vectors, the problem of R$k$MIPS aims to find a set of user vectors whose inner products with the query vector are one of the $k$ largest among the query and item vectors. We propose the first subquadratic-time algorithm, i.e., Shifting-aware Asymmetric Hashing (SAH), to tackle the R$k$MIPS problem. To speed up the Maximum Inner Product Search (MIPS) on item vectors, we design a shifting-invariant asymmetric transformation and develop a novel sublinear-time Shifting-Aware Asymmetric Locality Sensitive Hashing (SA-ALSH) scheme. Furthermore, we devise a new blocking strategy based on the Cone-Tree to effectively prune user vectors (in a batch). We prove that SAH achieves a theoretical guarantee for solving the RMIPS problem. Experimental results on five real-world datasets show that SAH runs 4$\sim$8$\times$ faster than the state-of-the-art methods for R$k$MIPS while achieving F1-scores of over 90\%. The code is available at \url{https://github.com/HuangQiang/SAH}.