AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Analyzing the State of Computer Science Research with the DBLP Discovery Dataset

Küll, Lennart

arXiv.org Artificial Intelligence Dec-1-2022

The number of scientific publications continues to rise exponentially, especially in Computer Science (CS). However, current solutions to analyze those publications restrict access behind a paywall, offer no features for visual analysis, limit access to their data, only focus on niches or sub-fields, and/or are not flexible and modular enough to be transferred to other datasets. In this thesis, we conduct a scientometric analysis to uncover the implicit patterns hidden in CS metadata and to determine the state of CS research. Specifically, we investigate trends of the quantity, impact, and topics for authors, venues, document types (conferences vs. journals), and fields of study (compared to, e.g., medicine). To achieve this we introduce the CS-Insights system, an interactive web application to analyze CS publications with various dashboards, filters, and visualizations. The data underlying this system is the DBLP Discovery Dataset (D3), which contains metadata from 5 million CS publications. Both D3 and CS-Insights are open-access, and CS-Insights can be easily adapted to other datasets in the future. The most interesting findings of our scientometric analysis include that i) there has been a stark increase in publications, authors, and venues in the last two decades, ii) many authors only recently joined the field, iii) the most cited authors and venues focus on computer vision and pattern recognition, while the most productive prefer engineering-related topics, iv) the preference of researchers to publish in conferences over journals dwindles, v) on average, journal articles receive twice as many citations compared to conference papers, but the contrast is much smaller for the most cited conferences and journals, and vi) journals also get more citations in all other investigated fields of study, while only CS and engineering publish more in conferences than journals.

information retrieval, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2212.00629

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
North America > Dominican Republic (0.04)
North America > Canada > Quebec > Montreal (0.04)
(25 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.92)

Industry:

Information Technology (1.00)
Education (0.67)
Health & Medicine > Therapeutic Area (0.46)
Media > Publishing (0.45)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications > Networks (1.00)
(6 more...)

OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries

Jaiswal, Shikhar, Krishnaswamy, Ravishankar, Garg, Ankit, Simhadri, Harsha Vardhan, Agrawal, Sheshansh

arXiv.org Artificial Intelligence Nov-30-2022

State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data-dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution. When the query data is drawn from a different distribution - e.g., when index represents image embeddings and query represents textual embeddings - such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries as compared to In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to ...

Since solving the problem exactly requires an expensive exhaustive scan of the database - which would be impractical for real-world indices that span billions of objects - practical interactive search systems use Approximate Nearest Neighbor Search (ANNS) algorithms with highly sub-linear query complexity [10, 18, 24, 30] to answer such queries. The quality of such ANN indices is often measured by k-recall@k, which is the overlap between the top-k results of the index search with the ground truth k-nearest neighbors (k-NNs) in the corpus for the query, averaged over a representative query set. State-of-the-art algorithms for ANNS, such as graph-based indices [16, 24, 30] which use data-dependent index construction, achieve better query efficiency over prior data-agnostic methods like LSH [6, 18] (see Section A.1 for more details). Such efficiency enables these indices to serve queries with > 90% recall with a latency of a few milliseconds, required in interactive web scenarios.
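The k-recall@k measure described in this entry can be sketched in a few lines of Python. This is only an illustration of the definition quoted above (overlap between the retrieved top-k and the true k nearest neighbours, averaged over queries); the function names and the toy id lists are assumptions, not taken from the paper.

```python
def k_recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true k nearest neighbours that appear in the top-k retrieved ids."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k


def mean_k_recall_at_k(all_retrieved, all_ground_truth, k):
    """Average k-recall@k over a query set (one id list per query)."""
    scores = [
        k_recall_at_k(retrieved, truth, k)
        for retrieved, truth in zip(all_retrieved, all_ground_truth)
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: ids returned by an index vs. exact nearest neighbours.
retrieved_ids = [[3, 7, 1, 9], [4, 2, 8, 5]]
true_ids = [[3, 1, 7, 0], [2, 4, 6, 8]]
print(mean_k_recall_at_k(retrieved_ids, true_ids, k=4))
```

Each toy query recovers three of its four true neighbours, so the averaged score reflects that overlap regardless of the order in which the ids are returned.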

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2211.12850

Country:

North America > United States > California > San Francisco County > San Francisco (0.28)
North America > United States > Texas > Travis County > Austin (0.05)
Asia > India (0.05)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.34)

Fair Ranking with Noisy Protected Attributes

Mehrotra, Anay, Vishnoi, Nisheeth K.

arXiv.org Artificial Intelligence Nov-30-2022

The fair-ranking problem, which asks to rank a given set of items to maximize utility subject to group fairness constraints, has received attention in the fairness, information retrieval, and machine learning literature. Recent works, however, observe that errors in socially-salient (including protected) attributes of items can significantly undermine fairness guarantees of existing fair-ranking algorithms and raise the problem of mitigating the effect of such errors. We study the fair-ranking problem under a model where socially-salient attributes of items are randomly and independently perturbed. We present a fair-ranking framework that incorporates group fairness requirements along with probabilistic information about perturbations in socially-salient attributes. We provide provable guarantees on the fairness and utility attainable by our framework and show that it is information-theoretically impossible to significantly beat these guarantees. Our framework works for multiple non-disjoint attributes and a general class of fairness constraints that includes proportional and equal representation. Empirically, we observe that our algorithm outputs rankings with higher fairness than baselines and has a similar or better fairness-utility trade-off.
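To make the idea of a group fairness constraint on a ranking concrete, here is a minimal sketch of one common form: a lower bound on each group's share in every prefix of the ranking. It assumes exact group labels, whereas the paper studies noisy ones, and the function name, the floor-based bound, and the toy data are all assumptions for illustration rather than the authors' framework.

```python
import math
from collections import Counter


def satisfies_lower_bounds(ranking, group_of, min_share, k):
    """Return True if every prefix of length 1..k contains at least
    floor(min_share[g] * prefix_length) items of each group g."""
    counts = Counter()
    for position, item in enumerate(ranking[:k], start=1):
        counts[group_of[item]] += 1
        for group, share in min_share.items():
            if counts[group] < math.floor(share * position):
                return False
    return True


# Hypothetical items with (assumed exact) group labels.
group_of = {"a": "A", "b": "B", "c": "A", "d": "B"}
min_share = {"A": 0.5, "B": 0.5}

print(satisfies_lower_bounds(["a", "b", "c", "d"], group_of, min_share, k=4))
print(satisfies_lower_bounds(["a", "c", "b", "d"], group_of, min_share, k=4))
```

The first ranking alternates groups and passes; the second places two items of group A first, so group B misses its bound at the second position.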

constraint, information retrieval, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2211.17067

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

A Survey on Conversational Search and Applications in Biomedicine

Adatrao, Naga Sai Krishna, Gadireddy, Gowtham Reddy, Noh, Jiho

arXiv.org Artificial Intelligence Nov-28-2022

This paper aims to provide a thorough rundown of Conversational Search (ConvSearch), an approach to enhancing information retrieval in which users engage in a dialogue for information-seeking tasks. In this survey, we predominantly focus on the human-interactive characteristics of ConvSearch systems, highlighting the operations of the action modules, namely the Retrieval system, Question-Answering, and Recommender system. We label various ConvSearch research problems in knowledge bases, natural language processing, and dialogue management systems along with the action modules. We further categorize the ConvSearch framework and direct the application toward the biomedical and healthcare fields for the utilization of clinical social technology. Finally, we conclude by discussing the challenges and issues of ConvSearch, particularly in biomedicine. Our main aim is to provide an integrated and unified vision of the ConvSearch components from different fields, which benefit the information-seeking process in healthcare systems.

information, information retrieval, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2211.15328

Country:

North America > United States > Georgia > Cobb County > Kennesaw (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
North America > United States > Connecticut > Fairfield County > Westport (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)

Genre: Overview (1.00)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Two Is Better Than One: Dual Embeddings for Complementary Product Recommendations

Kvernadze, Giorgi, Sudyanti, Putu Ayu G., Subedi, Nishan, Hajiaghayi, Mohammad

arXiv.org Artificial Intelligence Nov-28-2022

Embedding-based product recommendations have gained popularity in recent years due to their ability to integrate easily into large-scale systems and to allow nearest neighbor searches in real time. The bulk of studies in this area has predominantly been focused on similar item recommendations. Research on complementary item recommendations, on the other hand, still remains considerably under-explored. We define similar items as items that are interchangeable in terms of their utility and complementary items as items that serve different purposes, yet are compatible when used with one another. In this paper, we apply a novel approach to finding complementary items by leveraging dual embedding representations for products. We demonstrate that the notion of relatedness discovered in NLP for skip-gram negative sampling (SGNS) models translates effectively to the concept of complementarity when training item representations using co-purchase data. Since sparsity of purchase data is a major challenge in real-world scenarios, we further augment the model using synthetic samples to extend coverage. This allows the model to provide complementary recommendations for items that do not share co-purchase data by leveraging other abundantly available data modalities such as images, text, clicks, etc. We establish the effectiveness of our approach in improving both coverage and quality of recommendations on real-world data for a major online retail company. We further show the importance of task-specific hyperparameter tuning in training SGNS. Our model is effective yet simple to implement, making it a great candidate for generating complementary item recommendations at any e-commerce website.
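One way to read "dual embeddings" in an SGNS setting is that each item keeps both its input (target) vector and its output (context) vector, with input-input similarity capturing interchangeability and input-output inner products capturing relatedness. The sketch below illustrates that reading only; it is an assumption rather than the authors' confirmed method, and the item names, vectors, and function names are hypothetical toy values.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical toy embeddings: one input and one output vector per item.
input_emb = {"phone": [1.0, 0.1], "tablet": [0.9, 0.2], "phone_case": [0.1, 1.0]}
output_emb = {"phone": [0.1, 0.1], "tablet": [0.0, 0.2], "phone_case": [1.0, 0.0]}


def similarity(a, b):
    """Interchangeability score: compare the input vectors of both items."""
    return cosine(input_emb[a], input_emb[b])


def complementarity(a, b):
    """Complementarity score: input vector of a against output vector of b."""
    return dot(input_emb[a], output_emb[b])


def top_complements(item, n):
    """Rank all other items by complementarity to the given item."""
    others = [other for other in output_emb if other != item]
    return sorted(others, key=lambda other: complementarity(item, other), reverse=True)[:n]


print(top_complements("phone", 1))
```

With these toy vectors the phone is most similar to the tablet but is best complemented by the phone case, which is the distinction between similar and complementary items drawn in the abstract.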

information retrieval, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2211.14982

Country:

Oceania > Australia (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.67)
Overview > Innovation (0.48)

Industry:

Information Technology > Services > e-Commerce Services (0.54)
Retail (0.54)
Health & Medicine (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

Jia, Meihuizi, Shen, Lei, Shen, Xin, Liao, Lejian, Chen, Meng, He, Xiaodong, Chen, Zhendong, Li, Jiaqi

arXiv.org Artificial Intelligence Nov-27-2022

Multimodal named entity recognition (MNER) is a critical step in information extraction, which aims to detect entity spans and classify them to corresponding entity types given a sentence-image pair. Existing methods either (1) obtain named entities with coarse-grained visual clues from attention mechanisms, or (2) first detect fine-grained visual regions with toolkits and then recognize named entities. However, they suffer from improper alignment between entity types and visual regions or error propagation in the two-stage manner, which finally imports irrelevant visual information into texts. In this paper, we propose a novel end-to-end framework named MNER-QG that can simultaneously perform MRC-based multimodal named entity recognition and query grounding. Specifically, with the assistance of queries, MNER-QG can provide prior knowledge of entity types and visual regions, and further enhance representations of both texts and images. To conduct the query grounding task, we provide manual annotations and weak supervisions that are obtained via training a highly flexible visual grounding model with transfer learning. We conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MNER-QG outperforms the current state-of-the-art models on the MNER task, and also improves the query grounding performance.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2211.14739

Country:

Asia > China > Beijing > Beijing (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.46)

Industry: Leisure & Entertainment (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts

Tonja, Atnafu Lambebo, Yigezu, Mesay Gemeda, Kolesnikova, Olga, Tash, Moein Shahiki, Sidorov, Grigori, Gelbuk, Alexander

arXiv.org Artificial Intelligence Nov-25-2022

Using code-mixed data in natural language processing (NLP) research currently gets a lot of attention. Language identification of social media code-mixed text has been an interesting problem of study in recent years due to the advancement and influences of social media in communication. This paper presents the Instituto Politécnico Nacional, Centro de Investigación en Computación (CIC) team's system description paper for the CoLI-Kanglish shared task at ICON2022. In this paper, we propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts. The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
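The gap between the weighted and macro F1-scores reported here is typical when classes are imbalanced: macro F1 averages per-class scores equally, while weighted F1 weights each class by its number of true instances. The sketch below computes both from scratch; the label names and predictions are hypothetical toy values, not data from the shared task.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by each class's true count."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical word-level labels: a frequent class, a smaller one, and a rare one.
y_true = ["kn", "kn", "kn", "en", "en", "mixed"]
y_pred = ["kn", "kn", "kn", "en", "kn", "en"]
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In the toy example the rare class is never predicted correctly, which pulls the macro score well below the weighted one, the same direction as the 0.61 versus 0.84 gap in the abstract.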

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2211.14459

Country:

Africa (0.05)
South America (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.79)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.56)

Detecting Entities in the Astrophysics Literature: A Comparison of Word-based and Span-based Entity Recognition Methods

Dai, Xiang, Karimi, Sarvnaz

arXiv.org Artificial Intelligence Nov-24-2022

A large body of scientific literature is published in different domains, making it difficult for researchers in their respective fields to find information or keep up-to-date. Automatic information extraction, in particular Named Entity Recognition (NER), is one of the core methods from the NLP community to assist researchers. It finds mentions of entities of interest in a given text, such as in medicine (Rybinski et al., 2021), astronomy ...

NER refers to the task of identifying mentions of different types of entities in free-text. Types of entities of interest depend on the domain of the text; for example disease names in biomedical text (Islamaj Doğan et al., 2014; Dai, 2021) or numbers in finance (Loukas et al., 2022). Methods to recognise such entities should also handle different types of the text, including both formal and informal text, such as social media posts (Karimi et al., 2015; Basaldella et al., 2020).

astrophysic literature, information retrieval, natural language, (19 more...)

arXiv.org Artificial Intelligence

2211.13819

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.72)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.63)

Probabilistic Rank and Reward: A Scalable Model for Slate Recommendation

Aouali, Imad, Hammou, Achraf Ait Sidi, Ivanov, Sergey, Sakhi, Otmane, Rohde, David, Vasile, Flavian

arXiv.org Artificial Intelligence Nov-24-2022

We introduce Probabilistic Rank and Reward (PRR), a scalable probabilistic model for personalized slate recommendation. Our approach allows state-of-the-art estimation of the user interests in the ubiquitous scenario where the user interacts with at most one item from a slate of K items. We show that the probability of a slate being successful can be learned efficiently by combining the reward, whether the user successfully interacted with the slate, and the rank, the item that was selected within the slate. PRR outperforms competing approaches that use one signal or the other and is far more scalable to large action spaces. Moreover, PRR allows fast delivery of recommendations powered by maximum inner product search (MIPS), making it suitable in low latency domains such as computational advertising.

information retrieval, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2208.06263

Country:

North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Information Management > Search (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.66)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

SAH: Shifting-aware Asymmetric Hashing for Reverse $k$-Maximum Inner Product Search

Huang, Qiang, Wang, Yanhao, Tung, Anthony K. H.

arXiv.org Artificial Intelligence Nov-23-2022

This paper investigates a new yet challenging problem called Reverse $k$-Maximum Inner Product Search (R$k$MIPS). Given a query (item) vector, a set of item vectors, and a set of user vectors, the problem of R$k$MIPS aims to find the set of user vectors whose inner products with the query vector rank among the $k$ largest over the query and item vectors. We propose the first subquadratic-time algorithm, i.e., Shifting-aware Asymmetric Hashing (SAH), to tackle the R$k$MIPS problem. To speed up the Maximum Inner Product Search (MIPS) on item vectors, we design a shifting-invariant asymmetric transformation and develop a novel sublinear-time Shifting-Aware Asymmetric Locality Sensitive Hashing (SA-ALSH) scheme. Furthermore, we devise a new blocking strategy based on the Cone-Tree to effectively prune user vectors (in a batch). We prove that SAH achieves a theoretical guarantee for solving the R$k$MIPS problem. Experimental results on five real-world datasets show that SAH runs 4$\sim$8$\times$ faster than the state-of-the-art methods for R$k$MIPS while achieving F1-scores of over 90%. The code is available at https://github.com/HuangQiang/SAH.
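The problem definition in this entry can be illustrated with a brute-force baseline that simply scores every item against every user. This is the quadratic approach that SAH is designed to avoid, not the SAH algorithm itself; the function names, the tie-handling rule (ties count in the query's favour), and the toy vectors are assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reverse_k_mips(query, items, users, k):
    """Return indices of users for whom the query ranks among the k largest
    inner products over the item vectors plus the query itself."""
    result = []
    for index, user in enumerate(users):
        query_score = dot(query, user)
        # Count items that strictly beat the query for this user.
        better = sum(1 for item in items if dot(item, user) > query_score)
        if better < k:
            result.append(index)
    return result


# Hypothetical toy vectors.
items = [(1, 0), (0, 1), (1, 1)]
users = [(1, 0), (0, 1), (-1, -1)]
query = (2, 0)

print(reverse_k_mips(query, items, users, k=1))
print(reverse_k_mips(query, items, users, k=3))
```

The cost is one inner product per (user, item) pair, which is what motivates the hashing and pruning techniques described in the abstract.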

data mining, information retrieval, machine learning, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1609/aaai.v37i4.25550

2211.12751

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Belgium > Flanders > East Flanders > Ghent (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)
(2 more...)
