AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Review for NeurIPS paper: HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory

Neural Information Processing Systems | Feb-11-2025, 22:47:13 GMT

The paper makes an inaccurate claim about the presence of billion-scale ANNS solutions. The performance gain of the proposed HM-ANN algorithm seems marginal when considering its learning curve in practice. The experiments do not evaluate the performance of data fetching, so it is hard to conclude that the proposed HM-ANN achieves better utilization of heterogeneous memory (HM). The paper claims that the proposed HM-ANN is the first billion-scale ANNS solution on a single machine, without using compression (see the last paragraph of the Introduction).

efficient billion-point nearest neighbor search, hm-ann, optimization, (9 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.40)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)

Review for NeurIPS paper: HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory

Neural Information Processing Systems | Feb-11-2025, 22:47:07 GMT

The paper attempts to scale nearest neighbor search using heterogeneous memory hardware. In this regard, the authors devised a practical trick on top of HNSW. It is a clean node promotion strategy along the memory hierarchy using the degree information. The method was evaluated on some common large datasets, but not necessarily difficult ones. Reviewers found the setup to leverage the memory hierarchy interesting, and the benefits obtained from it appear promising.

efficient billion-point nearest neighbor search, heterogeneous memory, memory hierarchy, (9 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.66)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.66)

Reviews: Flexible Models for Microclustering with Application to Entity Resolution

Neural Information Processing Systems | Feb-11-2025, 19:19:27 GMT

The following are the main strengths of the paper. It points out and defines an important property of cluster sizes that existing infinitely exchangeable clustering models do not satisfy. There could be many applications, including but not limited to entity resolution, that require this property to be satisfied. It proposes a framework for defining infinitely exchangeable clustering models that satisfy this micro-clustering property, and analyzes why the DP mixture model is an unsatisfactory instance of this class. It then proposes two specific and interesting instances of this class using specific distributions for the number of clusters and cluster sizes and derives reseating algorithms for these instances.

algorithm, dataset, micro-clustering property, (14 more...)

Country: Asia > Middle East > Syria (0.05)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.64)
Information Technology > Artificial Intelligence > Machine Learning (0.38)

Meta-Learning MCMC Proposals

Neural Information Processing Systems | Feb-11-2025, 18:25:36 GMT

Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progress in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals. The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling.

generalize, meta-learning mcmc proposal

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.66)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.66)

On the query complexity of sampling from non-log-concave distributions

He, Yuchen, Zhang, Chihao

arXiv.org Machine Learning | Feb-11-2025

We study the problem of sampling from a $d$-dimensional distribution with density $p(x)\propto e^{-f(x)}$, which does not necessarily satisfy good isoperimetric conditions. Specifically, we show that for any $L,M$ satisfying $LM\ge d\ge 5$, $\epsilon\in \left(0,\frac{1}{32}\right)$, and any algorithm with query accesses to the value of $f(x)$ and $\nabla f(x)$, there exists an $L$-log-smooth distribution with second moment at most $M$ such that the algorithm requires $\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)}$ queries to compute a sample whose distribution is within $\epsilon$ in total variation distance to the target distribution. We complement the lower bound with an algorithm requiring $\left(\frac{LM}{d\epsilon}\right)^{\mathcal O(d)}$ queries, thereby characterizing the tight (up to the constant in the exponent) query complexity for sampling from the family of non-log-concave distributions. Our results are in sharp contrast with the recent work of Huang et al. (COLT'24), where an algorithm with quasi-polynomial query complexity was proposed for sampling from a non-log-concave distribution when $M=\mathtt{poly}(d)$. Their algorithm works under the stronger condition that all distributions along the trajectory of the Ornstein-Uhlenbeck process, starting from the target distribution, are $\mathcal O(1)$-log-smooth. We investigate this condition and prove that it is strictly stronger than requiring the target distribution to be $\mathcal O(1)$-log-smooth. Additionally, we study this condition in the context of mixtures of Gaussians. Finally, we place our results within the broader theme of "sampling versus optimization", as studied in Ma et al. (PNAS'19). We show that for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in the dimension $d$.

artificial intelligence, machine learning, natural language, (17 more...)

2502.062

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.81)

Falconn++: A Locality-sensitive Filtering Approach for Approximate Nearest Neighbor Search

Neural Information Processing Systems | Feb-10-2025, 10:23:32 GMT

Falconn++ can filter out potentially far-away points in any hash bucket before querying, which results in higher quality candidates compared to other hashing-based solutions. Theoretically, Falconn++ asymptotically achieves lower query time complexity than Falconn, an optimal locality-sensitive hashing scheme on angular distance. Empirically, Falconn++ achieves a higher recall-speed tradeoff than Falconn on many real-world data sets. Falconn++ is also competitive with HNSW, an efficient representative of graph-based solutions on high search recall regimes.

approximate nearest neighbor search, falconn, locality-sensitive filtering approach, (1 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.49)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)

RideKE: Leveraging Low-Resource, User-Generated Twitter Content for Sentiment and Emotion Detection in Kenyan Code-Switched Dataset

Etori, Naome A., Gini, Maria L.

arXiv.org Artificial Intelligence | Feb-10-2025

Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and the major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult as Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms other models; for sentiment analysis, the XLM-R supervised model achieves the highest accuracy (69.2%) and F1 score (66.1%), with the XLM-R semi-supervised model at 67.2% accuracy and 64.1% F1 score. In emotion analysis, DistilBERT supervised leads in accuracy (59.8%) and F1 score (31%), with mBERT semi-supervised at 59% accuracy and 26.5% F1 score. AfriBERTa models show the lowest accuracy and F1 scores. All models tend to predict neutral sentiment, with Afri-BERT showing the highest bias and unique sensitivity to empathy emotion. https://github.com/NEtori21/Ride_hailing

information retrieval, machine learning, natural language, (21 more...)

doi: 10.18653/v1/2024.wassa-1.19

2502.0618

Country:

Africa > Kenya > Nairobi City County > Nairobi (0.07)
Africa > Kenya > Nairobi Province (0.06)
Africa > Kenya > Mombasa County > Mombasa (0.05)
(18 more...)

Genre: Research Report > New Finding (0.54)

Industry:

Transportation > Passenger (1.00)
Information Technology (1.00)
Transportation > Ground > Road (0.93)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

Spalding-Jamieson, Jack, Robson, Eliot Wong, Zheng, Da Wei

arXiv.org Machine Learning | Feb-10-2025

For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
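For orientation, the reassignment step of Lloyd's algorithm referred to in this abstract can be sketched as a brute-force baseline. This is only an illustrative sketch, not the authors' Seeded Search-Graph method: the function names `assign_points` and `update_centers` and the toy data are assumptions made here. Each point scans every center, so one pass costs on the order of n·k·d operations, which is the cost that approximate nearest-neighbor search is meant to reduce.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_points(points: List[Point], centers: List[Point]) -> List[int]:
    """Brute-force reassignment: index of the closest center for each point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update_centers(
    points: List[Point], labels: List[int], centers: List[Point]
) -> List[List[float]]:
    """Move each center to the mean of its assigned points.

    A center with no assigned points is left where it was.
    """
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += p[t]
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(len(centers))
    ]


# Toy data (hypothetical): two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
labels = assign_points(points, centers)
new_centers = update_centers(points, labels, centers)
```

Replacing the inner `min` over all centers with an approximate nearest-neighbor query is the natural speed-up the abstract alludes to, which it notes is not sufficient on its own.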

information retrieval, machine learning, natural language, (15 more...)

2502.06163

Country:

North America > United States > Illinois (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > Czechia > Prague (0.04)
(22 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Evaluating Entity Retrieval in Electronic Health Records: a Semantic Gap Perspective

Zhao, Zhengyun, Yuan, Hongyi, Liu, Jingjing, Chen, Haichao, Ying, Huaiyuan, Zhou, Songchi, Yu, Sheng

arXiv.org Artificial Intelligence | Feb-10-2025

Entity retrieval plays a crucial role in the utilization of Electronic Health Records (EHRs) and is applied across a wide range of clinical practices. However, a comprehensive evaluation of this task is lacking due to the absence of a public benchmark. In this paper, we propose the development and release of a novel benchmark for evaluating entity retrieval in EHRs, with a particular focus on the semantic gap issue. Using discharge summaries from the MIMIC-III dataset, we incorporate ICD codes and prescription labels associated with the notes as queries, and annotate relevance judgments using GPT-4. In total, we use 1,000 patient notes, generate 1,246 queries, and provide over 77,000 relevance annotations. To offer the first assessment of the semantic gap, we introduce a novel classification system for relevance matches. Leveraging GPT-4, we categorize each relevant pair into one of five categories: string, synonym, abbreviation, hyponym, and implication. Using the proposed benchmark, we evaluate several retrieval methods, including BM25, query expansion, and state-of-the-art dense retrievers. Our findings show that BM25 provides a strong baseline but struggles with semantic matches. Query expansion significantly improves performance, though it slightly reduces string match capabilities. Dense retrievers outperform traditional methods, particularly for semantic matches, and general-domain dense retrievers often surpass those trained specifically in the biomedical domain.
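To make the BM25 baseline and the semantic gap concrete, here is a minimal sketch of Okapi BM25 scoring. It is not the benchmark's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the three toy sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # purely lexical: unmatched terms contribute nothing
            # Smoothed IDF variant that stays non-negative (an assumption here).
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy notes (invented for illustration, not taken from MIMIC-III).
docs = [
    "patient has a history of myocardial infarction".split(),
    "patient reports a prior heart attack".split(),
    "no cardiac history noted".split(),
]
scores = bm25_scores("heart attack".split(), docs)
```

In this toy example only the second note shares tokens with the query, so the note that mentions "myocardial infarction" scores zero even though it describes the same condition. That is the kind of synonym mismatch the abstract groups under the semantic gap.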

large language model, machine learning, natural language, (18 more...)

2502.06252

Country:

Asia > China > Beijing > Beijing (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Oregon (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Health Care Technology > Medical Record (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.55)

AI Enhanced Ontology Driven NLP for Intelligent Cloud Resource Query Processing Using Knowledge Graphs

Sunkara, Krishna Chaitanya, Narukulla, Krishnaiah

arXiv.org Artificial Intelligence | Feb-9-2025

Conventional resource search in cloud infrastructure relies on keyword-based searches or GUIDs, which demand exact matches and significant user effort to locate resources. These approaches often fail to interpret the intent behind natural language queries, making resource discovery inefficient and inaccessible to users. Although some NLP-based search engines exist, they are limited and focus mainly on analyzing the query itself and extracting identifiers to find resources. They fail to search resources based on their behavior, operations, capabilities, relationships, features, business relevance, dynamically changing state, or the knowledge these resources hold. Search criteria have been changing with the inundation of AI-based services, which involve discovering not just the requested resources and identifiers but also insights. The real intent of a search has never been just to list resources, but to obtain actual context, such as understanding the causes of some behavior in the system, compliance checks, capacity estimations, network constraints, troubleshooting, or business insights. This paper proposes an advanced Natural Language Processing (NLP) approach enhanced by ontology-based semantics to enable intuitive, human-readable queries that allow users to discover the intent of the search itself. By constructing an ontology of cloud resources, their interactions, and behaviors, the proposed framework enables dynamic intent extraction and relevance ranking using Latent Semantic Indexing (LSI) and AI models. It introduces an automated pipeline that integrates ontology extraction by AI-powered data crawlers, building a semantic knowledge base for context-aware resource discovery.

ai model, cloud resource, ontology, (8 more...)

2502.18484

Country:

North America > United States > North Carolina > Wake County > Raleigh (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Industry:

Information Technology > Services (1.00)
Health & Medicine > Therapeutic Area (0.95)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.92)
