AITopics

2503.08131

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Monaco (0.04)
Asia > Russia > Siberian Federal District > Novosibirsk Oblast > Novosibirsk (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.52)

arXiv.org Artificial IntelligenceMar-10-2025

Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark

Nguyen, Phu-Vinh, Tran, Minh-Nam, Nguyen, Long, Dinh, Dien

With the rapid development of natural language processing, many language models have been invented for multiple tasks. One important task is information retrieval (IR), which requires models to retrieve relevant documents. Despite its importance in many real-life applications, especially in retrieval augmented generation (RAG) systems, this task lacks Vietnamese benchmarks. This situation causes difficulty in assessing and comparing many existing Vietnamese embedding language models on the task and slows down the advancement of Vietnamese natural language processing (NLP) research. In this work, we aim to provide the Vietnamese research community with a new benchmark for information retrieval, which mainly focuses on retrieval and reranking tasks. Furthermore, we also present a new objective function based on the InfoNCE loss function, which is used to train our Vietnamese embedding model. Our function aims to be better than the origin in information retrieval tasks. Finally, we analyze the effect of temperature, a hyper-parameter in both objective functions, on the performance of text embedding models.

benchmark, language model, loss function, (15 more...)

2503.0747

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
(8 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Artificial IntelligenceMar-10-2025

Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases

Lei, Yongjia, Han, Haoyu, Rossi, Ryan A., Dernoncourt, Franck, Lipka, Nedim, Halappanavar, Mahantesh M, Tang, Jiliang, Wang, Yu

Text-rich Graph Knowledge Bases (TG-KBs) have become increasingly crucial for answering queries by providing textual and structural knowledge. However, current retrieval methods often retrieve these two types of knowledge in isolation without considering their mutual reinforcement and some hybrid methods even bypass structural retrieval entirely after neighboring aggregation. To fill in this gap, we propose a Mixture of Structural-and-Textual Retrieval (MoR) to retrieve these two types of knowledge via a Planning-Reasoning-Organizing framework. In the Planning stage, MoR generates textual planning graphs delineating the logic for answering queries. Following planning graphs, in the Reasoning stage, MoR interweaves structural traversal and textual matching to obtain candidates from TG-KBs. In the Organizing stage, MoR further reranks fetched candidates based on their structural trajectory. Extensive experiments demonstrate the superiority of MoR in harmonizing structural and textual retrieval with insights, including uneven retrieving performance across different query logics and the benefits of integrating structural trajectories for candidate reranking. Our code is available at https://github.com/Yoega/MoR.

graph, knowledge, retrieval, (15 more...)

2502.20317

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
North America > United States > Oregon (0.04)
North America > United States > Michigan (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.85)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)
(2 more...)

Kuffo, Leonardo, Krippner, Elena, Boncz, Peter

PDX: A Data Layout for Vector Similarity Search

We propose Partition Dimensions Across (PDX), a data layout for vectors (e.g., embeddings) that, similar to PAX [6], stores multiple vectors in one block, using a vertical layout for the dimensions (Figure 1). PDX accelerates exact and approximate similarity search thanks to its dimension-by-dimension search strategy that operates on multiple-vectors-at-a-time in tight loops. It beats SIMD-optimized distance kernels on standard horizontal vector storage (avg 40% faster), only relying on scalar code that gets auto-vectorized. We combined the PDX layout with recent dimension-pruning algorithms ADSampling [19] and BSA [52] that accelerate approximate vector search. We found that these algorithms on the horizontal vector layout can lose to SIMD-optimized linear scans, even if they are SIMD-optimized. However, when used on PDX, their benefit is restored to 2-7x. We find that search on PDX is especially fast if a limited number of dimensions has to be scanned fully, which is what the dimension-pruning approaches do. We finally introduce PDX-BOND, an even more flexible dimension-pruning strategy, with good performance on exact search and reasonable performance on approximate search. Unlike previous pruning algorithms, it can work on vector data "as-is" without preprocessing; making it attractive for vector databases with frequent updates.

dimension, information retrieval, machine learning, (19 more...)

2503.04422

Country:

Europe > Germany > Berlin (0.05)
Europe > Netherlands > North Holland > Amsterdam (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Kardan, Mehmet, Piryani, Bhawna, Jatowt, Adam

Evaluating Answer Reranking Strategies in Time-sensitive Question Answering

Despite advancements in state-of-the-art models and information retrieval techniques, current systems still struggle to handle temporal information and to correctly answer detailed questions about past events. In this paper, we investigate the impact of temporal characteristics of answers in Question Answering (QA) by exploring several simple answer selection techniques. Our findings emphasize the role of temporal features in selecting the most relevant answers from diachronic document collections and highlight differences between explicit and implicit temporal questions.

dataset, information, publication date, (14 more...)

2503.04972

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.15)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.63)

IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval

Song, Tingyu, Gan, Guo, Shang, Mingsheng, Zhao, Yilun

We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.

instruction, retrieval, retriever, (14 more...)

2503.04644

Country:

Asia > India (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Government (1.00)
Education (0.93)
Law > Litigation (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems

Ouyang, Biao, Zhang, Yingying, Cheng, Hanyin, Shu, Yang, Guo, Chenjuan, Yang, Bin, Wen, Qingsong, Fan, Lunting, Jensen, Christian S.

With the continued migration of storage to cloud database systems,the impact of slow queries in such systems on services and user experience is increasing. Root-cause diagnosis plays an indispensable role in facilitating slow-query detection and revision. This paper proposes a method capable of both identifying possible root cause types for slow queries and ranking these according to their potential for accelerating slow queries. This enables prioritizing root causes with the highest impact, in turn improving slow-query revision effectiveness. To enable more accurate and detailed diagnoses, we propose the multimodal Ranking for the Root Causes of slow queries (RCRank) framework, which formulates root cause analysis as a multimodal machine learning problem and leverages multimodal information from query statements, execution plans, execution logs, and key performance indicators. To obtain expressive embeddings from its heterogeneous multimodal input, RCRank integrates self-supervised pre-training that enhances cross-modal alignment and task relevance. Next, the framework integrates root-cause-adaptive cross Transformers that enable adaptive fusion of multimodal features with varying characteristics. Finally, the framework offers a unified model that features an impact-aware training objective for identifying and ranking root causes. We report on experiments on real and synthetic datasets, finding that RCRank is capable of consistently outperforming the state-of-the-art methods at root cause identification and ranking according to a range of metrics.

information, query, slow query, (17 more...)

2503.04252

Country:

Asia > China (0.05)
North America > United States (0.04)
Europe > Denmark > North Jutland > Aalborg (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.67)

Technology:

Information Technology > Databases (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

arXiv.org Artificial IntelligenceMar-4-2025

A Comprehensive Survey on Composed Image Retrieval

Song, Xuemeng, Lin, Haoqiang, Wen, Haokun, Hou, Bohan, Xu, Mingzhu, Nie, Liqiang

Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration. The curated collection of related works is maintained and continuously updated in https://github.com/haokunwen/Awesome-Composed-Image-Retrieval.

modification text, proceedings, target image, (12 more...)

2502.18495

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)
(6 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment (0.45)
Education (0.45)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

arXiv.org Artificial IntelligenceMar-3-2025

\textsc{Perseus}: Tracing the Masterminds Behind Cryptocurrency Pump-and-Dump Schemes

Fu, Honglin, Feng, Yebo, Wu, Cong, Xu, Jiahua

Masterminds are entities organizing, coordinating, and orchestrating cryptocurrency pump-and-dump schemes, a form of trade-based manipulation undermining market integrity and causing financial losses for unwitting investors. Previous research detects pump-and-dump activities in the market, predicts the target cryptocurrency, and examines investors and \ac{osn} entities. However, these solutions do not address the root cause of the problem. There is a critical gap in identifying and tracing the masterminds involved in these schemes. In this research, we develop a detection system \textsc{Perseus}, which collects real-time data from the \acs{osn} and cryptocurrency markets. \textsc{Perseus} then constructs temporal attributed graphs that preserve the direction of information diffusion and the structure of the community while leveraging \ac{gnn} to identify the masterminds behind pump-and-dump activities. Our design of \textsc{Perseus} leads to higher F1 scores and precision than the \ac{sota} fraud detection method, achieving fast training and inferring speeds. Deployed in the real world from February 16 to October 9 2024, \textsc{Perseus} successfully detects $438$ masterminds who are efficient in the pump-and-dump information diffusion networks. \textsc{Perseus} provides regulators with an explanation of the risks of masterminds and oversight capabilities to mitigate the pump-and-dump schemes of cryptocurrency.

crowd-pump event, mastermind, spreader, (14 more...)

2503.01686

Country:

Europe (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)

Genre: Research Report (0.82)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(3 more...)

Dalal, Dhairya, Gupta, Sharmi Dev, Binaei, Bentolhoda

A Semantic Search Pipeline for Causality-driven Adhoc Information Retrieval

arXiv.org Artificial IntelligenceMar-2-2025

We present a unsupervised semantic search pipeline for the Causality-driven Adhoc Information Retrieval (CAIR-2021) shared task. The CAIR shared task expands traditional information retrieval to support the retrieval of documents containing the likely causes of a query event. A successful system must be able to distinguish between topical documents and documents containing causal descriptions of events that are causally related to the query event. Our approach involves aggregating results from multiple query strategies over a semantic and lexical index. The proposed approach leads the CAIR-2021 leaderboard and outperformed both traditional IR and pure semantic embedding-based approaches.

baseline, query event, semantic search pipeline, (6 more...)

2503.01003

Country:

Asia > India (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Ireland (0.04)
(2 more...)

Genre: Research Report (0.41)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)