AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where the information being sought is spread across widely different formats, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at human speed.

Perplexity will put ads in its AI search engine and share revenue with publishers

Engadget Jul-30-2024, 13:00:52 GMT

When people type a question into Perplexity, the two-year-old search engine scours the internet and uses information from multiple sources, including online publishers, to synthesize an answer using AI. Soon, Perplexity will start sharing revenue with some publishers as part of an advertising platform it plans to launch around the end of September, the company announced on Tuesday. The initiative, known as the Perplexity Publishers' Program, comes less than two months after the San Francisco-based startup, backed by investors like Jeff Bezos and NVIDIA and valued at $3 billion, came under fire from Forbes, Wired, and Condé Nast for allegedly scraping content without permission and ignoring robots.txt. Perplexity's initial partners include TIME, Fortune, The Texas Tribune, Der Spiegel and Automattic, the company behind WordPress.com. It's not clear exactly how much revenue Perplexity will share with publishers.

perplexity, publisher, search engine, (13 more...)

Engadget

Country:

North America > United States > Texas (0.25)
North America > United States > California > San Francisco County > San Francisco (0.25)

Industry:

Media > Publishing (1.00)
Information Technology (0.71)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.75)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Event-Arguments Extraction Corpus and Modeling using BERT for Arabic

Aljabari, Alaa, Duaibes, Lina, Jarrar, Mustafa, Khalilia, Mohammed

arXiv.org Artificial Intelligence Jul-30-2024

Event-argument extraction is a challenging task, particularly in Arabic due to sparse linguistic resources. To fill this gap, we introduce the \hadath corpus (550k tokens) as an extension of Wojood, enriched with event-argument annotations. We used three types of event arguments: agent, location, and date, which we annotated as relation types. Our inter-annotator agreement evaluation resulted in an 82.23% Kappa score and an 87.2% F1-score. Additionally, we propose a novel method for event relation extraction using BERT, in which we treat the task as text entailment. This method achieves an F1-score of 94.01%. To further evaluate the generalization of our proposed method, we collected and annotated another out-of-domain corpus (about 80k tokens) called \testNLI and used it as a second test set, on which our approach achieved promising results (83.59% F1-score). Last but not least, we propose an end-to-end system for event-arguments extraction. This system is implemented as part of SinaTools, and both corpora are publicly available at https://sina.birzeit.edu/wojood
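As a rough illustration of what treating relation extraction as text entailment can look like, the sketch below turns a sentence, an event mention, and candidate argument mentions into premise/hypothesis pairs, one per relation type named in the abstract. The English templates, the `NLIPair` type, and the `build_pairs` helper are assumptions made for this example; the abstract does not say how the authors phrase their hypotheses, and no entailment model is called here.

```python
from dataclasses import dataclass

# Hypothetical English templates, one per relation type from the abstract.
TEMPLATES = {
    "agent": "{arg} is the agent of {event}",
    "location": "{event} took place in {arg}",
    "date": "{event} happened on {arg}",
}


@dataclass(frozen=True)
class NLIPair:
    premise: str      # the original sentence
    hypothesis: str   # a verbalized candidate relation
    relation: str     # relation type the hypothesis encodes


def build_pairs(sentence, event, candidates):
    """Create one premise/hypothesis pair per (candidate argument, relation type)."""
    pairs = []
    for arg in candidates:
        for relation, template in TEMPLATES.items():
            hypothesis = template.format(arg=arg, event=event)
            pairs.append(NLIPair(sentence, hypothesis, relation))
    return pairs


pairs = build_pairs(
    "The conference was held in Ramallah on Monday.",
    event="The conference",
    candidates=["Ramallah", "Monday"],
)
for pair in pairs:
    print(pair.relation, "|", pair.hypothesis)
```

Under this framing, a sentence-pair classifier (such as a fine-tuned BERT model) would label each pair as entailed or not, and the entailed pairs would be read back as extracted relations.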

argument, extraction, relation, (15 more...)

arXiv.org Artificial Intelligence

2407.21153

Country:

Africa > Middle East > Egypt (0.14)
Asia > Middle East > Palestine > Gaza Strip (0.05)
Asia > Thailand > Bangkok > Bangkok (0.04)
(6 more...)

Genre: Research Report > Promising Solution (0.34)

Industry: Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)
(2 more...)

Label-Guided Prompt for Multi-label Few-shot Aspect Category Detection

Guan, ChaoFeng, Zhu, YaoHui, Bai, Yu, Wang, LingYun

arXiv.org Artificial Intelligence Jul-30-2024

Multi-label few-shot aspect category detection aims at identifying multiple aspect categories from sentences with a limited number of training instances. The representation of sentences and categories is a key issue in this task. Most current methods extract keywords for the sentence representations and the category representations. Sentences often contain many category-independent words, which leads to suboptimal performance of keyword-based methods. Instead of directly extracting keywords, we propose a label-guided prompt method to represent sentences and categories. To be specific, we design label-specific prompts to represent sentences by combining crucial contextual and semantic information. Further, the label is introduced into a prompt to obtain category descriptions by utilizing a large language model. Such category descriptions contain the characteristics of the aspect categories, guiding the construction of discriminative category prototypes. Experimental results on two public datasets show that our method outperforms current state-of-the-art methods with an improvement of 3.86% to 4.75% in the Macro-F1 score.
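The prototype idea mentioned in the abstract can be pictured with a generic prototype-based multi-label classifier: each category prototype is the mean of a few support vectors, and a sentence receives every label whose prototype is similar enough. This is a minimal sketch with hand-made 3-dimensional vectors; the cosine measure, the 0.5 threshold, and all function names are assumptions for illustration, not the paper's label-guided prompt method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a list of vectors (the category prototype)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict_labels(sentence_vec, prototypes, threshold=0.5):
    """Return every category whose prototype is at least `threshold` similar."""
    return sorted(
        label
        for label, proto in prototypes.items()
        if cosine(sentence_vec, proto) >= threshold
    )


# Toy support set: a few embedded sentences per aspect category.
support = {
    "food": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "service": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
    "price": [[0.0, 0.0, 1.0]],
}
prototypes = {label: mean_vector(vecs) for label, vecs in support.items()}

# A sentence that mentions both food and service, but not price.
labels = predict_labels([0.7, 0.7, 0.0], prototypes)
print(labels)
```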

category, representation, sentence representation, (15 more...)

arXiv.org Artificial Intelligence

2407.20673

Country:

Asia > China > Liaoning Province > Shenyang (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.73)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.54)

Unleash the Power of Ellipsis: Accuracy-enhanced Sparse Vector Technique with Exponential Noise

Liu, Yuhan, Wang, Sheng, Liu, Yixuan, Li, Feifei, Chen, Hong

arXiv.org Artificial Intelligence Jul-29-2024

The Sparse Vector Technique (SVT) is one of the most fundamental tools in differential privacy (DP). It works as a backbone for adaptive data analysis by answering a sequence of queries on a given dataset, and gleaning useful information in a privacy-preserving manner. Unlike the typical private query releases that directly publicize the noisy query results, SVT is less informative -- it keeps the noisy query results to itself and only reveals a binary bit for each query, indicating whether the query result surpasses a predefined threshold. To provide a rigorous DP guarantee for SVT, prior works in the literature adopt a conservative privacy analysis by assuming the direct disclosure of noisy query results as in typical private query releases. This approach, however, hinders SVT from achieving higher query accuracy due to an overestimation of the privacy risks, which in turn leads to excessive noise injection when Laplace or Gaussian noise is used for perturbation. Motivated by this, we provide a new privacy analysis for SVT by considering its less informative nature. Our analysis results not only broaden the range of applicable noise types for perturbation in SVT, but also identify the exponential noise as optimal among all evaluated noises (which, however, is usually deemed non-applicable in prior works). The main challenge in applying exponential noise to SVT is mitigating the sub-optimal performance due to the bias introduced by noise distributions. To address this, we develop a utility-oriented optimal threshold correction method and an appending strategy, which enhance the performance of SVT by increasing the precision and recall, respectively. The effectiveness of our proposed methods is substantiated both theoretically and empirically, demonstrating significant improvements of up to 50% across evaluated metrics.
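For readers unfamiliar with SVT, the following is a minimal sketch of the textbook Laplace-noise variant (often called AboveThreshold) for sensitivity-1 queries, which stops at the first query reported above the threshold. It is not the exponential-noise variant or the threshold correction proposed in the paper; the function names and the choice to sample Laplace noise as a difference of two exponentials are assumptions of this example.

```python
import random


def laplace(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def above_threshold(answers, threshold, epsilon, rng):
    """Textbook AboveThreshold for sensitivity-1 queries.

    Only one bit per query is revealed; the noisy answers themselves are
    never returned. The scan halts after the first query reported as above
    the (noisy) threshold.
    """
    noisy_threshold = threshold + laplace(rng, 2.0 / epsilon)
    bits = []
    for answer in answers:
        if answer + laplace(rng, 4.0 / epsilon) >= noisy_threshold:
            bits.append(True)
            break
        bits.append(False)
    return bits


# True query answers, e.g. counts over a private dataset.
true_answers = [1, 2, 10, 3]
bits = above_threshold(true_answers, threshold=5, epsilon=1.0, rng=random.Random(0))
print(bits)
```

With a small privacy budget the output bits can differ from the noise-free comparison, which is the accuracy loss the abstract attributes to conservative noise calibration.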

correction, mean correction, svt-exp, (15 more...)

arXiv.org Artificial Intelligence

2407.20068

Country: Asia > China (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.88)

Generative Retrieval with Preference Optimization for E-commerce Search

Li, Mingming, Wang, Huimu, Chen, Zuxu, Nie, Guangtao, Qiu, Yiming, Wang, Binbin, Tang, Guoyu, Liu, Lin, Zhuo, Jingwei

arXiv.org Artificial Intelligence Jul-29-2024

Generative retrieval introduces a groundbreaking paradigm to document retrieval by directly generating the identifier of a pertinent document in response to a specific query. This paradigm has demonstrated considerable benefits and potential, particularly in representation and generalization capabilities, within the context of large language models. However, it faces significant challenges in E-commerce search scenarios, including the complexity of generating detailed item titles from brief queries, the presence of noise in item titles with weak language order, issues with long-tail queries, and the interpretability of results. To address these challenges, we have developed an innovative framework for E-commerce search, called generative retrieval with preference optimization. This framework is designed to effectively learn and align an autoregressive model with target data, subsequently generating the final item through constraint-based beam search. By employing multi-span identifiers to represent raw item titles and transforming the task of generating titles from queries into the task of generating multi-span identifiers from queries, we aim to simplify the generation process. The framework further aligns with human preferences using click data and employs a constrained search method to identify key spans for retrieving the final item, thereby enhancing result interpretability. Our extensive experiments show that this framework achieves competitive performance on a real-world dataset, and online A/B tests demonstrate the superiority and effectiveness in improving conversion gains.
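One common way to constrain generation to valid identifiers is to walk a prefix trie during beam search, so that every hypothesis is always a prefix of some real item. The sketch below shows that generic technique on a toy catalogue; the trie layout, the `toy_log_prob` scorer standing in for an autoregressive model, and the token-level identifiers are assumptions for illustration and may differ from the multi-span identifiers and constraint method used in the paper.

```python
import math


def build_trie(sequences):
    """Build a nested-dict prefix trie; the key None marks a complete identifier."""
    trie = {}
    for seq in sequences:
        node = trie
        for token in seq:
            node = node.setdefault(token, {})
        node[None] = {}
    return trie


def constrained_beam_search(trie, log_prob, beam_width=2):
    """Beam search that only expands tokens allowed by the trie."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-probability, trie node)
    finished = []
    while beams:
        candidates = []
        for prefix, score, node in beams:
            for token, child in node.items():
                if token is None:
                    finished.append((prefix, score))  # a complete, valid identifier
                else:
                    new_score = score + log_prob(prefix, token)
                    candidates.append((prefix + (token,), new_score, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return sorted(finished, key=lambda f: f[1], reverse=True)


# Toy scorer: a fixed unigram table instead of a trained model.
UNIGRAM = {"red": 0.6, "blue": 0.4, "shoes": 0.7, "dress": 0.3}


def toy_log_prob(prefix, token):
    return math.log(UNIGRAM[token])


catalogue = [("red", "shoes"), ("red", "dress"), ("blue", "shoes")]
results = constrained_beam_search(build_trie(catalogue), toy_log_prob, beam_width=2)
for identifier, score in results:
    print(identifier, round(score, 3))
```

Because expansion never leaves the trie, the search cannot produce an identifier that does not correspond to a catalogue item, whatever the scorer prefers.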

identifier, query, retrieval, (11 more...)

arXiv.org Artificial Intelligence

2407.19829

Country:

Asia > China > Beijing > Beijing (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Services > e-Commerce Services (0.94)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.71)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)

Aligning Query Representation with Rewritten Query and Relevance Judgments in Conversational Search

Mo, Fengran, Qu, Chen, Mao, Kelong, Wu, Yihong, Su, Zhan, Huang, Kaiyu, Nie, Jian-Yun

arXiv.org Artificial Intelligence Jul-29-2024

Conversational search supports multi-turn user-system interactions to solve complex information needs. Different from the traditional single-turn ad-hoc search, conversational search encounters a more challenging problem of context-dependent query understanding with the lengthy and long-tail conversational history context. While conversational query rewriting methods leverage explicit rewritten queries to train a rewriting model to transform the context-dependent query into a stand-alone search query, this is usually done without considering the quality of search results. Conversational dense retrieval methods use fine-tuning to improve a pre-trained ad-hoc query encoder, but they are limited by the conversational search data available for training. In this paper, we leverage both rewritten queries and relevance judgments in the conversational search data to train a better query representation model. The key idea is to align the query representation with those of rewritten queries and relevant documents. The proposed model -- Query Representation Alignment Conversational Dense Retriever, QRACDR, is tested on eight datasets, including various settings in conversational search and ad-hoc search. The results demonstrate the strong performance of QRACDR compared with state-of-the-art methods, and confirm the effectiveness of representation alignment.
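One plausible way to express the alignment idea as a training objective is to add a distance term that pulls the conversational query embedding toward the rewritten-query embedding, on top of a contrastive ranking term over relevant and non-relevant documents. The sketch below computes such a combined loss on plain Python lists; the specific terms (mean squared error plus a softmax contrastive loss), the `alpha` weight, and the function names are assumptions for illustration and are not taken from the QRACDR paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mse(u, v):
    """Mean squared error between two embeddings (the alignment term)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def contrastive(query, positive, negatives):
    """Negative log-likelihood of the relevant document under a softmax over scores."""
    pos = math.exp(dot(query, positive))
    total = pos + sum(math.exp(dot(query, neg)) for neg in negatives)
    return -math.log(pos / total)


def alignment_loss(query, rewrite, positive, negatives, alpha=1.0):
    """Ranking loss plus a weighted pull toward the rewritten-query embedding."""
    return contrastive(query, positive, negatives) + alpha * mse(query, rewrite)


rewrite_vec = [1.0, 0.0]      # embedding of the stand-alone rewritten query
relevant_doc = [1.0, 0.0]     # embedding of a judged-relevant document
irrelevant_docs = [[0.0, 1.0]]

aligned = alignment_loss([1.0, 0.0], rewrite_vec, relevant_doc, irrelevant_docs)
misaligned = alignment_loss([0.0, 1.0], rewrite_vec, relevant_doc, irrelevant_docs)
print(round(aligned, 4), round(misaligned, 4))
```

A query embedding that matches the rewrite and the relevant document receives the lower loss, which is the behavior an alignment objective of this kind is meant to reward.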

alignment, query, representation, (14 more...)

arXiv.org Artificial Intelligence

2407.20189

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > Idaho > Ada County > Boise (0.05)
North America > Canada > Quebec > Montreal (0.05)
(4 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)

Analyzing and reducing the synthetic-to-real transfer gap in Music Information Retrieval: the task of automatic drum transcription

Zehren, Mickaël, Alunno, Marco, Bientinesi, Paolo

arXiv.org Artificial Intelligence Jul-29-2024

Automatic drum transcription is a critical tool in Music Information Retrieval for extracting and analyzing the rhythm of a music track, but it is limited by the size of the datasets available for training. A popular way to increase the amount of data is to generate it synthetically from music scores rendered with virtual instruments. This method can produce a virtually infinite quantity of tracks, but empirical evidence shows that models trained on previously created synthetic datasets do not transfer well to real tracks. In this work, besides increasing the amount of data, we identify and evaluate three more strategies that practitioners can use to improve the realism of the generated data and, thus, narrow the synthetic-to-real transfer gap. To explore their efficacy, we used them to build a new synthetic dataset and then we measured how the performance of a model scales and, specifically, at what value it will stagnate when increasing the number of training tracks for different datasets. By doing this, we were able to show that the aforementioned strategies contribute to making our dataset the one with the most realistic data distribution and the lowest synthetic-to-real transfer gap among the synthetic datasets we evaluated. We conclude by highlighting the limits of training with infinite data in drum transcription and we show how they can be overcome.

dataset, generation procedure, instrument, (14 more...)

arXiv.org Artificial Intelligence

2407.19823

Country:

Europe > Sweden > Västerbotten County > Umeå (0.04)
Asia > China (0.04)
South America > Colombia > Antioquia Department > Medellín (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.61)

Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload

Ma, Limin, Pu, Ken, Zhu, Ying

arXiv.org Artificial Intelligence Jul-28-2024

This study presents a comparative analysis of a complex SQL benchmark, TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity compared to the other two benchmarks. This underscores the need for more intricate benchmarks to simulate realistic scenarios effectively. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research in the development of more sophisticated text-to-SQL benchmarks. We utilized 11 distinct large language models (LLMs) to generate SQL queries based on the query descriptions provided by the TPC-DS benchmark. The prompt engineering process incorporated both the query description as outlined in the TPC-DS specification and the database schema of TPC-DS. Our findings indicate that the current state-of-the-art generative AI models fall short in generating accurate decision-making queries. We conducted a comparison of the generated queries with the TPC-DS gold standard queries using a series of fuzzy structure matching techniques based on query features. The results demonstrated that the accuracy of the generated queries is insufficient for practical real-world application.
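To make the notion of structural complexity measures and feature-based fuzzy matching more concrete, the sketch below counts a few surface features of a SQL string (joins, additional SELECT keywords, aggregate calls) and compares a generated query with a gold query on those counts. The feature set, the regular expressions, and the `feature_match` score are assumptions chosen for illustration; the abstract does not list the measures the authors actually devised, and a production version would use a SQL parser rather than regexes.

```python
import re


def structural_features(sql):
    """Count a few surface-level structural features of a SQL string."""
    text = sql.upper()
    selects = len(re.findall(r"\bSELECT\b", text))
    return {
        "joins": len(re.findall(r"\bJOIN\b", text)),
        # Every SELECT beyond the first hints at a subquery, CTE, or set operation.
        "extra_selects": max(selects - 1, 0),
        "aggregates": len(re.findall(r"\b(?:COUNT|SUM|AVG|MIN|MAX)\s*\(", text)),
    }


def feature_match(gold_sql, generated_sql):
    """Fraction of structural features on which the two queries agree."""
    gold = structural_features(gold_sql)
    generated = structural_features(generated_sql)
    agree = sum(1 for name in gold if gold[name] == generated[name])
    return agree / len(gold)


gold_query = (
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON c.id = o.customer_id "
    "WHERE o.total > (SELECT AVG(total) FROM orders) "
    "GROUP BY c.name"
)
generated_query = "SELECT name FROM customers"

print(structural_features(gold_query))
print(feature_match(gold_query, generated_query))
```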

benchmark, query, sql query, (15 more...)

arXiv.org Artificial Intelligence

2407.19517

Country:

North America > Canada > Ontario (0.04)
Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.49)

The Morning After: OpenAI reveals its AI-powered search engine, SearchGPT

Engadget Jul-26-2024, 11:16:56 GMT

OpenAI announced a new AI-powered search engine prototype called SearchGPT. The company described SearchGPT as "a temporary prototype of new AI search features that give you fast and timely answers with clear and relevant sources." The company plans to test out the product with 10,000 initial users, then roll it into ChatGPT after gathering feedback. It's a spicy time to launch AI-powered search engines. Last month, Perplexity faced criticism for summarizing stories from Forbes and Wired without adequate attribution or backlinks to the publications.

ai-powered search engine, openai reveal, searchgpt, (5 more...)

Engadget

Industry: Media (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.73)

ChatSchema: A pipeline of extracting structured information with Large Multimodal Models based on schema

Wang, Fei, Zheng, Yuewen, Li, Qin, Wu, Jingyi, Li, Pengfei, Zhang, Luxia

arXiv.org Artificial Intelligence Jul-26-2024

Objective: This study introduces ChatSchema, an effective method for extracting and structuring information from unstructured data in medical paper reports using a combination of Large Multimodal Models (LMMs) and Optical Character Recognition (OCR) based on the schema. By integrating a predefined schema, we intend to enable LMMs to directly extract and standardize information according to the schema specifications, facilitating further data entry. Method: Our approach involves a two-stage process, including classification and extraction for categorizing report scenarios and structuring information. We established and annotated a dataset to verify the effectiveness of ChatSchema, and evaluated key extraction using precision, recall, F1-score, and accuracy metrics. Based on key extraction, we further assessed value extraction. We conducted ablation studies on two LMMs to illustrate the improvement of structured information extraction with different input modalities and methods. Result: We analyzed 100 medical reports from Peking University First Hospital and established a ground truth dataset with 2,945 key-value pairs. We evaluated ChatSchema using GPT-4o and Gemini 1.5 Pro and found a higher overall performance of GPT-4o. The results are as follows: For key extraction, key-precision was 98.6%, key-recall was 98.5%, and key-F1-score was 98.6%. For value extraction based on correct key extraction, the overall accuracy was 97.2%, precision was 95.8%, recall was 95.8%, and F1-score was 95.8%. An ablation study demonstrated that ChatSchema achieved significantly higher overall accuracy and overall F1-score of key-value extraction compared to the Baseline, with increases of 26.9% in overall accuracy and 27.4% in overall F1-score.
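The two-level evaluation described above (first keys, then values for correctly extracted keys) can be sketched with standard set-based precision, recall, and F1. The report fields, the exact-match comparison of values, and the function names below are hypothetical; the abstract does not specify how the authors match keys or normalize values.

```python
def key_metrics(gold, predicted):
    """Precision, recall, and F1 of extracted keys against ground-truth keys."""
    gold_keys, pred_keys = set(gold), set(predicted)
    true_positives = len(gold_keys & pred_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def value_accuracy(gold, predicted):
    """Share of correctly extracted keys whose value also matches exactly."""
    shared = set(gold) & set(predicted)
    if not shared:
        return 0.0
    return sum(1 for key in shared if gold[key] == predicted[key]) / len(shared)


# Hypothetical ground truth and model output for one report.
gold_report = {"name": "A", "age": "54", "diagnosis": "X", "date": "2024-01-01"}
model_output = {"name": "A", "age": "45", "diagnosis": "X", "ward": "3"}

precision, recall, f1 = key_metrics(gold_report, model_output)
accuracy = value_accuracy(gold_report, model_output)
print(precision, recall, f1, accuracy)
```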

chatschema, extraction, information, (15 more...)

arXiv.org Artificial Intelligence

2407.18716

Country:

Asia > China > Zhejiang Province > Hangzhou (0.05)
North America > United States > New York > New York County > New York City (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Health Care Technology > Medical Record (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)
