AITopics | Information Retrieval


Information Retrieval

Our accustomed systems for retrieving particular pieces of information no longer meet the needs of many people. Computerized databases have aided the searching of traditional indexes of print publications, but the process still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age when the information being sought is held in widely different formats, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.


Analyzing Hong Kong's Legal Judgments from a Computational Linguistics point-of-view

Sen, Sankalok

arXiv.org Artificial Intelligence, May-4-2023

Analysis and extraction of useful information from legal judgments using computational linguistics was one of the earliest problems posed in the domain of information retrieval. Presently, several commercial vendors automate such tasks. However, a crucial bottleneck arises in the form of exorbitant pricing and a lack of resources for analyzing judgments meted out by Hong Kong's legal system. This paper attempts to bridge this gap by providing several statistical, machine learning, deep learning and zero-shot learning based methods to effectively analyze legal judgments from Hong Kong's court system. The proposed methods consist of: (1) Citation Network Graph Generation, (2) PageRank Algorithm, (3) Keyword Analysis and Summarization, (4) Sentiment Polarity, and (5) Paragraph Classification, in order to extract key insights from individual judgments as well as groups of judgments. This would make the overall analysis of judgments in Hong Kong less tedious and more automated, so that insights can be extracted quickly using fast inferencing. We also provide an analysis of our results by benchmarking them using Large Language Models, making robust use of the HuggingFace ecosystem.
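Methods (1) and (2) in the abstract above can be illustrated with a small sketch. It assumes a citation network stored as a plain dictionary mapping each judgment to the judgments it cites, and it runs a textbook power-iteration PageRank. The function name, the damping value of 0.85, and the sample case IDs are illustrative assumptions, not details taken from the paper.

```python
def pagerank(citations, damping=0.85, tol=1e-10, max_iter=200):
    """Rank judgments by how often (and by whom) they are cited.

    `citations` maps a judgment ID to the list of judgment IDs it cites.
    All names and defaults here are illustrative assumptions.
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    if not nodes:
        return {}
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Judgments that cite nothing spread their rank uniformly.
        dangling = sum(rank[v] for v in nodes if not citations.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Every citing judgment splits its rank among the judgments it cites.
        for source, cited in citations.items():
            if cited:
                share = damping * rank[source] / len(cited)
                for target in cited:
                    new_rank[target] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical case IDs: two judgments cite a third one.
    graph = {"CASE-A": ["CASE-C"], "CASE-B": ["CASE-C"], "CASE-C": []}
    for case, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(case, round(score, 4))
```

In this toy graph the judgment cited by both others receives the highest score, which is the intuition behind using PageRank to surface influential precedents.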

information retrieval, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2305.02558

Country:

Asia > China > Hong Kong (1.00)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(4 more...)

Genre: Research Report > New Finding (0.55)

Industry:

Law (1.00)
Government > Regional Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)


Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding

Yoon, Susik, Lee, Dongha, Zhang, Yunyi, Han, Jiawei

arXiv.org Artificial Intelligence, May-4-2023

Unsupervised discovery of stories with correlated news articles in real-time helps people digest massive news streams without expensive human annotations. A common approach in existing studies of unsupervised online story discovery is to represent news articles with symbolic- or graph-based embedding and incrementally cluster them into stories. Recent large language models are expected to improve the embedding further, but a straightforward adoption of the models by indiscriminately encoding all information in articles is ineffective for dealing with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings.
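The general pattern the abstract describes, embedding each incoming article and incrementally clustering it into stories, can be sketched as follows. This is a deliberately simplified illustration and not USTORY itself: it assumes article embeddings are already available as plain lists of floats, and it uses a fixed cosine-similarity threshold where the paper uses theme- and time-aware embeddings with adaptive, novelty-aware clustering. The function names and the 0.8 threshold are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_stories(embeddings, threshold=0.8):
    """Assign each article, in arrival order, to a story index.

    An article joins the most similar existing story when the similarity
    reaches `threshold`; otherwise it is treated as novel and starts a new story.
    """
    story_sums = []  # running sum of member embeddings per story
    labels = []
    for vec in embeddings:
        best, best_sim = None, threshold
        for idx, total in enumerate(story_sums):
            # Cosine ignores scale, so the running sum stands in for the centroid.
            sim = cosine(vec, total)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            story_sums.append(list(vec))
            best = len(story_sums) - 1
        else:
            story_sums[best] = [s + v for s, v in zip(story_sums[best], vec)]
        labels.append(best)
    return labels


if __name__ == "__main__":
    # Toy 2-D embeddings: the first two articles are close, the third is not.
    print(assign_stories([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
```

A fixed threshold is the simplest possible novelty test; it is used here only to keep the sketch short.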

data mining, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2304.04099

Country:

Asia > Russia (0.47)
Europe > Ukraine (0.15)
North America > United States > California (0.15)
(9 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Government > Military (0.94)
Media > News (0.68)
Government > Regional Government > Asia Government (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)


Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents with Semantic-Oriented Hierarchical Graphs

Zhu, Fengbin, Wang, Chao, Feng, Fuli, Ren, Zifeng, Li, Moxin, Chua, Tat-Seng

arXiv.org Artificial Intelligence, May-4-2023

Discrete reasoning over table-text documents (e.g., financial reports) has gained increasing attention over the past two years. Existing works mostly simplify this challenge by manually selecting and transforming document pages into structured tables and paragraphs, hindering their practical application. In this work, we explore a more realistic problem setting in the form of TAT-DQA, i.e., answering a question over a visually-rich table-text document. Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability by harnessing the differences and correlations among different elements (e.g., quantities, dates) of the given question and document with Semantic-oriented hierarchical Graph structures. We conduct extensive experiments on the TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set, achieving a new state of the art.

information retrieval, machine learning, node, (19 more...)

arXiv.org Artificial Intelligence

2305.01938

Country:

North America > United States > New York (0.04)
North America > Dominican Republic (0.04)
Asia > Singapore (0.04)
Asia > China (0.04)

Genre:

Overview (0.67)
Research Report > New Finding (0.34)

Industry: Banking & Finance (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.48)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)


Natural language processing on customer note data

Hilditch, Andrew, Webb, David, Baca, Jozef, Armitage, Tom, Shardlow, Matthew, Appleby, Peter

arXiv.org Artificial Intelligence, May-3-2023

Automatic analysis of customer data is an area of interest to companies. Business-to-business (B2B) data is rarely studied in academia due to the sensitive nature of such information. Applying natural language processing can speed up the analysis of prohibitively large sets of data. This paper addresses this subject and applies sentiment analysis, topic modelling and keyword extraction to a B2B data set. We show that accurate sentiment can be extracted from the notes automatically and the notes can be sorted by relevance into different topics. We see that, without clear separation, topics can lack relevance to a business context.

large language model, machine learning, sentiment, (22 more...)

arXiv.org Artificial Intelligence

2305.02029

Country:

Europe > United Kingdom (0.14)
Asia > Middle East > Jordan (0.04)
Asia > Indonesia (0.04)

Genre:

Workflow (0.93)
Overview (0.67)
Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area (0.68)
Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
(2 more...)


DocILE Benchmark for Document Information Localization and Extraction

Šimsa, Štěpán, Šulc, Milan, Uřičář, Michal, Patel, Yash, Hamdi, Ahmed, Kocián, Matěj, Skalický, Matyáš, Matas, Jiří, Doucet, Antoine, Coustaty, Mickaël, Karatzas, Dimosthenis

arXiv.org Artificial Intelligence, May-3-2023

This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table Transformer, applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile. Keywords: Document AI, Information Extraction, Line Item Recognition, Business Documents, Intelligent Document Processing

data mining, information retrieval, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2302.05658

Country: Europe (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.96)
Information Technology > Data Science > Data Mining > Text Mining (0.76)
(2 more...)


Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables

Urban, Matthias, Binnig, Carsten

arXiv.org Artificial Intelligence, Apr-28-2023

In this paper, we propose Multi-Modal Databases (MMDBs), a new class of database systems that can seamlessly query text and tables using SQL. To enable seamless querying of textual data using SQL in an MMDB, we propose to extend relational databases with so-called multi-modal operators (MMOps), which build on recent advances in large language models such as GPT-3. The main idea of MMOps is that they allow text collections to be treated as tables without the need to manually transform the data. As we show in our evaluation, our MMDB prototype not only outperforms state-of-the-art approaches such as text-to-table in terms of accuracy and performance but also requires significantly less training data to fine-tune the model for an unseen text collection.

large language model, machine learning, question answering, (18 more...)

arXiv.org Artificial Intelligence

2304.13559

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)
(16 more...)

Genre:

Research Report (0.70)
Overview > Innovation (0.34)

Industry:

Health & Medicine (1.00)
Leisure & Entertainment > Sports (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.68)


Fluent answers from AI search engines are more likely to be wrong

New Scientist, Apr-27-2023, 17:00:38 GMT

If you think search engines powered by artificial intelligence, such as Microsoft's Bing Chat, are providing you with useful-sounding answers, it is more likely that they are wrong, researchers have found. "In these current systems, accuracy is inversely correlated with perceived utility," says Nelson Liu at Stanford University. "The things that look better end up being worse."

ai search engine, fluent answer, search engine, (1 more...)

New Scientist

Technology:

Information Technology > Information Management > Search (0.70)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)


Visual Diagrammatic Queries in ViziQuer: Overview and Implementation

Ovčiņņikova, Jūlija, Šostaks, Agris, Čerāns, Kārlis

arXiv.org Artificial Intelligence, Apr-27-2023

Knowledge graphs (KGs) have become an important data organization paradigm. The available textual query languages for information retrieval from KGs, such as SPARQL for RDF-structured data, do not provide means for involving non-technical experts in the data access process. Visual query formalisms, alongside form-based and natural language-based ones, offer means for easing user involvement in the data querying process. ViziQuer is a visual query notation and tool offering visual diagrammatic means for describing rich data queries, involving optional and negation constructs, as well as aggregation and subqueries. In this paper we review the visual ViziQuer notation from the end-user point of view and describe the conceptual and technical solutions (including an abstract syntax model, followed by a generation model for textual queries) that allow mapping of the visual diagrammatic query notation into the textual SPARQL language, thus enabling the execution of rich visual queries over the actual knowledge graphs. The described solutions demonstrate the viability of the model-based approach in translating a complex visual notation into a complex textual one; they serve as a semantics-by-implementation description of the ViziQuer language and provide building blocks for further services in the ViziQuer tool context.

artificial intelligence, information retrieval, natural language, (18 more...)

arXiv.org Artificial Intelligence

2304.14825

Country:

North America > Puerto Rico > Peñuelas > Peñuelas (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
North America > United States > Tennessee > Davidson County > Nashville (0.04)
(3 more...)

Genre:

Research Report (0.40)
Overview (0.34)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)


BiTimeBERT: Extending Pre-Trained Language Representations with Bi-Temporal Information

Wang, Jiexin, Jatowt, Adam, Yoshikawa, Masatoshi, Cai, Yi

arXiv.org Artificial Intelligence, Apr-27-2023

Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve the performance on time-related tasks. Compared with common pre-trained language models like BERT which utilize synchronic document collections (e.g., BookCorpus and Wikipedia) as the training corpora, we use long-span temporal news article collection for building word representations. We introduce BiTimeBERT, a novel language representation model trained on a temporal collection of news articles via two new pre-training tasks, which harnesses two distinct temporal signals to construct time-aware language representations. The experimental results show that BiTimeBERT consistently outperforms BERT and other existing pre-trained models with substantial gains on different downstream NLP tasks and ...

Temporal signals constitute significant features in various types of text documents such as news articles or biographies. They can be leveraged to understand chronology, causalities, developments, and ramifications of events, being helpful in a range of different NLP tasks. Utilizing temporal signals in information retrieval has received considerable attention recently, too. For example, researchers have addressed time-sensitive queries in search leading to the formation of a subset of Information Retrieval called Temporal Information Retrieval [8, 26] in which both query and document temporal aspects are of key concern. Event detection and ordering [14, 47], timeline summarization [2, 10, 36, 46, 50], event occurrence time prediction [54], temporal clustering [9], question answering [39, 52] and semantic change detection [41, 42] are other example tasks where utilizing temporal information has proven beneficial.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2204.13032

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
North America > United States > New York > New York County > New York City (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
(5 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Leisure & Entertainment > Sports (0.68)
Transportation > Passenger (0.46)
Transportation > Air (0.46)
Government > Voting & Elections (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)


Multivariate Representation Learning for Information Retrieval

Zamani, Hamed, Bendersky, Michael

arXiv.org Artificial Intelligence, Apr-27-2023

Dense retrieval models use bi-encoder network architectures for learning query and document representations. These representations are often in the form of a vector representation and their similarities are often computed using the dot product function. In this paper, we propose a new representation learning framework for dense retrieval. Instead of learning a vector for each query and document, our framework learns a multivariate distribution and uses negative multivariate KL divergence to compute the similarity between distributions. For simplicity and efficiency reasons, we assume that the distributions are multivariate normals and then train large language models to produce mean and variance vectors for these distributions. We provide a theoretical foundation for the proposed framework and show that it can be seamlessly integrated into the existing approximate nearest neighbor algorithms to perform retrieval efficiently. We conduct an extensive suite of experiments on a wide range of datasets, and demonstrate significant improvements compared to competitive dense retrieval models.
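The scoring idea in the abstract above, negative KL divergence between multivariate normals described by mean and variance vectors, has a closed form when the covariance is diagonal. The sketch below computes that closed form in plain Python. The function name, the argument layout, and the direction of the divergence (query relative to document) are assumptions made for illustration; the abstract does not specify them.

```python
import math


def negative_kl_score(q_mean, q_var, d_mean, d_var):
    """Return -KL(Q || D) for diagonal Gaussians Q (query) and D (document).

    Per dimension:
        KL = 0.5 * (log(var_d / var_q) + (var_q + (mu_q - mu_d)^2) / var_d - 1)
    All variances must be strictly positive. Higher scores mean more similar.
    """
    total = 0.0
    for mq, vq, md, vd in zip(q_mean, q_var, d_mean, d_var):
        total += math.log(vd / vq) + (vq + (mq - md) ** 2) / vd - 1.0
    return -0.5 * total


if __name__ == "__main__":
    query = ([0.0, 1.0], [1.0, 2.0])          # (mean vector, variance vector)
    same_doc = ([0.0, 1.0], [1.0, 2.0])
    far_doc = ([3.0, -2.0], [1.0, 2.0])
    print(negative_kl_score(*query, *same_doc))  # identical distributions score highest
    print(negative_kl_score(*query, *far_doc))
```

Because KL divergence is never negative, the score is at most zero and reaches zero only when the two distributions coincide, so ranking documents by this score puts the closest distributions first.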

information retrieval, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2304.14522

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.05)
Asia > Taiwan > Taiwan Province > Taipei (0.05)
(6 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
