 Information Retrieval


Analyzing Hong Kong's Legal Judgments from a Computational Linguistics point-of-view

arXiv.org Artificial Intelligence

Analysis and extraction of useful information from legal judgments using computational linguistics was one of the earliest problems posed in the domain of information retrieval. Presently, several commercial vendors exist who automate such tasks. However, a crucial bottleneck arises in the form of exorbitant pricing and a lack of resources for analyzing judgments meted out by Hong Kong's legal system. This paper attempts to bridge this gap by providing several statistical, machine learning, deep learning and zero-shot learning based methods to effectively analyze legal judgments from Hong Kong's court system. The proposed methods consist of: (1) Citation Network Graph Generation, (2) PageRank Algorithm, (3) Keyword Analysis and Summarization, (4) Sentiment Polarity, and (5) Paragraph Classification, in order to extract key insights from individual judgments as well as from groups of judgments. This would make the overall analysis of judgments in Hong Kong less tedious and more automated, allowing insights to be extracted quickly through fast inference. We also provide an analysis of our results by benchmarking them using Large Language Models, making robust use of the HuggingFace ecosystem.
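
To make methods (1) and (2) concrete, the following is a minimal sketch of PageRank by power iteration over a small citation graph. It is not the authors' implementation: the `pagerank` function, the damping factor of 0.85, and the four case names are illustrative assumptions.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {judgment: [judgments it cites]}."""
    # Collect every node, including judgments that are only ever cited.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by judgments that cite nothing is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each judgment passes its rank equally to the judgments it cites.
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical citation network: an edge A -> C means "A cites C".
citations = {
    "Case A": ["Case C"],
    "Case B": ["Case C"],
    "Case C": ["Case D"],
    "Case D": [],
}
scores = pagerank(citations)
for case, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{case}: {score:.4f}")
```

In this toy graph, Case D ranks highest because it is cited by Case C, which is itself cited twice; a judgment's score depends on who cites it, not only on how often it is cited.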


Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding

arXiv.org Artificial Intelligence

Unsupervised discovery of stories with correlated news articles in real-time helps people digest massive news streams without expensive human annotations. A common approach of the existing studies for unsupervised online story discovery is to represent news articles with symbolic- or graph-based embedding and incrementally cluster them into stories. Recent large language models are expected to improve the embedding further, but a straightforward adoption of the models by indiscriminately encoding all information in articles is ineffective for dealing with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings.


Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents with Semantic-Oriented Hierarchical Graphs

arXiv.org Artificial Intelligence

Discrete reasoning over table-text documents (e.g., financial reports) has gained increasing attention in the past two years. Existing works mostly simplify this challenge by manually selecting and transforming document pages to structured tables and paragraphs, hindering their practical application. In this work, we explore a more realistic problem setting in the form of TAT-DQA, i.e., answering a question over a visually-rich table-text document. Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability by harnessing the differences and correlations among different elements (e.g., quantities, dates) of the given question and document with Semantic-oriented hierarchical Graph structures. We conduct extensive experiments on the TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set, achieving a new state of the art.
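
For readers unfamiliar with the two reported metrics, here is a minimal sketch of Exact Match and a token-level F1 in the style commonly used for extractive question answering. This is an assumption for illustration: the official TAT-DQA scorer may normalize answers and handle numbers differently, and the function names below are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("12.5 million", "12.5 Million"))
print(token_f1("net revenue 2019", "revenue 2019"))
```

Exact Match rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why papers typically report both.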


Natural language processing on customer note data

arXiv.org Artificial Intelligence

Automatic analysis of customer data is an area of interest to companies. Business-to-business data is rarely studied in academia due to the sensitive nature of such information. Applying natural language processing can speed up the analysis of prohibitively large sets of data. This paper addresses this subject and applies sentiment analysis, topic modelling and keyword extraction to a B2B data set. We show that accurate sentiment can be extracted from the notes automatically and the notes can be sorted by relevance into different topics. We see that without clear separation, topics can lack relevance to a business context.


DocILE Benchmark for Document Information Localization and Extraction

arXiv.org Artificial Intelligence

This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table Transformer, applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile. Keywords: Document AI, Information Extraction, Line Item Recognition, Business Documents, Intelligent Document Processing


Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables

arXiv.org Artificial Intelligence

In this paper, we propose Multi-Modal Databases (MMDBs), a new class of database systems that can seamlessly query text and tables using SQL. To enable seamless querying of textual data using SQL in an MMDB, we propose to extend relational databases with so-called multi-modal operators (MMOps), which are based on the advances of recent large language models such as GPT-3. The main idea of MMOps is that they allow text collections to be treated as tables without the need to manually transform the data. As we show in our evaluation, our MMDB prototype not only outperforms state-of-the-art approaches such as text-to-table in terms of accuracy and performance but also requires significantly less training data to fine-tune the model for an unseen text collection.


Fluent answers from AI search engines are more likely to be wrong

New Scientist

If you think search engines powered by artificial intelligence, such as Microsoft's Bing Chat, are providing you with useful-sounding answers, it is more likely that they are wrong, researchers have found. "In these current systems, accuracy is inversely correlated with perceived utility," says Nelson Liu at Stanford University. "The things that look better end up being worse."


Visual Diagrammatic Queries in ViziQuer: Overview and Implementation

arXiv.org Artificial Intelligence

Knowledge graphs (KGs) have become an important data organization paradigm. The available textual query languages for information retrieval from KGs, such as SPARQL for RDF-structured data, do not provide means for involving non-technical experts in the data access process. Visual query formalisms, alongside form-based and natural language-based ones, offer means for easing user involvement in the data querying process. ViziQuer is a visual query notation and tool offering visual diagrammatic means for describing rich data queries, involving optional and negation constructs, as well as aggregation and subqueries. In this paper we review the visual ViziQuer notation from the end-user point of view and describe the conceptual and technical solutions (including an abstract syntax model, followed by a generation model for textual queries) that allow mapping of the visual diagrammatic query notation into the textual SPARQL language, thus enabling the execution of rich visual queries over the actual knowledge graphs. The described solutions demonstrate the viability of the model-based approach in translating a complex visual notation into a complex textual one; they serve as a semantics-by-implementation description of the ViziQuer language and provide building blocks for further services in the ViziQuer tool context.


BiTimeBERT: Extending Pre-Trained Language Representations with Bi-Temporal Information

arXiv.org Artificial Intelligence

Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve the performance on time-related tasks. Compared with common pre-trained language models like BERT which utilize synchronic document collections (e.g., BookCorpus and Wikipedia) as the training corpora, we use long-span temporal news article collection for building word representations. We introduce BiTimeBERT, a novel language representation model trained on a temporal collection of news articles via two new pre-training tasks, which harnesses two distinct temporal signals to construct time-aware language representations. The experimental results show that BiTimeBERT consistently outperforms BERT and other existing pre-trained models with substantial gains on different downstream NLP tasks and …

Temporal signals constitute significant features in various types of text documents such as news articles or biographies. They can be leveraged to understand chronology, causalities, developments, and ramifications of events, being helpful in a range of different NLP tasks. Utilizing temporal signals in information retrieval has received considerable attention recently, too. For example, researchers have addressed time-sensitive queries in search leading to the formation of a subset of Information Retrieval called Temporal Information Retrieval [8, 26] in which both query and document temporal aspects are of key concern. Event detection and ordering [14, 47], timeline summarization [2, 10, 36, 46, 50], event occurrence time prediction [54], temporal clustering [9], question answering [39, 52] and semantic change detection [41, 42] are other example tasks where utilizing temporal information has proven beneficial.


Multivariate Representation Learning for Information Retrieval

arXiv.org Artificial Intelligence

Dense retrieval models use bi-encoder network architectures for learning query and document representations. These representations are often in the form of a vector representation and their similarities are often computed using the dot product function. In this paper, we propose a new representation learning framework for dense retrieval. Instead of learning a vector for each query and document, our framework learns a multivariate distribution and uses negative multivariate KL divergence to compute the similarity between distributions. For simplicity and efficiency reasons, we assume that the distributions are multivariate normals and then train large language models to produce mean and variance vectors for these distributions. We provide a theoretical foundation for the proposed framework and show that it can be seamlessly integrated into the existing approximate nearest neighbor algorithms to perform retrieval efficiently. We conduct an extensive suite of experiments on a wide range of datasets, and demonstrate significant improvements compared to competitive dense retrieval models.
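
As a rough illustration of the similarity function described above, the sketch below scores a document against a query using the negative KL divergence between two multivariate normals with diagonal covariance. The diagonal covariance follows from the abstract's "mean and variance vectors"; the direction KL(query || document), the function name, and the toy vectors are assumptions, not details taken from the paper.

```python
import math


def neg_kl_diag_gaussians(mu_q, var_q, mu_d, var_d):
    """Return -KL(N(mu_q, diag(var_q)) || N(mu_d, diag(var_d))).

    For diagonal Gaussians the KL divergence is a sum over dimensions of
    0.5 * (log(var_d / var_q) + (var_q + (mu_q - mu_d)^2) / var_d - 1).
    """
    kl = 0.0
    for mq, vq, md, vd in zip(mu_q, var_q, mu_d, var_d):
        kl += 0.5 * (math.log(vd / vq) + (vq + (mq - md) ** 2) / vd - 1.0)
    return -kl


# Hypothetical 2-dimensional representations.
query = ([0.0, 0.0], [1.0, 1.0])      # (mean vector, variance vector)
doc_near = ([0.1, 0.0], [1.0, 1.0])
doc_far = ([2.0, 2.0], [1.0, 1.0])

print(neg_kl_diag_gaussians(*query, *doc_near))
print(neg_kl_diag_gaussians(*query, *doc_far))
```

The score is at most zero and reaches zero only when the two distributions coincide, so a higher (less negative) value indicates a closer match, mirroring how a larger dot product does for vector representations.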