Information Retrieval
NeuroQL: A Neuro-Symbolic Language and Dataset for Inter-Subjective Reasoning
We present a new AI task and baseline solution for Inter-Subjective Reasoning. We define inter-subjective information, to be a mixture of objective and subjective information possibly shared by different parties. Examples may include commodities and their objective properties as reported by IR (Information Retrieval) systems, that need to be cross-referenced with subjective user reviews from an online forum. For an AI system to successfully reason about both, it needs to be able to combine symbolic reasoning of objective facts with the shared consensus found on subjective user reviews. To this end we introduce the NeuroQL dataset and DSL (Domain-specific Language) as a baseline solution for this problem. NeuroQL is a neuro-symbolic language that extends logical unification with neural primitives for extraction and retrieval. It can function as a target for automatic translation of inter-subjective questions (posed in natural language) into the neuro-symbolic code that can answer them.
Unifying Vision, Text, and Layout for Universal Document Processing
Tang, Zineng, Yang, Ziyi, Wang, Guoxin, Fang, Yuwei, Liu, Yang, Zhu, Chenguang, Zeng, Michael, Zhang, Cha, Bansal, Mohit
We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark.
A Framework for Combining Entity Resolution and Query Answering in Knowledge Bases
Fagin, Ronald, Kolaitis, Phokion G., Lembo, Domenico, Popa, Lucian, Scafoglieri, Federico
We propose a new framework for combining entity resolution and query answering in knowledge bases (KBs) with tuple-generating dependencies (tgds) and equality-generating dependencies (egds) as rules. We define the semantics of the KB in terms of special instances that involve equivalence classes of entities and sets of values. Intuitively, the former collect all entities denoting the same real-world object, while the latter collect all alternative values for an attribute. This approach allows us to both resolve entities and bypass possible inconsistencies in the data. We then design a chase procedure that is tailored to this new framework and has the feature that it never fails; moreover, when the chase procedure terminates, it produces a universal solution, which in turn can be used to obtain the certain answers to conjunctive queries. We finally discuss challenges arising when the chase does not terminate.
ALIST: Associative Logic for Inference, Storage and Transfer. A Lingua Franca for Inference on the Web
Recent developments in support for constructing knowledge graphs have led to a rapid rise in their creation both on the Web and within organisations. Added to existing sources of data, including relational databases, APIs, etc., there is a strong demand for techniques to query these diverse sources of knowledge. While formal query languages, such as SPARQL, exist for querying some knowledge graphs, users are required to know which knowledge graphs they need to query and the unique resource identifiers of the resources they need. Although alternative techniques in neural information retrieval embed the content of knowledge graphs in vector spaces, they fail to provide the representation and query expressivity needed (e.g. inability to handle non-trivial aggregation functions such as regression). We believe that a lingua franca, i.e. a formalism, that enables such representational flexibility will increase the ability of intelligent automated agents to combine diverse data sources by inference. Our work proposes a flexible representation (alists) to support intelligent federated querying of diverse knowledge sources. Our contribution includes (1) a formalism that abstracts the representation of queries from the specific query language of a knowledge graph; (2) a representation to dynamically curate data and functions (operations) to perform non-trivial inference over diverse knowledge sources; (3) a demonstration of the expressiveness of alists to represent the diversity of representational formalisms, including SPARQL queries, and more generally first-order logic expressions.
Another Generic Setting for Entity Resolution: Basic Theory
Guo, Xiuzhan, Berrill, Arthur, Kulkarni, Ajinkya, Belezko, Kostya, Luo, Min
Benjelloun et al. \cite{BGSWW} considered the Entity Resolution (ER) problem as the generic process of matching and merging entity records judged to represent the same real world object. They treated the functions for matching and merging entity records as black-boxes and introduced four important properties that enable efficient generic ER algorithms. In this paper, we shall study the properties which match and merge functions share, model matching and merging black-boxes for ER in a partial groupoid, based on the properties that match and merge functions satisfy, and show that a partial groupoid provides another generic setting for ER. The natural partial order on a partial groupoid is defined when the partial groupoid satisfies Idempotence and Catenary associativity. Given a partial order on a partial groupoid, the least upper bound and compatibility ($LU_{pg}$ and $CP_{pg}$) properties are equivalent to Idempotence, Commutativity, Associativity, and Representativity and the partial order must be the natural one we defined when the domain of the partial operation is reflexive. The partiality of a partial groupoid can be reduced using connected components and clique covers of its domain graph, and a noncommutative partial groupoid can be mapped to a commutative one homomorphically if it has the partial idempotent semigroup like structures. In a finitely generated partial groupoid $(P,D,\circ)$ without any conditions required, the ER we concern is the full elements in $P$. If $(P,D,\circ)$ satisfies Idempotence and Catenary associativity, then the ER is the maximal elements in $P$, which are full elements and form the ER defined in \cite{BGSWW}. Furthermore, in the case, since there is a transitive binary order, we consider ER as ``sorting, selecting, and querying the elements in a finitely generated partial groupoid."
A Theoretical Analysis Of Nearest Neighbor Search On Approximate Near Neighbor Graph
Shrivastava, Anshumali, Song, Zhao, Xu, Zhaozhuo
Graph-based algorithms have demonstrated state-of-the-art performance in the nearest neighbor search (NN-Search) problem. These empirical successes urge the need for theoretical results that guarantee the search quality and efficiency of these algorithms. However, there exists a practice-to-theory gap in the graph-based NN-Search algorithms. Current theoretical literature focuses on greedy search on exact near neighbor graph while practitioners use approximate near neighbor graph (ANN-Graph) to reduce the preprocessing time. This work bridges this gap by presenting the theoretical guarantees of solving NN-Search via greedy search on ANN-Graph for low dimensional and dense vectors. To build this bridge, we leverage several novel tools from computational geometry. Our results provide quantification of the trade-offs associated with the approximation while building a near neighbor graph. We hope our results will open the door for more provable efficient graph-based NN-Search algorithms.
A Survey on Event-based News Narrative Extraction
Norambuena, Brian Keith, Mitra, Tanushree, North, Chris
Narratives are fundamental to our understanding of the world, providing us with a natural structure for knowledge representation over time. Computational narrative extraction is a subfield of artificial intelligence that makes heavy use of information retrieval and natural language processing techniques. Despite the importance of computational narrative extraction, relatively little scholarly work exists on synthesizing previous research and strategizing future research in the area. In particular, this article focuses on extracting news narratives from an event-centric perspective. Extracting narratives from news data has multiple applications in understanding the evolving information landscape. This survey presents an extensive study of research in the area of event-based news narrative extraction. In particular, we screened over 900 articles that yielded 54 relevant articles. These articles are synthesized and organized by representation model, extraction criteria, and evaluation approaches. Based on the reviewed studies, we identify recent trends, open challenges, and potential research lines.
Malvertising in Google search results delivering stealers
In recent months, we observed an increase in the number of malicious campaigns that use Google Advertising as a means of distributing and delivering malware. At least two different stealers, Rhadamanthys and RedLine, were abusing the search engine promotion plan in order to deliver malicious payloads to victims' machines. They seem to use the same technique of mimicking a website associated with well-known software like Notepad and Blender 3D. The treat actors create copies of legit software websites while employing typosquatting (exploiting incorrectly spelled popular brands and company names as URLs) or combosquatting (using popular brands and company names combined with arbitrary words as URLs) to make the sites look like the real thing to the end user--the domain names allude to the original software or vendor. The design and the content of the fake web pages look the same as those of the original ones.
GlobalNER: Incorporating Non-local Information into Named Entity Recognition
Nowadays, many Natural Language Processing (NLP) tasks see the demand for incorporating knowledge external to the local information to further improve the performance. However, there is little related work on Named Entity Recognition (NER), which is one of the foundations of NLP. Specifically, no studies were conducted on the query generation and re-ranking for retrieving the related information for the purpose of improving NER. This work demonstrates the effectiveness of a DNN-based query generation method and a mention-aware re-ranking architecture based on BERTScore particularly for NER. In the end, a state-of-the-art performance of 61.56 micro-f1 score on WNUT17 dataset is achieved.
Implementation of a noisy hyperlink removal system: A semantic and relatedness approach
Taghandiki, Kazem, Ehsan, Elnaz Rezaei
As the volume of data on the web grows, the web structure graph, which is a graph representation of the web, continues to evolve. The structure of this graph has gradually shifted from content-based to non-content-based. Furthermore, spam data, such as noisy hyperlinks, in the web structure graph adversely affect the speed and efficiency of information retrieval and link mining algorithms. Previous works in this area have focused on removing noisy hyperlinks using structural and string approaches. However, these approaches may incorrectly remove useful links or be unable to detect noisy hyperlinks in certain circumstances. In this paper, a data collection of hyperlinks is initially constructed using an interactive crawler. The semantic and relatedness structure of the hyperlinks is then studied through semantic web approaches and tools such as the DBpedia ontology. Finally, the removal process of noisy hyperlinks is carried out using a reasoner on the DBpedia ontology. Our experiments demonstrate the accuracy and ability of semantic web technologies to remove noisy hyperlinks