Most of the existing techniques to product discovery rely on syntactic approaches, thus ignoring valuable and specific semantic information of the underlying standards during the process. The product data comes from different heterogeneous sources and formats giving rise to the problem of interoperability. Above all, due to the continuously increasing influx of data, the manual labeling is getting costlier. Integrating the descriptions of different products into a single representation requires organizing all the products across vendors in a single taxonomy. Practically relevant and quality product categorization standards are still limited in number; and that too in academic research projects where we can majorly see only prototypes as compared to industry. This work presents a cost-effective solution for e-commerce on the Data Web by employing an unsupervised approach for data classification and exploiting the knowledge graphs for matching. The proposed architecture describes available products in web ontology language OWL and stores them in a triple store. User input specifications for certain products are matched against the available product categories to generate a knowledge graph. This mullti-phased top-down approach to develop and improve existing, if any, tailored product recommendations will be able to connect users with the exact product/service of their choice.
As an example, datasets are very often accompanied by metadata tags that include some signal about the nature and content Task specific fine-tuning of a pre-trained neural language model of each document in the corpus. Corporate communications, financial using a custom softmax output layer is the de facto approach of late reports, regulatory disclosures, policy guidelines, Wikipedia when dealing with document classification problems. This technique entries, social media messages, and many other forms of textual is not adequate when labeled examples are not available at records often bear tags indicating their source, type, purpose, or an training time and when the metadata artifacts in a document must enterprise categorization standard. This metadata is often stripped be exploited. We address these challenges by generating document before models are applied to the text, in order to avoid convoluted representations that capture both text and metadata artifacts in a and bespoke architectures. In some cases, the metadata is simply task agnostic manner. Instead of traditional auto-regressive or autoencoding concatenated to the document  without specific controls on based training, our novel self-supervised approach learns how representations are generated from raw text versus metadata a soft-partition of the input space when generating text embeddings.
Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on tasks like ER require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful transformer models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time. Code is available at https://github.com/ArjitJ/DIAL
Semantic representation and inference is essential for Natural Language Processing (NLP). The state of the art for semantic representation and inference is deep learning, and particularly Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and transformer Self-Attention models. This thesis investigates the use of deep learning for novel semantic representation and inference, and makes contributions in the following three areas: creating training data, improving semantic representations and extending inference learning. In terms of creating training data, we contribute the largest publicly available dataset of real-life factual claims for the purpose of automatic claim verification (MultiFC), and we present a novel inference model composed of multi-scale CNNs with different kernel sizes that learn from external sources to infer fact checking labels. In terms of improving semantic representations, we contribute a novel model that captures non-compositional semantic indicators. By definition, the meaning of a non-compositional phrase cannot be inferred from the individual meanings of its composing words (e.g., hot dog). Motivated by this, we operationalize the compositionality of a phrase contextually by enriching the phrase representation with external word embeddings and knowledge graphs. Finally, in terms of inference learning, we propose a series of novel deep learning architectures that improve inference by using syntactic dependencies, by ensembling role guided attention heads, incorporating gating layers, and concatenating multiple heads in novel and effective ways. This thesis consists of seven publications (five published and two under review).
Entity resolution is a widely studied problem with several proposals to match records across relations. Matching textual content is a widespread task in many applications, such as question answering and search. While recent methods achieve promising results for these two tasks, there is no clear solution for the more general problem of matching textual content and structured data. We introduce a framework that supports this new task in an unsupervised setting for any pair of corpora, being relational tables or text documents. Our method builds a fine-grained graph over the content of the corpora and derives word embeddings to represent the objects to match in a low dimensional space. The learned representation enables effective and efficient matching at different granularity, from relational tuples to text sentences and paragraphs. Our flexible framework can exploit pre-trained resources, but it does not depends on their existence and achieves better quality performance in matching content when the vocabulary is domain specific. We also introduce optimizations in the graph creation process with an "expand and compress" approach that first identifies new valid relationships across elements, to improve matching, and then prunes nodes and edges, to reduce the graph size. Experiments on real use cases and public datasets show that our framework produces embeddings that outperform word embeddings and fine-tuned language models both in results' quality and in execution times.