
Collaborating Authors: Nandi, Arnab


Noise-Aware Training of Layout-Aware Language Models

arXiv.org Artificial Intelligence

A visually rich document (VRD) utilizes visual features along with linguistic cues to disseminate information. Training a custom extractor that identifies named entities from a document requires a large number of instances of the target document type annotated in both the textual and visual modalities. This is an expensive bottleneck in enterprise scenarios, where we want to train custom extractors for thousands of different document types in a scalable way. Pre-training an extractor model on unlabeled instances of the target document type, followed by a fine-tuning step on human-labeled instances, does not work in these scenarios, as it exceeds the maximum allowable training time allocated for the extractor. We address this scenario by proposing a Noise-Aware Training method, or NAT, in this paper. Instead of acquiring expensive human-labeled documents, NAT utilizes weakly labeled documents to train an extractor in a scalable way. To avoid degrading the model's quality due to noisy, weakly labeled samples, NAT estimates the confidence of each training sample and incorporates it as an uncertainty measure during training. We train multiple state-of-the-art extractor models using NAT. Experiments on a number of publicly available and in-house datasets show that NAT-trained models are not only robust in performance, outperforming a transfer-learning baseline by up to 6% in macro-F1 score, but also more label-efficient, reducing the human effort required to obtain comparable performance by up to 73%.
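The abstract's central training idea, down-weighting weakly labeled samples by an estimated confidence, can be illustrated with a minimal sketch. The function below is an assumption-laden illustration rather than the authors' code: it assumes each weak label arrives with a confidence in [0, 1] (how NAT estimates that confidence is described in the paper, not here) and simply uses it to scale a per-sample cross-entropy loss in PyTorch.

    import torch.nn.functional as F

    def noise_aware_loss(logits, weak_labels, confidences):
        # logits:      (batch, num_classes) extractor outputs
        # weak_labels: (batch,) labels produced by the weak labeling source
        # confidences: (batch,) estimated probability that each weak label is correct
        per_sample = F.cross_entropy(logits, weak_labels, reduction="none")
        # Low-confidence (likely noisy) samples contribute less to the gradient.
        return (confidences * per_sample).sum() / confidences.sum().clamp(min=1e-8)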


Cross-Modal Entity Matching for Visually Rich Documents

arXiv.org Artificial Intelligence

Visually rich documents (VRDs) are physical or digital documents that utilize visual cues to augment their semantics. The information contained in these documents is often incomplete. Existing works that enable automated querying on VRDs do not take this aspect into account and consequently support only a limited set of queries. In this paper, we describe Juno, a multimodal framework that identifies a set of tuples from a relational database to augment an incomplete VRD with supplementary information. Our main contribution is an end-to-end trainable neural network with bi-directional attention that executes this cross-modal entity matching task without any prior knowledge about the document type or the underlying database schema. Exhaustive experiments on two heterogeneous datasets show that Juno outperforms state-of-the-art baselines by more than 6% in F1-score, while reducing the amount of human effort in its workflow by more than 80%. To the best of our knowledge, ours is the first work that investigates the incompleteness of VRDs and proposes a robust framework to address it in a seamless way.
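The bi-directional attention component described above can be sketched at a high level. The module below is a hypothetical illustration, not Juno's actual architecture: the dimensionality, the use of nn.MultiheadAttention, and the mean-pooled scoring head are assumptions made only to keep the example self-contained.

    import torch
    import torch.nn as nn

    class CrossModalMatcher(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            # Attend from document tokens to tuple cells and vice versa.
            self.doc_to_tuple = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.tuple_to_doc = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.scorer = nn.Linear(2 * dim, 1)

        def forward(self, doc_tokens, tuple_cells):
            # doc_tokens:  (batch, n_tokens, dim) embeddings of the VRD's text and layout
            # tuple_cells: (batch, n_cells, dim) embeddings of one candidate tuple's cells
            d2t, _ = self.doc_to_tuple(doc_tokens, tuple_cells, tuple_cells)
            t2d, _ = self.tuple_to_doc(tuple_cells, doc_tokens, doc_tokens)
            pooled = torch.cat([d2t.mean(dim=1), t2d.mean(dim=1)], dim=-1)
            return self.scorer(pooled).squeeze(-1)  # higher score = better match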


Discovery of Driving Patterns by Trajectory Segmentation

arXiv.org Artificial Intelligence

Telematics data is becoming increasingly available due to the ubiquity of devices that collect data during drives for different purposes, such as usage-based insurance (UBI), fleet management, and navigation of connected vehicles. Consequently, a variety of data-analytic applications that extract valuable insights from the data have become feasible. In this paper, we address the especially challenging problem of discovering behavior-based driving patterns from only externally observable phenomena (e.g., the vehicle's speed). We present a trajectory segmentation approach capable of discovering driving patterns as separate segments, based on the behavior of drivers. This segmentation approach combines a novel transformation of trajectories with a dynamic programming approach for segmentation. We apply the segmentation approach on a real-world, rich dataset of personal car trajectories provided by a major insurance company based in Columbus, Ohio. Analysis and preliminary results show the applicability of our approach for finding significant driving patterns.
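The dynamic programming step lends itself to a compact illustration. The sketch below is not the paper's algorithm: it assumes the trajectory has already been transformed into a one-dimensional behavior signal and that the number of segments k is given, and it finds the k-way contiguous split that minimizes within-segment variance.

    import numpy as np

    def segment(signal, k):
        # Split signal[0:n] into k contiguous segments minimizing within-segment variance.
        n = len(signal)
        prefix = np.concatenate([[0.0], np.cumsum(signal)])
        prefix_sq = np.concatenate([[0.0], np.cumsum(np.square(signal))])

        def cost(i, j):
            # Sum of squared deviations of signal[i:j] from its own mean.
            s, sq, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
            return sq - s * s / m

        dp = np.full((k + 1, n + 1), np.inf)   # dp[s][j]: best cost of s segments over signal[:j]
        cut = np.zeros((k + 1, n + 1), dtype=int)
        dp[0][0] = 0.0
        for s in range(1, k + 1):
            for j in range(s, n + 1):
                for i in range(s - 1, j):
                    c = dp[s - 1][i] + cost(i, j)
                    if c < dp[s][j]:
                        dp[s][j], cut[s][j] = c, i
        # Walk the cut table backwards to recover segment end indices.
        bounds, j = [], n
        for s in range(k, 0, -1):
            bounds.append(j)
            j = cut[s][j]
        return sorted(bounds)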


Surpassing Humans and Computers with JELLYBEAN: Crowd-Vision-Hybrid Counting Algorithms

AAAI Conferences

Counting objects is a fundamental image processing primitive, and has many scientific, health, surveillance, security, and military applications. Existing supervised computer vision techniques typically require large quantities of labeled training data and, even then, fail to return accurate results in all but the most stylized settings. Using vanilla crowdsourcing, on the other hand, can lead to significant errors, especially on images with many objects. In this paper, we present our JellyBean suite of algorithms, which combines the best of crowds and computer vision to count objects in images, and uses judicious decomposition of images to greatly improve accuracy at low cost. Our algorithms have several desirable properties: (i) they are theoretically optimal or near-optimal, in that they ask as few questions of humans as possible (under certain intuitively reasonable assumptions that we justify experimentally in our paper); (ii) they operate in stand-alone or hybrid modes, in that they can either work independently of computer vision algorithms or work in concert with them, depending on whether the computer vision techniques are available or useful for the given setting; (iii) they perform very well in practice, returning accurate counts on images that no individual worker or computer vision algorithm can count correctly, while not incurring a high cost.
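The crowd-vision-hybrid idea, using a rough machine estimate to decide when a region is small enough for humans to count reliably and decomposing it otherwise, can be sketched as a simple recursion. This is an illustration under assumed interfaces (vision_estimate, ask_crowd, and split are hypothetical callables, and the threshold is arbitrary), not the JellyBean algorithms or their optimality guarantees.

    def hybrid_count(region, vision_estimate, ask_crowd, split, max_human_count=20):
        # vision_estimate(region) -> rough machine count, used only to decide whether to split
        # ask_crowd(region)       -> count reported by crowd workers for a small region
        # split(region)           -> list of sub-regions tiling the region
        if vision_estimate(region) <= max_human_count:
            # Small enough for workers to count accurately; spend one crowd question here.
            return ask_crowd(region)
        # Too dense for reliable human counting: decompose and sum the sub-counts.
        return sum(
            hybrid_count(r, vision_estimate, ask_crowd, split, max_human_count)
            for r in split(region)
        )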