 Kumar, Arun


Unseen Object Reasoning with Shared Appearance Cues

arXiv.org Artificial Intelligence

This paper introduces an innovative approach to open world recognition (OWR), where we leverage knowledge acquired from known objects to address the recognition of previously unseen objects. The traditional method of object modeling relies on supervised learning with strict closed-set assumptions, presupposing that objects encountered during inference are already known at the training phase. However, this assumption proves inadequate for real-world scenarios due to the impracticality of accounting for the immense diversity of objects. Our hypothesis posits that object appearances can be represented as collections of "shareable" mid-level features, arranged in constellations to form object instances. By adopting this framework, we can efficiently dissect and represent both known and unknown objects in terms of their appearance cues. Our paper introduces a straightforward yet elegant method for modeling novel or unseen objects, utilizing established appearance cues and accounting for inherent uncertainties. This representation not only enables the detection of out-of-distribution objects or novel categories among unseen objects but also facilitates a deeper level of reasoning, empowering the identification of the superclass to which an unknown instance belongs. This novel approach holds promise for advancing open world recognition in diverse applications.


KIX: A Metacognitive Generalization Framework

arXiv.org Artificial Intelligence

Humans and other animals exhibit general intelligence, solving a variety of tasks flexibly and adapting to novel situations by reusing and applying high-level knowledge acquired over time. Artificial agents, by contrast, behave more like specialists and lack such generalist behaviors. Artificial agents will need to understand and exploit critical structured knowledge representations. We present a metacognitive generalization framework, Knowledge-Interaction-eXecution (KIX), and argue that interactions with objects leveraging type space facilitate the learning of transferable interaction concepts and generalization. It is a natural way of integrating knowledge into reinforcement learning and promises to act as an enabler for autonomous and generalist behaviors in artificial intelligence systems.


Saturn: An Optimized Data System for Large Model Deep Learning Workloads

arXiv.org Artificial Intelligence

Large language models such as GPT-3 and ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques and tools. Navigating such parallelism choices, however, is a new burden for end users of DL, such as data scientists and domain scientists, who may lack the necessary systems know-how. The need for model selection, which requires training many models due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as an MILP. We find that direct use of an MILP-solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.
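
To make the joint nature of the SPASE problem concrete, here is a toy sketch that chooses a parallelism and a GPU count for each job at once. It is not Saturn's MILP formulation: it enumerates all combinations by brute force, assumes all jobs run concurrently (so it omits the scheduling dimension), and minimizes the longest job runtime. The function name `best_joint_plan`, the runtime table, and the parallelism labels are hypothetical stand-ins for what a profiler might report.

```python
from itertools import product
from typing import Dict, Optional, Tuple

Config = Tuple[str, int]  # (parallelism label, number of GPUs); labels are illustrative


def best_joint_plan(
    runtimes: Dict[str, Dict[Config, float]],
    total_gpus: int,
) -> Tuple[Optional[Dict[str, Config]], float]:
    """Pick one (parallelism, GPU count) per job so that all jobs fit on the
    cluster at the same time and the longest job finishes as early as possible.

    runtimes[job][config] is a hypothetical profiled runtime estimate.
    Returns (plan, makespan); plan is None if no combination fits.
    """
    jobs = sorted(runtimes)
    best_plan: Optional[Dict[str, Config]] = None
    best_makespan = float("inf")
    # Enumerate every combination of per-job configurations (toy sizes only).
    for choice in product(*(runtimes[job].items() for job in jobs)):
        gpus_used = sum(config[1] for config, _ in choice)
        if gpus_used > total_gpus:
            continue  # this combination does not fit on the cluster
        makespan = max(runtime for _, runtime in choice)
        if makespan < best_makespan:
            best_makespan = makespan
            best_plan = {job: config for job, (config, _) in zip(jobs, choice)}
    return best_plan, best_makespan


if __name__ == "__main__":
    # Made-up runtime estimates for two jobs on a 3-GPU machine.
    table = {
        "A": {("fsdp", 1): 10.0, ("fsdp", 2): 6.0, ("pipeline", 2): 7.0},
        "B": {("fsdp", 1): 8.0, ("pipeline", 2): 5.0},
    }
    print(best_joint_plan(table, total_gpus=3))
```

The enumeration grows exponentially with the number of jobs, which is one reason a solver-based formulation such as the MILP described in the abstract is attractive for realistic workloads.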


Saturn: Efficient Multi-Large-Model Deep Learning

arXiv.org Artificial Intelligence

In this paper, we propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g., during model selection/hyperparameter optimization). We first identify three key interconnected systems challenges for users building large models in this setting -- parallelism technique selection, distribution of GPUs over jobs, and scheduling. We then formalize these as a joint problem, and build a new system architecture to tackle these challenges simultaneously. Our evaluations show that our joint-optimization approach yields 39-49% lower model selection runtimes than typical current DL practice.


Objects as Spatio-Temporal 2.5D points

arXiv.org Artificial Intelligence

Determining accurate bird's eye view (BEV) positions of objects and tracks in a scene is vital for various perception tasks, including object interaction mapping and scenario extraction. However, the level of supervision required to accomplish this is extremely challenging to procure. We propose a lightweight, weakly supervised method to estimate the 3D position of objects by jointly learning to regress 2D object detections and the scene's depth in a single feed-forward pass of a network. Our proposed method extends a center-point based single-shot object detector and introduces a novel object representation in which each object is modeled spatio-temporally as a BEV point, without the need for any 3D or BEV annotations during training or LiDAR data at query time. The approach leverages readily available 2D object supervision along with LiDAR point clouds (used only during training) to jointly train a single network that learns to predict 2D object detections alongside the whole scene's depth, in order to spatio-temporally model object tracks as points in BEV. The proposed method is over $\sim$10x more computationally efficient than recent SOTA approaches while achieving comparable accuracy on the KITTI tracking benchmark.
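
The geometric idea of turning a 2D detection plus a depth estimate into a BEV point can be illustrated with standard pinhole back-projection: for a detected center at horizontal pixel u with depth Z, the lateral offset is X = (u - cx) * Z / fx. The sketch below shows only that step. The pinhole camera model, the function name `center_to_bev`, and the intrinsics values are assumptions made for illustration, not details taken from the paper.

```python
from typing import NamedTuple


class BevPoint(NamedTuple):
    x: float  # lateral offset from the optical axis, in metres
    z: float  # forward distance along the optical axis, in metres


def center_to_bev(u: float, depth: float, fx: float, cx: float) -> BevPoint:
    """Back-project a 2D detection center to a BEV point (pinhole model).

    u     : horizontal pixel coordinate of the detected object center
    depth : predicted depth at that pixel, in metres along the optical axis
    fx    : focal length in pixels (hypothetical intrinsic)
    cx    : horizontal principal point in pixels (hypothetical intrinsic)
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    return BevPoint(x=x, z=depth)


if __name__ == "__main__":
    # Made-up detection: center 200 px right of the principal point, 20 m away.
    print(center_to_bev(u=800.0, depth=20.0, fx=700.0, cx=600.0))
```

Applying this to every detection in every frame would yield one BEV point per object per time step, which is the kind of spatio-temporal point representation the abstract describes.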


Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

arXiv.org Artificial Intelligence

Cross-lingual dubbing of lecture videos requires transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, and text-to-speech synthesis, followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker's rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation, with scores of 4.09 and 3.74, respectively. Human effort is also reduced by 75%.


Predicting Eating Events in Free Living Individuals -- A Technical Report

arXiv.org Machine Learning

This technical report records experiments applying multiple machine learning algorithms to predict the eating and food purchasing behaviors of free-living individuals. Data were collected using accelerometers, global positioning system (GPS) units, and body-worn cameras called SenseCam over a one-week period from 81 individuals of various ages and demographic backgrounds. These data were turned into minute-level features from sensors as well as engineered features that included time (e.g., time since last eating) and environmental context (e.g., distance to nearest grocery store). Algorithms include Logistic Regression, RBF-SVM, Random Forest, and Gradient Boosting. Our results show that the Gradient Boosting model has the highest mean accuracy score (0.7289) for predicting eating events 0 to 4 minutes in advance. For predicting food purchasing events, the RBF-SVM model (0.7395) outperforms the others. For both prediction models, temporal and spatial features were important contributors to predicting eating and food purchasing events.
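
As an illustration of the engineered temporal features mentioned above, the following minimal sketch computes a minute-level "time since last eating" feature from a binary event series. The function name, the input encoding, and the choice to count only strictly earlier minutes are assumptions for this example. The report does not specify how the feature was computed.

```python
from typing import List, Optional


def minutes_since_last_event(event_flags: List[int]) -> List[Optional[int]]:
    """For each minute, return the minutes elapsed since the most recent
    earlier event, or None if no event has occurred yet.

    event_flags[i] is 1 if an eating event was recorded in minute i, else 0.
    Only strictly earlier minutes are considered, so the feature does not
    leak the label of the minute being predicted.
    """
    feature: List[Optional[int]] = []
    last_event: Optional[int] = None
    for minute, flag in enumerate(event_flags):
        feature.append(None if last_event is None else minute - last_event)
        if flag:
            last_event = minute
    return feature


if __name__ == "__main__":
    print(minutes_since_last_event([0, 1, 0, 0, 1, 0]))
    # prints [None, None, 1, 2, 3, 1]
```

Missing values (None) would need to be imputed or encoded before being passed to a classifier such as gradient boosting. How to do so is a modelling choice left open here.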


SysML: The New Frontier of Machine Learning Systems

arXiv.org Machine Learning

Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, SysML, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.


Belief dynamics extraction

arXiv.org Artificial Intelligence

Animal behavior is not driven simply by current observations, but is strongly influenced by internal states. Estimating the structure of these internal states is crucial for understanding the neural basis of behavior. In principle, internal states can be estimated by inverting behavior models, as in inverse model-based Reinforcement Learning. However, this requires careful parameterization and risks model mismatch to the animal. Here we take a data-driven approach to infer latent states directly from observations of behavior, using a partially observable switching semi-Markov process. This process has two elements critical for capturing animal behavior: it captures the non-exponential distribution of times between observations, and its transitions between latent states depend on the animal's actions, features that require more complex non-Markovian models to represent. To demonstrate the utility of our approach, we apply it to the observations of a simulated optimal agent performing a foraging task, and find that the latent dynamics extracted by the model have correspondences with the belief dynamics of the agent. Finally, we apply our model to identify latent states in the behavior of a monkey performing a foraging task, and find clusters of latent states that identify periods of time consistent with expectant waiting. This data-driven behavioral model will be valuable for inferring latent cognitive states, and thereby for measuring neural representations of those states.
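
The first of the two elements, non-exponential dwell times, can be illustrated with a small simulation. The sketch below is not the paper's model: it is a fully observed two-state chain whose dwell times are drawn from a gamma distribution, and it omits partial observability and action-dependent transitions. The function name and all parameter values are illustrative assumptions.

```python
import random
from typing import List, Tuple


def simulate_semi_markov(
    n_segments: int,
    shape: float = 3.0,
    scale: float = 2.0,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Simulate a two-state semi-Markov chain as (state, dwell_time) segments.

    Dwell times are gamma-distributed. With shape != 1 they are
    non-exponential, which a plain continuous-time Markov chain cannot
    produce because its dwell times are always exponential.
    """
    rng = random.Random(seed)
    state = 0
    segments: List[Tuple[int, float]] = []
    for _ in range(n_segments):
        dwell = rng.gammavariate(shape, scale)
        segments.append((state, dwell))
        state = 1 - state  # switch to the other latent state
    return segments


if __name__ == "__main__":
    for state, dwell in simulate_semi_markov(5):
        print(f"state={state} dwell={dwell:.2f}")
```

Setting `shape=1.0` recovers exponential dwell times, which makes it easy to compare the two regimes side by side.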


Deep Domain Adaptation under Deep Label Scarcity

arXiv.org Artificial Intelligence

The goal of Domain Adaptation (DA) is to leverage labeled examples from a source domain to infer an accurate model in a target domain where labels are unavailable or, at best, scarce. A state-of-the-art approach to DA is DANN (Ganin et al. 2016), which attempts to induce a common representation of the source and target domains via adversarial training. This approach requires a large number of labeled examples from the source domain to be able to infer a good model for the target domain. However, in many situations obtaining labels in the source domain is expensive, which results in deteriorated performance of DANN and limits its applicability in such scenarios. In this paper, we propose a novel approach to overcome this limitation. In our work, we first establish that DANN reduces the original DA problem to a semi-supervised learning problem over the space of the common representation. Next, we propose a learning approach, namely TransDANN, that amalgamates adversarial learning and transductive learning to mitigate the detrimental impact of limited source labels and yield improved performance. Experimental results (on both text and images) show a significant boost in the performance of TransDANN over DANN under such scenarios. We also provide theoretical justification for the performance boost.