Dhanbad
Practical Deep Learning with Bayesian Principles
Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E. Khan, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota
Behind Maya: Building a Multilingual Vision Language Model
Alam, Nahid, Kanjula, Karthik Reddy, Guthikonda, Surya, Chung, Timothy, Vegesna, Bala Krishna S, Das, Abhipsha, Susevski, Anthony, Chan, Ryan Sze-Yin, Uddin, S M Iftekhar, Islam, Shayekh Bin, Santhosh, Roshan, A, Snegha, Sharma, Drishti, Liu, Chen, Chaturvedi, Isha, Winata, Genta Indra, S, Ashvanth., Mukherjee, Snehanshu, Aji, Alham Fikri
Recent years have seen rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages, but underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks.
Optimizing Multi-DNN Inference on Mobile Devices through Heterogeneous Processor Co-Execution
Gao, Yunquan, Zhang, Zhiguo, Donta, Praveen Kumar, Dehury, Chinmaya Kumar, Wang, Xiujun, Niyato, Dusit, Zhang, Qiyang
Deep Neural Networks (DNNs) are increasingly deployed across diverse industries, driving a growing demand to enable their capabilities on mobile devices. However, existing mobile inference frameworks often rely on a single processor to handle each model's inference, limiting hardware utilization and leading to suboptimal performance and energy efficiency. Expanding DNN accessibility on mobile platforms requires more adaptive and resource-efficient solutions to meet increasing computational demands without compromising device functionality. Nevertheless, parallel inference of multiple DNNs on heterogeneous processors remains a significant challenge. Several works have explored partitioning DNN operations into subgraphs to enable parallel execution across heterogeneous processors. However, these approaches typically generate excessive subgraphs based solely on hardware compatibility, increasing scheduling complexity and memory management overhead. To address these limitations, we propose an Advanced Multi-DNN Model Scheduling (ADMS) strategy that optimizes multi-DNN inference across heterogeneous processors on mobile devices. ADMS constructs an optimal subgraph partitioning strategy offline, considering both hardware support of operations and scheduling granularity, while employing a processor-state-aware scheduling algorithm that dynamically balances workloads based on real-time operational conditions. This ensures efficient workload distribution and maximizes the utilization of available processors. Experimental results show that, compared to vanilla inference frameworks, ADMS reduced multi-DNN inference latency by 4.04
To reduce interaction latency and lower server-side computing costs, an increasing number of applications are shifting inference tasks to mobile devices. In many real-world scenarios, multiple independent or related DNN models run concurrently on mobile devices. For instance, in a smart agriculture scenario, farmers capture video frames using a smartphone camera and perform real-time parallel inference with multiple DNN models. These models include crop identification [5], pest and disease detection [6], plant health assessment [7], and soil quality analysis [8].
Gao, X. Wang are with the School of Computer Science and Technology, Anhui Engineering Research Center for Intelligent Applications and Security of Industrial Internet, Anhui University of Technology, Ma'anshan, Anhui, 243032, China.
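The processor-state-aware scheduling idea in the ADMS abstract can be illustrated with a minimal sketch. This is not the ADMS algorithm: it is a plain greedy earliest-finish-time heuristic, and the `Processor` and `Subgraph` types, the `speed` and `work` units, and the operation names are all hypothetical. Each subgraph goes to the compatible processor that would finish it soonest given that processor's current load, and subgraphs of the same model run in order.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    supported_ops: set        # operations this processor can execute
    speed: float              # hypothetical work units processed per ms
    busy_until: float = 0.0   # time at which the processor becomes free


@dataclass
class Subgraph:
    model: str   # model this subgraph belongs to
    ops: set     # operations the subgraph contains
    work: float  # hypothetical cost in work units


def schedule(subgraphs, processors):
    """Greedy earliest-finish-time assignment (illustrative, not ADMS itself)."""
    ready = {}  # model name -> finish time of its previous subgraph
    plan = []
    for sg in subgraphs:
        # Only processors that support every operation in the subgraph qualify.
        candidates = [p for p in processors if sg.ops <= p.supported_ops]
        if not candidates:
            raise ValueError(f"no processor supports {sorted(sg.ops)}")

        def finish_time(p):
            start = max(p.busy_until, ready.get(sg.model, 0.0))
            return start + sg.work / p.speed

        best = min(candidates, key=finish_time)
        start = max(best.busy_until, ready.get(sg.model, 0.0))
        end = start + sg.work / best.speed
        best.busy_until = end
        ready[sg.model] = end
        plan.append((sg.model, best.name, start, end))
    return plan


if __name__ == "__main__":
    cpu = Processor("cpu", {"conv", "relu", "softmax"}, speed=1.0)
    gpu = Processor("gpu", {"conv", "relu"}, speed=4.0)
    jobs = [
        Subgraph("A", {"conv", "relu"}, 8.0),
        Subgraph("B", {"conv"}, 4.0),
        Subgraph("A", {"softmax"}, 1.0),
    ]
    for row in schedule(jobs, [cpu, gpu]):
        print(row)
```

In this toy setup, the final subgraph of model A falls back to the CPU because the hypothetical GPU does not support `softmax`, which is the kind of hardware-compatibility constraint the abstract describes.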
Split-n-Chain: Privacy-Preserving Multi-Node Split Learning with Blockchain-Based Auditability
Sahani, Mukesh, Sengupta, Binanda
Deep learning, when trained on a large amount of data, has the potential to outperform traditional machine learning in terms of accuracy. Recently, privacy-preserving deep learning has drawn significant attention from the research community. Different privacy notions in deep learning include privacy of data provided by data-owners and privacy of parameters and/or hyperparameters of the underlying neural network. Federated learning is a popular privacy-preserving execution environment where data-owners participate in learning the parameters collectively without leaking their respective data to other participants. However, federated learning suffers from certain security/privacy issues. In this paper, we propose Split-n-Chain, a variant of split learning where the layers of the network are split among several distributed nodes. Split-n-Chain achieves several privacy properties: data-owners need not share their training data with other nodes, and no node has access to the parameters and hyperparameters of the neural network (except those of the respective layers it holds). Moreover, Split-n-Chain uses blockchain to audit the computation done by different nodes. Our experimental results show that Split-n-Chain is efficient in terms of the time required to execute different phases, and that the training loss trend is similar to that of the same neural network when implemented in a monolithic fashion.
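The combination of layer-wise splitting and auditable computation can be sketched as follows. The abstract does not specify how the blockchain records are built, so this sketch substitutes a simple SHA-256 hash chain for the blockchain; the `Node`, `AuditChain`, and `run_pipeline` names and the one-parameter "layers" are hypothetical and purely illustrative. Each node keeps its own parameters private, and only digests of its inputs and outputs are logged.

```python
import hashlib
import json


def digest(obj):
    """SHA-256 digest of a JSON-serialisable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


class Node:
    """Holds one toy layer; its parameters never leave the node."""

    def __init__(self, name, weight, bias):
        self.name = name
        self._weight = weight
        self._bias = bias

    def forward(self, x):
        # Element-wise affine map followed by ReLU.
        return [max(0.0, self._weight * v + self._bias) for v in x]


class AuditChain:
    """Append-only hash chain standing in for a blockchain ledger."""

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, node_name, inp, out):
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        body = {"node": node_name, "in": digest(inp), "out": digest(out), "prev": prev}
        self.blocks.append({**body, "hash": digest(body)})

    def verify(self):
        prev = self.GENESIS
        for block in self.blocks:
            body = {k: block[k] for k in ("node", "in", "out", "prev")}
            if block["prev"] != prev or block["hash"] != digest(body):
                return False
            prev = block["hash"]
        return True


def run_pipeline(nodes, x, chain):
    """Pass activations from node to node, logging a digest of each step."""
    for node in nodes:
        y = node.forward(x)
        chain.append(node.name, x, y)
        x = y
    return x


if __name__ == "__main__":
    chain = AuditChain()
    nodes = [Node("n1", 2.0, 1.0), Node("n2", -1.0, 4.0)]
    print(run_pipeline(nodes, [1.0, 2.0], chain))
    print(chain.verify())
```

Because each block stores the hash of its predecessor, changing any logged digest after the fact makes `verify()` fail, which is the property an auditor would rely on.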
SplatR : Experience Goal Visual Rearrangement with 3D Gaussian Splatting and Dense Feature Matching
S, Arjun P, Melnik, Andrew, Nandi, Gora Chand
Experience Goal Visual Rearrangement task stands as a foundational challenge within Embodied AI, requiring an agent to construct a robust world model that accurately captures the goal state. The agent uses this world model to restore a shuffled scene to its original configuration, making an accurate representation of the world essential for successfully completing the task. In this work, we present a novel framework that leverages 3D Gaussian Splatting as a 3D scene representation for the experience goal visual rearrangement task. Recent advances in volumetric scene representation, like 3D Gaussian Splatting, offer fast rendering of high-quality and photo-realistic novel views.
However, these methods have disadvantages: 2D and 3D semantic maps store object pose and semantic information in a grid; this approach provides limited resolution, does not inherently capture interactions between objects, and is prone to sensitivity issues and quantization errors. Although pointcloud-based representation can provide more robustness to sensitivity, it lacks structural semantics: identifying objects and their interactions with the world in a noisy pointcloud is challenging. Scene-graph-based methods often assume a clear and well-defined relationship between objects, which often limits the granularity of scene understanding, ...
Maya: An Instruction Finetuned Multilingual Multimodal Model
Alam, Nahid, Kanjula, Karthik Reddy, Guthikonda, Surya, Chung, Timothy, Vegesna, Bala Krishna S, Das, Abhipsha, Susevski, Anthony, Chan, Ryan Sze-Yin, Uddin, S M Iftekhar, Islam, Shayekh Bin, Santhosh, Roshan, A, Snegha, Sharma, Drishti, Liu, Chen, Chaturvedi, Isha, Winata, Genta Indra, S, Ashvanth., Mukherjee, Snehanshu, Aji, Alham Fikri
The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances in a manner free from toxicity. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
packetLSTM: Dynamic LSTM Framework for Streaming Data with Varying Feature Space
Agarwal, Rohit, Naidu, Karaka Prasanth, Horsch, Alexander, Agarwal, Krishna, Prasad, Dilip K.
We study the online learning problem characterized by the varying input feature space of streaming data. Although LSTMs have been employed to effectively capture the temporal nature of streaming data, they cannot handle the dimension-varying streams in an online learning setting. Therefore, we propose a dynamic LSTM-based novel method, called packetLSTM, to model the dimension-varying streams. The packetLSTM's dynamic framework consists of an evolving packet of LSTMs, each dedicated to processing one input feature. Each LSTM retains the local information of its corresponding feature, while a shared common memory consolidates global information. This configuration facilitates continuous learning and mitigates the issue of forgetting, even when certain features are absent for extended time periods. The idea of utilizing one LSTM per feature coupled with a dimension-invariant operator for information aggregation enhances the dynamic nature of packetLSTM. This dynamic nature is evidenced by the model's ability to activate, deactivate, and add new LSTMs as required, thus seamlessly accommodating varying input dimensions. The packetLSTM achieves state-of-the-art results on five datasets, and its underlying principle is extended to other RNN types, like GRU and vanilla RNN.
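The core structural idea, one recurrent unit per feature combined through a dimension-invariant aggregation, can be sketched in a few lines. This is not the packetLSTM implementation: to stay dependency-free, each per-feature LSTM is replaced by a stand-in unit that keeps an exponential moving average, and `max` is used as one example of a dimension-invariant operator. The `FeatureUnit` and `Packet` names and the `decay` parameter are hypothetical.

```python
class FeatureUnit:
    """Stand-in for a per-feature LSTM: the state is an exponential moving average."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0

    def step(self, value):
        self.state = self.decay * self.state + (1.0 - self.decay) * value
        return self.state


class Packet:
    """Evolving collection of per-feature units with a dimension-invariant readout."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.units = {}  # feature name -> FeatureUnit

    def step(self, observation):
        """observation maps feature name -> value; any subset of features may appear."""
        outputs = []
        for name, value in observation.items():
            if name not in self.units:
                # A previously unseen feature gets its own new unit.
                self.units[name] = FeatureUnit(self.decay)
            outputs.append(self.units[name].step(value))
        # Units of absent features are simply not stepped, so their state is retained.
        # max() accepts any number of inputs, so the readout does not depend on
        # how many features are present at this time step.
        return max(outputs) if outputs else None


if __name__ == "__main__":
    packet = Packet()
    print(packet.step({"a": 2.0}))
    print(packet.step({"a": 2.0, "b": 8.0}))
    print(packet.step({"b": 0.0}))
```

Feature `a` is absent in the last step, yet its unit keeps its state and resumes from it if `a` reappears, mirroring the retention behaviour the abstract describes.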
Physics Informed Kolmogorov-Arnold Neural Networks for Dynamical Analysis via Efficent-KAN and WAV-KAN
Patra, Subhajit, Panda, Sonali, Parida, Bikram Keshari, Arya, Mahima, Jacobs, Kurt, Bondar, Denys I., Sen, Abhijit
However, traditional deep neural networks often face challenges in achieving high accuracy without incurring significant computational costs. In this work, we implement the Physics-Informed Kolmogorov-Arnold Neural Networks (PIKAN) through efficient-KAN and WAV-KAN, which utilize the Kolmogorov-Arnold representation theorem. PIKAN demonstrates superior performance compared to conventional deep neural networks, achieving the same level of accuracy with fewer layers and reduced computational overhead. We explore both B-spline and wavelet-based implementations of PIKAN and benchmark their performance across various ordinary and partial differential equations using unsupervised (data-free) and supervised (data-driven) techniques. For certain differential equations, the data-free approach suffices to find accurate solutions, while in more complex scenarios, the data-driven method enhances PIKAN's ability to converge to the correct solution. We validate our results against numerical solutions and achieve 99% accuracy in most scenarios.
I. INTRODUCTION
The advent of deep learning and its use in solving complicated tasks related to computer vision, natural language processing, speech, etc., has led to state-of-the-art applications in industries like healthcare, finance, and robotics, to name a few. Further, using deep neural networks (DNNs) to solve differential equations through Physics Informed Neural Networks (PINNs) is another breakthrough that offered a new framework for solving partial differential equations [1]. Since then, the field of PINNs has received a lot of attention (e.g., see review [2]) and has been extended to solve fractional equations, integral-differential equations, and stochastic partial differential equations [3-5]. PINN has been developed to be more robust and accurate [6] because the original form of PINN has drawbacks [7-12], which emanate from deep networks.
Recently, a promising alternative to the traditional multilayer perceptron has been proposed: the Kolmogorov-Arnold Neural Network (KAN) [13].
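The distinction between the data-free and data-driven settings mentioned in the abstract comes down to which loss terms are used. A minimal sketch follows, assuming a hypothetical test problem, the ODE du/dt = -u with u(0) = 1, and a plain Python callable as the trial solution rather than a KAN. The derivative is approximated with central finite differences instead of automatic differentiation, so this illustrates only the loss structure, not the PIKAN training procedure.

```python
import math


def physics_loss(u, ts, h=1e-4):
    """Data-free loss: mean squared residual of du/dt + u = 0 plus the
    squared initial-condition error for u(0) = 1 (hypothetical test problem)."""
    residuals = []
    for t in ts:
        du_dt = (u(t + h) - u(t - h)) / (2.0 * h)  # central finite difference
        residuals.append((du_dt + u(t)) ** 2)
    return sum(residuals) / len(residuals) + (u(0.0) - 1.0) ** 2


def data_loss(u, samples):
    """Data-driven loss: mean squared error against known (t, value) pairs."""
    return sum((u(t) - y) ** 2 for t, y in samples) / len(samples)


if __name__ == "__main__":
    ts = [0.1 * i for i in range(11)]
    samples = [(t, math.exp(-t)) for t in ts]

    def exact(t):
        return math.exp(-t)

    def linear(t):
        return 1.0 - t

    print(physics_loss(exact, ts), physics_loss(linear, ts))
    print(data_loss(exact, samples), data_loss(linear, samples))
```

A data-free run would minimise `physics_loss` alone, while a data-driven run would add `data_loss` with some weighting; the weighting scheme is left out here because the text does not state it.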
Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge
Yenamandra, Sriram, Ramachandran, Arun, Khanna, Mukul, Yadav, Karmesh, Vakil, Jay, Melnik, Andrew, Büttner, Michael, Harz, Leon, Brown, Lyon, Nandi, Gora Chand, PS, Arjun, Yadav, Gaurav Kumar, Kala, Rahul, Haschke, Robert, Luo, Yang, Zhu, Jinxin, Han, Yansen, Lu, Bingyi, Gu, Xuan, Liu, Qinyuan, Zhao, Yaping, Ye, Qiting, Dou, Chenxiao, Chua, Yansong, Kuzma, Volodymyr, Humennyy, Vladyslav, Partsey, Ruslan, Francis, Jonathan, Chaplot, Devendra Singh, Chhablani, Gunjan, Clegg, Alexander, Gervet, Theophile, Jain, Vidhi, Ramrakhya, Ram, Szot, Andrew, Wang, Austin, Yang, Tsung-Yen, Edsinger, Aaron, Kemp, Charlie, Shah, Binit, Kira, Zsolt, Batra, Dhruv, Mottaghi, Roozbeh, Bisk, Yonatan, Paxton, Chris
To develop robots that can effectively serve as versatile and capable home assistants, it is crucial for them to reliably perceive and interact with a wide variety of objects across diverse environments. To this end, we proposed Open Vocabulary Mobile Manipulation as a key benchmark task for robotics: finding any object in a novel environment and placing it on any receptacle surface within that environment. We organized a NeurIPS 2023 competition featuring both simulation and real-world components to evaluate solutions to this task. Our baselines on the most challenging version of this task, using real perception in simulation, achieved only a 0.8% success rate; by the end of the competition, the best participants achieved a 10.8% success rate, a 13x improvement. We observed that the most successful teams employed a variety of methods, yet two common threads emerged among the best solutions: enhancing error detection and recovery, and improving the integration of perception with decision-making processes. In this paper, we detail the results and methodologies used, both in simulation and real-world settings. We discuss the lessons learned and their implications for future research. Additionally, we compare performance in real and simulated environments, emphasizing the necessity for robust generalization to novel settings.