Goto

Collaborating Authors

 Calgary


Advancing IIoT with Over-the-Air Federated Learning: The Role of Iterative Magnitude Pruning

arXiv.org Artificial Intelligence

The industrial Internet of Things (IIoT) under Industry 4.0 heralds an era of interconnected smart devices where data-driven insights and machine learning (ML) fuse to revolutionize manufacturing. A noteworthy development in IIoT is the integration of federated learning (FL), which addresses data privacy and security among devices. FL enables edge sensors, also known as peripheral intelligence units (PIUs) to learn and adapt using their data locally, without explicit sharing of confidential data, to facilitate a collaborative yet confidential learning process. However, the lower memory footprint and computational power of PIUs inherently require deep neural network (DNN) models that have a very compact size. Model compression techniques such as pruning can be used to reduce the size of DNN models by removing unnecessary connections that have little impact on the model's performance, thus making the models more suitable for the limited resources of PIUs. Targeting the notion of compact yet robust DNN models, we propose the integration of iterative magnitude pruning (IMP) of the DNN model being trained in an over-the-air FL (OTA-FL) environment for IIoT. We provide a tutorial overview and also present a case study of the effectiveness of IMP in OTA-FL for an IIoT environment. Finally, we present future directions for enhancing and optimizing these deep compression techniques further, aiming to push the boundaries of IIoT capabilities in acquiring compact yet robust and high-performing DNN models.


TON-VIO: Online Time Offset Modeling Networks for Robust Temporal Alignment in High Dynamic Motion VIO

arXiv.org Artificial Intelligence

Temporal misalignment (time offset) between sensors is common in low cost visual-inertial odometry (VIO) systems. Such temporal misalignment introduces inconsistent constraints for state estimation, leading to a significant positioning drift especially in high dynamic motion scenarios. In this article, we focus on online temporal calibration to reduce the positioning drift caused by the time offset for high dynamic motion VIO. For the time offset observation model, most existing methods rely on accurate state estimation or stable visual tracking. For the prediction model, current methods oversimplify the time offset as a constant value with white Gaussian noise. However, these ideal conditions are seldom satisfied in real high dynamic scenarios, resulting in the poor performance. In this paper, we introduce online time offset modeling networks (TON) to enhance real-time temporal calibration. TON improves the accuracy of time offset observation and prediction modeling. Specifically, for observation modeling, we propose feature velocity observation networks to enhance velocity computation for features in unstable visual tracking conditions. For prediction modeling, we present time offset prediction networks to learn its evolution pattern. To highlight the effectiveness of our method, we integrate the proposed TON into both optimization-based and filter-based VIO systems. Simulation and real-world experiments are conducted to demonstrate the enhanced performance of our approach. Additionally, to contribute to the VIO community, we will open-source the code of our method on: https://github.com/Franky-X/FVON-TPN.


Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

arXiv.org Artificial Intelligence

In this paper, we extended the method proposed in [17] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode the high-level natural language conversations and semantic understanding of the robot's task environment, and abstract them to the robot's actionable commands or queries. We performed a quantitative evaluation of our framework's natural vocal conversation understanding with participants from different racial backgrounds and English language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal commands decoding accuracy, 86.27% commands execution success, and an average latency of 0.89 seconds from receiving the participants' vocal chat commands to initiating the robot's actual physical action. The video demonstrations of this paper can be found at https://linusnep.github.io/MTCC-IRoNL/.


Creating an African American-Sounding TTS: Guidelines, Technical Challenges,and Surprising Evaluations

arXiv.org Artificial Intelligence

This poses challenges for applications interested in targeting specific demographics (e.g., an African American business or NGO; a voice-tutoring system for children that are not of White ethnicity, etc.). The ultimate goal of the project described in this paper is to provide to designers, developers, and enterprises the choice of having a professional voice which is clearly recognizable as African American, and therefore more able to address diversity and inclusiveness issues. Being more precise, our goal is to create an African American Text-to-Speech system, which we will refer simply as an African American voice or AA voice, able to produce synthetic audio segments from standard English texts, and which will be recognized by African American speakers and non-speakers as sounding like a native African American speaker. The AA voice should exhibit a level of technical quality similar to the Standard American English (SAE) synthetic voices currently available through professional platforms. The evaluation of the technical quality of the AA voice, however, is not addressed in this paper, which focuses primarily on whether the AA voice can be recognized as sounding like an African American speaker. Linguists [27, 28] have described a continuum of dialects under what is often termed African American Vernacular English (AAVE). At one end of the spectrum, one finds the largest deviation from SAE in terms of lexicon (including slang), syntax and morphology, and phonological/phonetic properties. At the other end, AAVE speakers begin to approach SAE in terms of lexicon and grammar but still retain marked speech characteristics (primarily in terms of intonation, phonation, and vowel placement [14, 28]) which grant the speech a distinctive identity which listeners use as cues in the perception of African American English [44].


Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR

arXiv.org Artificial Intelligence

This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages where the baseline language model is insufficient for generating inclusive lattices. We minimally augment the baseline language model with word unigram counts that are present in a larger text corpus of the target language but absent in the baseline. The lattices generated after decoding with such an augmented baseline language model are more comprehensive. We obtain 21.8% (Telugu) and 41.8% (Kannada) relative word error reduction with our proposed method. This reduction in word error rate is comparable to 21.5% (Telugu) and 45.9% (Kannada) relative word error reduction obtained by decoding with full Wikipedia text augmented language mode while our approach consumes only 1/8th the memory. We demonstrate that our method is comparable with various text selection-based language model augmentation and also consistent for data sets of different sizes. Our approach is applicable for training speech recognition systems under low resource conditions where speech data and compute resources are insufficient, while there is a large text corpus that is available in the target language. Our research involves addressing the issue of out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple and yet computationally less expensive.


Rules still work for Open Information Extraction

arXiv.org Artificial Intelligence

Open information extraction (OIE) aims to extract surface relations and their corresponding arguments from natural language text, irrespective of domain. This paper presents an innovative OIE model, APRCOIE, tailored for Chinese text. Diverging from previous models, our model generates extraction patterns autonomously. The model defines a new pattern form for Chinese OIE and proposes an automated pattern generation methodology. In that way, the model can handle a wide array of complex and diverse Chinese grammatical phenomena. We design a preliminary filter based on tensor computing to conduct the extraction procedure efficiently. To train the model, we manually annotated a large-scale Chinese OIE dataset. In the comparative evaluation, we demonstrate that APRCOIE outperforms state-of-the-art Chinese OIE models and significantly expands the boundaries of achievable OIE performance. The code of APRCOIE and the annotated dataset are released on GitHub (https://github.com/jialin666/APRCOIE_v1)


More than words: Advancements and challenges in speech recognition for singing

arXiv.org Artificial Intelligence

This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing encompasses unique challenges, including extensive pitch variations, diverse vocal styles, and background music interference. We explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I will describe some of my own experiences when performing research on these tasks just as they were starting to gain traction, but will also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.


Training Self-localization Models for Unseen Unfamiliar Places via Teacher-to-Student Data-Free Knowledge Transfer

arXiv.org Artificial Intelligence

A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open-world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of private data of teachers as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as a sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.


Can Large Language Models Play Games? A Case Study of A Self-Play Approach

arXiv.org Artificial Intelligence

Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge. While LLMs have proven beneficial as decision-making aids, their reliability is hampered by limitations in reasoning, hallucination phenomenon, and so on. On the other hand, Monte-Carlo Tree Search (MCTS) is a heuristic search algorithm that provides reliable decision-making solutions, achieved through recursive rollouts and self-play. However, the effectiveness of MCTS relies heavily on heuristic pruning and external value functions, particularly in complex decision scenarios. This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve deterministic turn-based zero-sum games (DTZG), such as chess and go, without the need for additional training. Specifically, we utilize LLMs as both action pruners and proxies for value functions without the need for additional training. We theoretically prove that the suboptimality of the estimated value in our proposed method scales with $\tilde{\mathcal O}\Bigl(\frac{|\tilde {\mathcal A}|}{\sqrt{N}} + \epsilon_\mathrm{pruner} + \epsilon_\mathrm{critic}\Bigr)$, where \(N\) is the number of simulations, $|\tilde {\mathcal A}|$ is the cardinality of the pruned action space by LLM, and $\epsilon_\mathrm{pruner}$ and $\epsilon_\mathrm{critic}$ quantify the errors incurred by adopting LLMs as action space pruner and value function proxy, respectively. Our experiments in chess and go demonstrate the capability of our method to address challenges beyond the scope of MCTS and improve the performance of the directly application of LLMs.


Tracing the Roots of Facts in Multilingual Language Models: Independent, Shared, and Transferred Knowledge

arXiv.org Artificial Intelligence

Acquiring factual knowledge for language models (LMs) in low-resource languages poses a serious challenge, thus resorting to cross-lingual transfer in multilingual LMs (ML-LMs). In this study, we ask how ML-LMs acquire and represent factual knowledge. Using the multilingual factual knowledge probing dataset, mLAMA, we first conducted a neuron investigation of ML-LMs (specifically, multilingual BERT). We then traced the roots of facts back to the knowledge source (Wikipedia) to identify the ways in which ML-LMs acquire specific facts. We finally identified three patterns of acquiring and representing facts in ML-LMs: language-independent, cross-lingual shared and transferred, and devised methods for differentiating them. Our findings highlight the challenge of maintaining consistent factual knowledge across languages, underscoring the need for better fact representation learning in ML-LMs.