Towards A Litmus Test for Common Sense
Latapie, Hugo
This paper is the second in a planned series aimed at envisioning a path to safe and beneficial artificial intelligence. Building on the conceptual insights of "Common Sense Is All You Need," we propose a more formal litmus test for common sense, adopting an axiomatic approach that combines minimal prior knowledge (MPK) constraints with diagonal or Gödel-style arguments to create tasks beyond the agent's known concept set. We discuss how this approach applies to the Abstraction and Reasoning Corpus (ARC), acknowledging training/test data constraints, physical or virtual embodiment, and large language models (LLMs). We also integrate observations regarding emergent deceptive hallucinations, in which more capable AI systems may intentionally fabricate plausible yet misleading outputs to disguise knowledge gaps. The overarching theme is that scaling AI without ensuring common sense risks intensifying such deceptive tendencies, thereby undermining safety and trust. Aligning with the broader goal of developing beneficial AI without causing harm, our axiomatic litmus test not only diagnoses whether an AI can handle truly novel concepts but also provides a stepping stone toward an ethical, reliable foundation for future safe, beneficial, and aligned artificial intelligence.
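To make the diagonalization idea concrete, here is a minimal Python sketch (an illustration, not the paper's formal construction): given the concepts an agent already knows, a generator selects a candidate rule that disagrees with every known concept on some probe input, so any task built from that rule lies outside the agent's known concept set. All names here (diagonal_task, known_concepts, candidate_rules) are hypothetical.

    # Illustrative sketch: a toy diagonalization over an agent's declared
    # concept set. Any rule matching a known concept everywhere is excluded,
    # so solving the generated task requires forming a genuinely new concept.

    def diagonal_task(known_concepts, candidate_rules):
        """Return the first candidate rule that disagrees with every
        known concept on some probe input (diagonal disagreement)."""
        for rule in candidate_rules:
            if all(any(rule(x) != known(x) for x in range(10))
                   for known in known_concepts):
                return rule
        return None

    # Toy example: the agent knows identity and doubling; the generator
    # picks a rule (here, squaring) that diagonalizes against both.
    known = [lambda x: x, lambda x: 2 * x]
    candidates = [lambda x: 2 * x, lambda x: x * x]
    novel = diagonal_task(known, candidates)
    assert novel is not None and novel(3) == 9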
Common Sense Is All You Need
Latapie, Hugo
Artificial intelligence (AI) has made significant strides in recent years, yet it continues to struggle with a fundamental aspect of cognition present in all animals: common sense. Current AI systems, including those designed for complex tasks like autonomous driving, problem-solving challenges such as the Abstraction and Reasoning Corpus (ARC), and conversational benchmarks like the Turing Test, often lack the ability to adapt to new situations without extensive prior knowledge. This manuscript argues that integrating common sense into AI systems is essential for achieving true autonomy and unlocking the full societal and commercial value of AI. We propose a shift in the order of knowledge acquisition, emphasizing the importance of developing AI systems that start from minimal prior knowledge and are capable of contextual learning, adaptive reasoning, and embodiment -- even within abstract domains. Additionally, we highlight the need to rethink the AI software stack to address this foundational challenge. Without common sense, AI systems may never reach true autonomy, instead exhibiting asymptotic performance that approaches theoretical ideals like AIXI yet remains unattainable in practice due to infinite resource and computation requirements. While scaling AI models and passing benchmarks like the Turing Test have brought significant advancements in applications that do not require autonomy, these approaches alone are insufficient to achieve autonomous AI with common sense. By redefining existing benchmarks and challenges to enforce constraints that require genuine common sense, and by broadening our understanding of embodiment to include both physical and abstract domains, we can encourage the development of AI systems better equipped to handle the complexities of real-world and abstract environments.
DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
Annavajjala, Aditya, Khare, Alind, Agrawal, Animesh, Fedorov, Igor, Latapie, Hugo, Lee, Myungjin, Tumanov, Alexey
CNNs are increasingly deployed across different hardware, dynamic environments, and low-power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need for scalable solutions to design and train specialized CNNs. Once-for-all training has emerged as a scalable approach that jointly co-trains many models (subnets) at once at a constant training cost and finds specialized CNNs later. The scalability is achieved by training the full model and simultaneously reducing it to smaller subnets that share model weights (weight-shared shrinking). However, existing once-for-all training approaches incur huge training costs, reaching 1200 GPU hours. We argue this is because they start the process of shrinking the full model either too early or too late. Hence, we propose Delayed ε-Shrinking (DεpS), which starts shrinking the full model once it is partially trained (~50%), leading to lower training cost and better in-place knowledge distillation to smaller models. The proposed approach also consists of novel heuristics that dynamically adjust subnet learning rates incrementally (ε), leading to improved weight-shared knowledge distillation from larger to smaller subnets as well. As a result, DεpS outperforms state-of-the-art once-for-all training techniques across different datasets, including CIFAR10/100, ImageNet-100, and ImageNet-1k, on accuracy and cost. It achieves 1.83% higher ImageNet-1k top-1 accuracy, or the same accuracy with a 1.3x reduction in FLOPs and a 2.5x drop in training cost (GPU-hours).
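The delayed-shrinking schedule can be sketched in a few lines of Python. This is an assumption-laden toy, not the authors' implementation: the full model trains alone for the first ~50% of epochs, after which smaller weight-shared subnets join each step, each with an incrementally scaled learning rate. The width multipliers and learning-rate step are invented for illustration.

    # Minimal sketch of delayed shrinking with incremental subnet
    # learning rates (all constants hypothetical).

    TOTAL_EPOCHS = 4
    SHRINK_START = TOTAL_EPOCHS // 2          # delay shrinking to ~50% of training
    SUBNET_WIDTHS = [1.0, 0.75, 0.5, 0.25]    # width multipliers (hypothetical)
    BASE_LR = 0.1
    LR_STEP = 0.5                             # per-subnet incremental lr scaling

    def train_step(epoch, width, lr):
        # Placeholder for one weight-shared forward/backward pass.
        print(f"epoch {epoch}: width={width:.2f} lr={lr:.4f}")

    for epoch in range(TOTAL_EPOCHS):
        if epoch < SHRINK_START:
            train_step(epoch, 1.0, BASE_LR)   # full model trains alone
        else:
            lr = BASE_LR
            for width in SUBNET_WIDTHS:       # largest first: in-place
                train_step(epoch, width, lr)  # distillation to smaller subnets
                lr *= LR_STEP                 # incrementally shrink the lr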
HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph Reasoning
Chen, Hanning, Ni, Yang, Zakeri, Ali, Zou, Zhuowen, Yun, Sanggeon, Wen, Fei, Khaleghi, Behnam, Srinivasa, Narayan, Latapie, Hugo, Imani, Mohsen
In recent years, numerous hardware accelerators have been proposed for graph learning applications such as vertex classification and graph classification. However, previous works have paid little attention to Knowledge Graph Completion (KGC), a task well known for its significantly higher algorithmic complexity. State-of-the-art KGC solutions based on graph convolutional neural networks (GCNs) involve extensive vertex/relation embedding updates and complicated score functions, which are inherently cumbersome to accelerate. As a result, existing accelerator designs are no longer optimal, and a novel algorithm-hardware co-design for KG reasoning is needed. Recently, brain-inspired HyperDimensional Computing (HDC) has been introduced as a promising solution for lightweight machine learning, particularly for graph learning applications. In this paper, we leverage HDC for an intrinsically more efficient and acceleration-friendly KGC algorithm. We also co-design an acceleration framework named HDReason targeting FPGA platforms. At the algorithm level, HDReason balances high reasoning accuracy, strong model interpretability, and low computational complexity. In terms of architecture, HDReason offers reconfigurability, high training throughput, and low energy consumption. Compared with an NVIDIA RTX 4090 GPU, the proposed accelerator achieves an average 10.6x speedup and a 65x energy efficiency improvement. In cross-model and cross-platform comparisons, HDReason yields on average 4.2x higher performance and 3.4x better energy efficiency with similar accuracy versus the state-of-the-art FPGA-based GCN training platform.
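The HDC primitives that make KGC acceleration-friendly are simple enough to show in a toy Python example. This mirrors standard hyperdimensional-computing practice (random bipolar hypervectors, elementwise binding, dot-product similarity), not HDReason's actual algorithm; the tiny knowledge graph is invented.

    # Toy HDC: entities/relations as random bipolar hypervectors, triple
    # plausibility as bind-and-compare against a stored fact.

    import numpy as np

    D = 10_000                                  # hypervector dimensionality
    rng = np.random.default_rng(0)

    def hv():
        return rng.choice([-1, 1], size=D)      # random bipolar hypervector

    def bind(a, b):
        return a * b                            # elementwise multiply (XOR-like)

    def similarity(a, b):
        return float(a @ b) / D                 # normalized dot product

    # Tiny KG: (paris, capital_of, france) is a known fact.
    paris, france, tokyo = hv(), hv(), hv()
    capital_of = hv()
    memory = bind(bind(paris, capital_of), france)   # stored fact hypervector

    def score(h, r, t):
        """Higher similarity to stored memory = more plausible triple."""
        return similarity(bind(bind(h, r), t), memory)

    print(score(paris, capital_of, france))  # ~1.0 (known fact)
    print(score(paris, capital_of, tokyo))   # ~0.0 (implausible)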
Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments
Gu, Yu, Shu, Yiheng, Yu, Hao, Liu, Xiao, Dong, Yuxiao, Tang, Jie, Srinivasa, Jayanth, Latapie, Hugo, Su, Yu
The applications of large language models (LLMs) have expanded well beyond the confines of text processing, signaling a new era where LLMs are envisioned as generalist language agents capable of operating within complex real-world environments. These environments are often highly expansive, making it impossible for the LLM to process them within its short-term memory. Motivated by recent research on extending the capabilities of LLMs with tools, this paper investigates the intriguing potential of tools to augment LLMs in handling such complexity. To this end, we design customized tools to aid in the proactive exploration within these massive environments. Such tools can serve as a middleware layer shielding the LLM from environmental complexity. In two representative complex environments -- knowledge bases (KBs) and databases -- we demonstrate the significant potential of augmenting language agents with tools in complex environments. Notably, equipped with these tools, GPT-4 achieves 2.8X the performance of the best baseline in tasks requiring access to database content and 2.2X in KB tasks. Our findings illuminate the path for advancing language agents in complex real-world applications.
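A hedged sketch of the "tools as middleware" idea follows: instead of dumping a huge database schema into the LLM's context, the middleware exposes small navigation tools the agent calls on demand, keeping each observation small enough to fit in its context window. The tool names and the in-memory sqlite database are illustrative assumptions, not the paper's exact toolkit.

    # Three hypothetical middleware tools over a database the LLM never
    # sees in full; the agent chains them step by step.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (id INTEGER, title TEXT, year INTEGER)")
    conn.execute("INSERT INTO papers VALUES (1, 'RecallM', 2023)")

    def list_tables():
        """Tool 1: name the tables without exposing their contents."""
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
        return [r[0] for r in rows]

    def describe_table(table):
        """Tool 2: return only the column names of one table."""
        return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

    def run_query(sql):
        """Tool 3: execute a query the agent has composed."""
        return conn.execute(sql).fetchall()

    print(list_tables())                    # ['papers']
    print(describe_table("papers"))         # ['id', 'title', 'year']
    print(run_query("SELECT title FROM papers WHERE year = 2023"))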
Adaptive Deep Neural Network Inference Optimization with EENet
Ilhan, Fatih, Chow, Ka-Ho, Hu, Sihao, Huang, Tiansheng, Tekin, Selim, Wei, Wenqi, Wu, Yanzhao, Lee, Myungjin, Kompella, Ramana, Latapie, Hugo, Liu, Gaowen, Liu, Ling
Well-trained deep neural networks (DNNs) treat all test samples equally during prediction. Adaptive DNN inference with early exiting leverages the observation that some test examples are easier to predict than others. This paper presents EENet, a novel early-exiting scheduling framework for multi-exit DNN models. Instead of having every sample go through all DNN layers during prediction, EENet learns an early-exit scheduler that can intelligently terminate inference early for predictions in which the model already has high confidence. As opposed to previous heuristics-based early-exiting solutions, our EENet framework optimizes an early-exiting policy to maximize model accuracy while satisfying the given per-sample average inference budget. Extensive experiments are conducted on four computer vision datasets (CIFAR-10, CIFAR-100, ImageNet, Cityscapes) and two NLP datasets (SST-2, AgNews). The results demonstrate that adaptive inference with EENet outperforms representative existing early-exit techniques. We also perform a detailed visualization analysis of the comparison results to interpret the benefits of EENet.
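The core early-exit mechanism can be illustrated in a short Python sketch. Note the approximation: EENet learns its exit policy, whereas this toy uses fixed confidence thresholds; the exit_head stand-in and the "easiness" attribute are invented. A sample leaves at the first exit whose confidence clears the threshold, so easy samples cost fewer layers.

    # Toy multi-exit inference with threshold-based exiting.

    import numpy as np

    rng = np.random.default_rng(1)

    def exit_head(depth, x):
        """Stand-in for an intermediate classifier head: confidence grows
        with depth and with how 'easy' the sample is (both hypothetical)."""
        return min(1.0, x["easiness"] + 0.3 * depth)

    THRESHOLDS = [0.8, 0.8, 0.0]   # per-exit thresholds; final exit takes all

    def infer(x):
        for depth, tau in enumerate(THRESHOLDS):
            if exit_head(depth, x) >= tau:
                return depth          # exit index used (a proxy for cost)
        return len(THRESHOLDS) - 1

    samples = [{"easiness": rng.uniform(0.2, 1.0)} for _ in range(1000)]
    exits = [infer(s) for s in samples]
    print("mean exit depth:", np.mean(exits))   # well below the full depth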
VQPy: An Object-Oriented Approach to Modern Video Analytics
Yu, Shan, Zhu, Zhenting, Chen, Yu, Xu, Hanchen, Zhao, Pengzhan, Wang, Yang, Padmanabhan, Arthi, Latapie, Hugo, Xu, Harry
Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building on the insight that video objects (e.g., humans, animals, cars), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend (a Python variant with constructs that make it easy for users to express video objects and their interactions) and an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized at Cisco as part of its DeepVision framework.
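An illustrative plain-Python analogue of the object-oriented query style the abstract describes is shown below. This is not the actual VQPy frontend API; the class and property names are invented to convey the idea that video objects are modeled like ordinary objects with derived properties, and queries become filters over those objects.

    # Invented classes illustrating object-oriented video queries.

    class VideoObject:
        def __init__(self, track):
            self.track = track                 # list of (frame, x, y) detections

        @property
        def speed(self):
            """Derived property: displacement per frame across the track."""
            (f0, x0, y0), (f1, x1, y1) = self.track[0], self.track[-1]
            frames = max(1, f1 - f0)
            return (((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5) / frames

    class Car(VideoObject):
        pass

    def query_speeding_cars(objects, limit=5.0):
        """A 'video query': filter objects by class and a derived property."""
        return [o for o in objects if isinstance(o, Car) and o.speed > limit]

    cars = [Car([(0, 0, 0), (10, 120, 0)]),    # fast: 12 px/frame
            Car([(0, 0, 0), (10, 20, 0)])]     # slow:  2 px/frame
    print(len(query_speeding_cars(cars)))      # 1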
A Retrieve-and-Read Framework for Knowledge Graph Link Prediction
Pahuja, Vardaan, Wang, Boshi, Latapie, Hugo, Srinivasa, Jayanth, Su, Yu
Knowledge graph (KG) link prediction aims to infer new facts based on existing facts in the KG. Recent studies have shown that using the graph neighborhood of a node via graph neural networks (GNNs) provides more useful information compared to just using the query information. Conventional GNNs for KG link prediction follow the standard message-passing paradigm on the entire KG, which leads to superfluous computation, over-smoothing of node representations, and also limits their expressive power. At large scale, it becomes computationally expensive to aggregate useful information from the entire KG for inference. To address the limitations of existing KG link prediction frameworks, we propose a novel retrieve-and-read framework, which first retrieves a relevant subgraph context for the query and then jointly reasons over the context and the query with a high-capacity reader. As part of our exemplar instantiation for the new framework, we propose a novel Transformer-based GNN as the reader, which incorporates a graph-based attention structure and cross-attention between the query and the context for deep fusion. This simple yet effective design enables the model to focus on salient context information relevant to the query. Empirical results on two standard KG link prediction datasets demonstrate the competitive performance of the proposed method. Furthermore, our analysis yields valuable insights for designing improved retrievers within the framework.
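A hedged Python sketch of the retrieve-and-read pattern follows: first pull a small subgraph around the query entity, then hand only that context to the reader. The reader here is a trivial lookup stub, whereas the paper uses a Transformer-based GNN with query-context cross-attention; the miniature KG is invented.

    # Retrieve stage: k-hop neighborhood; read stage: stubbed reader.

    from collections import deque

    # Tiny KG as an adjacency list of (relation, neighbor) pairs.
    KG = {
        "paris":  [("capital_of", "france")],
        "france": [("in_continent", "europe")],
        "europe": [],
        "tokyo":  [("capital_of", "japan")],
    }

    def retrieve_subgraph(entity, hops=2):
        """Breadth-first retrieval of the k-hop neighborhood; everything
        outside it is never shown to the reader."""
        seen, triples, frontier = {entity}, [], deque([(entity, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for rel, nbr in KG.get(node, []):
                triples.append((node, rel, nbr))
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append((nbr, depth + 1))
        return triples

    def read(query, context):
        """Stub 'reader': score candidate tails from the retrieved context."""
        head, rel = query
        return [t for (h, r, t) in context if h == head and r == rel]

    context = retrieve_subgraph("paris", hops=2)
    print(context)                                  # only paris's neighborhood
    print(read(("paris", "capital_of"), context))   # ['france']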
RecallM: An Adaptable Memory Mechanism with Temporal Understanding for Large Language Models
Kynoch, Brandon, Latapie, Hugo, van der Sluis, Dwane
Large Language Models (LLMs) have made extraordinary progress in the field of Artificial Intelligence and have demonstrated remarkable capabilities across a large variety of tasks and domains. However, as we venture closer to creating Artificial General Intelligence (AGI) systems, we recognize the need to supplement LLMs with long-term memory to overcome the context window limitation and more importantly, to create a foundation for sustained reasoning, cumulative learning and long-term user interaction. In this paper we propose RecallM, a novel architecture for providing LLMs with an adaptable and updatable long-term memory mechanism. Unlike previous methods, the RecallM architecture is particularly effective at belief updating and maintaining a temporal understanding of the knowledge provided to it. We demonstrate through various experiments the effectiveness of this architecture. Furthermore, through our own temporal understanding and belief updating experiments, we show that RecallM is four times more effective than using a vector database for updating knowledge previously stored in long-term memory. We also demonstrate that RecallM shows competitive performance on general question-answering and in-context learning tasks.
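The belief-updating behavior the abstract highlights can be conveyed with a toy Python sketch: a memory keyed by (subject, relation) where a newer statement supersedes an older one, preserving a temporal notion of "current" knowledge. The real RecallM architecture is considerably richer; the class and method names here are invented.

    # Toy temporal memory with last-write-wins belief updating.

    import time

    class TemporalMemory:
        def __init__(self):
            self.store = {}                  # (subject, relation) -> (value, t)

        def update(self, subject, relation, value):
            """Later writes win: this is the belief update."""
            self.store[(subject, relation)] = (value, time.time())

        def recall(self, subject, relation):
            entry = self.store.get((subject, relation))
            return entry[0] if entry else None

    mem = TemporalMemory()
    mem.update("alice", "works_at", "Acme")
    mem.update("alice", "works_at", "Initech")   # belief revision
    print(mem.recall("alice", "works_at"))       # 'Initech', not 'Acme'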
Ethosight: A Reasoning-Guided Iterative Learning System for Nuanced Perception based on Joint-Embedding & Contextual Label Affinity
Latapie, Hugo, Yu, Shan, Hammer, Patrick, Thorisson, Kristinn R., Petrosyan, Vahagn, Kynoch, Brandon, Khare, Alind, Behnam, Payman, Tumanov, Alexey, Saxena, Aksheit, Aralikatti, Anish, Chen, Hanning, Imani, Mohsen, Archbold, Mike, Li, Tangrui, Wang, Pei, Hart, Justin
Traditional computer vision models often necessitate extensive data acquisition, annotation, and validation. They frequently struggle in real-world applications, resulting in high false-positive and false-negative rates, and exhibit poor adaptability to new scenarios, often requiring costly retraining. To address these issues, we present Ethosight, a flexible and adaptable zero-shot video analytics system. Ethosight begins from a clean slate based on user-defined video analytics, specified through natural language or keywords, and leverages joint embedding models and reasoning mechanisms informed by ontologies such as WordNet and ConceptNet. Ethosight operates effectively on low-cost edge devices and supports enhanced runtime adaptation, thereby offering a new approach to continuous learning without catastrophic forgetting. We provide empirical validation of Ethosight's promising effectiveness across diverse and complex use cases, while highlighting areas for further improvement. A significant contribution of this work is the release of all source code and datasets to enable full reproducibility and to foster further innovation in both the research and commercial domains.
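A hedged sketch of "contextual label affinity" with a joint embedding follows: user-defined labels are scored against an image by similarity in a shared space, after expanding the label set with ontology neighbors. The embeddings are random stand-ins for a real joint-embedding model (e.g., CLIP-style), and the ontology dictionary is an invented miniature, not WordNet or ConceptNet.

    # Toy label-affinity scoring with ontology-informed label expansion.

    import numpy as np

    DIM = 64

    def embed(text_or_image):
        """Stand-in for a joint-embedding encoder: deterministic per input."""
        local = np.random.default_rng(abs(hash(text_or_image)) % (2**32))
        v = local.standard_normal(DIM)
        return v / np.linalg.norm(v)

    ONTOLOGY = {"loitering": ["standing", "waiting"]}   # invented miniature

    def label_affinities(image, labels):
        expanded = list(labels)
        for lab in labels:                     # ontology-informed expansion
            expanded += ONTOLOGY.get(lab, [])
        img = embed(image)
        return {lab: float(img @ embed(lab)) for lab in expanded}

    scores = label_affinities("frame_001.jpg", ["loitering", "running"])
    for label, affinity in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{label:10s} {affinity:+.3f}")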