part 3
Capturing Sparks of Abstraction for the ARC Challenge
Excellent progress has been made recently in solving ARC Challenge problems. However, it seems that new techniques may be required to push beyond 60% accuracy. Even commercial Large Language Models (LLMs) struggle to 'understand' many of the problems (when given the input and output grids), which makes discovering solutions by LLM-led program search somewhat futile. In this work, LLM 'understanding' is attempted from a stronger starting position: an LLM is given complete solutions to tasks in code, and is then asked to explain how the task is being solved at various levels of abstraction. Specifically, the LLM was given code solutions implemented in arc-dsl-llm (an LLM-legible version of Hodel's arc-dsl) to obtain: (a) commented code; (b) code refactored into reusable functional chunks; (c) problem solution steps; and (d) high-level problem-solving tactics. We demonstrate that 'Sparks of Abstraction' can be extracted from the LLM output - in a form that could be used in downstream tasks with Local LLMs eligible to enter the ARC Prize. Both the arc-dsl-llm DSL framework (with the re-engineered solutions) and the Gemini LLM-generated data (along with the generation code) are made Open Source.
- Workflow (0.68)
- Research Report (0.65)
Quantum Speedup for Spectral Approximation of Kronecker Products
Gao, Yeqi, Song, Zhao, Zhang, Ruizhe
Given its widespread application in machine learning and optimization, the Kronecker product emerges as a pivotal linear algebra operator. However, its computational demands render it an expensive operation, making its spectral approximation costly with traditional classical algorithms. Existing classical methods for spectral approximation exhibit a linear dependency on the matrix dimension denoted by $n$, considering matrices of size $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$. Our work introduces an innovative approach to efficiently address the spectral approximation of the Kronecker product $A_1 \otimes A_2$ using quantum methods. By treating matrices as quantum states, our proposed method significantly reduces the time complexity of spectral approximation to $O_{d,\epsilon}(\sqrt{n})$.
- Overview (0.87)
- Research Report > Promising Solution (0.34)
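To make the cost that motivates the abstract above concrete, here is a short numpy sketch (sizes chosen purely for illustration): forming $A_1 \otimes A_2$ explicitly squares the row count, while its Gram matrix factors exactly without ever materializing the product.

```python
import numpy as np

# Illustrative sizes; the paper considers A1, A2 in R^{n x d}.
n, d = 100, 3
rng = np.random.default_rng(0)
A1 = rng.standard_normal((n, d))
A2 = rng.standard_normal((n, d))

# The Kronecker product has n^2 rows and d^2 columns, so forming it
# explicitly costs Theta(n^2 d^2) memory -- the blowup that motivates
# sketching / spectral approximation.
K = np.kron(A1, A2)
assert K.shape == (n * n, d * d)

# The Gram matrix of K factors via the mixed-product property:
#   K^T K = (A1 (x) A2)^T (A1 (x) A2) = (A1^T A1) (x) (A2^T A2)
G_direct = K.T @ K
G_factored = np.kron(A1.T @ A1, A2.T @ A2)
```

A spectral approximation seeks a much smaller matrix $S$ with $S^\top S \approx (1 \pm \epsilon)\, K^\top K$ in the spectral sense; the identity above is why structure-aware methods can avoid touching all $n^2$ rows.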
Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression
Li, Zhihang, Song, Zhao, Wang, Zifan, Yin, Junze
There have been significant advancements made by large language models (LLMs) in various aspects of our daily lives. LLMs serve as a transformative force in natural language processing, finding applications in text generation, translation, sentiment analysis, and question-answering. The accomplishments of LLMs have led to a substantial increase in research efforts in this domain. One specific two-layer regression problem has been well-studied in prior works, where the first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit. While previous works provide a solid analysis of building a two-layer regression, there is still a gap in the analysis of constructing regression problems with more than two layers. In this paper, we take a crucial step toward addressing this problem: we provide an analysis of a two-layer regression problem. In contrast to previous works, our first layer is activated by a softmax unit. This sets the stage for future analyses of creating more activation functions based on the softmax function. Rearranging the softmax function leads to significantly different analyses. Our main results involve analyzing the convergence properties of an approximate Newton method used to minimize the regularized training loss. We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions. This enables us to establish local convergence guarantees for the proposed training algorithm. Specifically, with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations, our algorithm can find an $\epsilon$-approximate minimizer of the training loss with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C) + d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and $\omega < 2.374$ is the matrix multiplication exponent.
- Workflow (0.70)
- Research Report (0.49)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.64)
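As a hedged illustration of the kind of guarantee the abstract describes (not the paper's softmax setting), the sketch below runs an exact Newton iteration on an $\ell_2$-regularized logistic loss, whose Hessian is likewise positive definite; a handful of iterations suffice to reach a near-stationary point, mirroring the $O(\log(1/\epsilon))$ local-convergence behaviour.

```python
import numpy as np

# Toy problem: l2-regularized logistic regression. All sizes and data
# are illustrative; the paper analyzes a softmax-activated first layer.
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))   # labels in {-1, +1}
t = (y + 1) / 2                           # same labels in {0, 1}
lam = 1.0                                 # regularizer keeps the Hessian PD

def grad_hess(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid probabilities
    g = X.T @ (p - t) + lam * w           # gradient of regularized loss
    S = p * (1.0 - p)                     # per-sample curvature
    H = X.T @ (S[:, None] * X) + lam * np.eye(d)
    return g, H

w = np.zeros(d)
for _ in range(10):                       # locally, O(log(1/eps)) steps
    g, H = grad_hess(w)
    w = w - np.linalg.solve(H, g)         # Newton update
```

An approximate Newton method replaces the exact solve above with a cheaper sketched or sampled Hessian; the positive-definiteness and Lipschitz conditions are what make that substitution safe.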
A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer
Li, Chenyang, Song, Zhao, Wang, Weixin, Yang, Chiwun
The Deep Leakage from Gradient (DLG) attack has emerged as a prevalent and highly effective method for extracting sensitive training data by inspecting exchanged gradients. This approach poses a substantial threat to the privacy of individuals and organizations alike. This research presents a comprehensive analysis of the gradient leakage method when applied specifically to transformer-based models. Through meticulous examination, we showcase the capability to accurately recover data solely from gradients and rigorously investigate the conditions under which gradient attacks can be executed, providing compelling evidence. Furthermore, we reevaluate the approach of introducing additional noise on gradients as a protective measure against gradient attacks. To address this, we outline a theoretical proof that analyzes the associated privacy costs within the framework of differential privacy. Additionally, we affirm the convergence of the Stochastic Gradient Descent (SGD) algorithm under perturbed gradients. The primary objective of this study is to augment the understanding of gradient leakage attacks and defense strategies while actively contributing to the development of privacy-preserving techniques specifically tailored for transformer-based models. By shedding light on the vulnerabilities and countermeasures associated with gradient leakage, this research aims to foster advancements in safeguarding sensitive data and upholding privacy in the context of transformer-based models.
- Overview (1.00)
- Research Report > New Finding (0.92)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
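The leakage phenomenon behind the paper above is easy to demonstrate in the simplest possible case (a single linear layer and one training example, not the paper's transformer setting): the weight gradient is rank one, so the private input is recoverable from the shared gradient up to sign and scale.

```python
import numpy as np

# For y = W x on a single example, the weight gradient is
#   dL/dW = delta x^T  (delta = dL/dy),
# a rank-1 matrix whose right-singular vector is x / ||x||.
rng = np.random.default_rng(2)
d_in, d_out = 8, 4
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)           # the "private" training input
y_target = rng.standard_normal(d_out)

delta = W @ x - y_target                # dL/dy for squared loss
grad_W = np.outer(delta, x)             # the gradient an attacker observes

# Attack: read x off the top right-singular vector of the gradient.
_, _, Vt = np.linalg.svd(grad_W)
x_rec = Vt[0]                           # equals +/- x / ||x||
x_unit = x / np.linalg.norm(x)
```

Batching, nonlinearities, and deeper architectures break this closed form, which is why attacks on transformers (and the noise-based defenses analyzed above) require a more careful treatment.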
Physics of Language Models: Part 3.2, Knowledge Manipulation
Allen-Zhu, Zeyuan, Li, Yuanzhi
Language models can store vast amounts of factual knowledge, but their ability to use this knowledge for logical reasoning remains questionable. This paper explores a language model's ability to manipulate its stored knowledge during inference. We focus on four manipulation types: retrieval (e.g., "What is person A's attribute X?"), classification (e.g., "Is A's attribute X even or odd?"), comparison (e.g., "Is A greater than B in attribute X?"), and inverse search (e.g., "Which person's attribute X equals T?"). We observe that pre-trained language models like GPT2/3/4 excel in knowledge retrieval but struggle with simple classification or comparison tasks unless Chain of Thoughts (CoTs) are employed during both training and inference. They also perform poorly in inverse knowledge search, irrespective of the prompts. Our primary contribution is a synthetic dataset for a controlled experiment that confirms these inherent weaknesses: a language model cannot efficiently manipulate knowledge from pre-training data, even when such knowledge is perfectly stored and fully extractable in the models, and despite adequate instruct fine-tuning.
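The four manipulation types can be stated concretely over a toy fact table (the names and values below are invented for illustration); what is trivial as code is exactly what the paper finds hard for models relying on stored knowledge alone.

```python
# Toy "knowledge base": attribute X = birth year. All entries invented.
facts = {"Anna": 1992, "Ben": 1987, "Cara": 2001}

retrieval = facts["Anna"]                    # "What is Anna's attribute X?"
classification = facts["Anna"] % 2 == 0      # "Is Anna's X even or odd?"
comparison = facts["Anna"] > facts["Ben"]    # "Is Anna greater than Ben in X?"
inverse = [p for p, v in facts.items() if v == 2001]  # "Whose X equals 2001?"
```

Retrieval reads a stored value directly; the other three require computing *on* the value, which is where the paper observes failures without Chain of Thought.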
How to Train Time Series Forecasting Faster using Ray, part 3 of 3
Even in the current age of Generative AI (Stable Diffusion, ChatGPT) and LLMs (large language models), Time Series Forecasting is still a fundamental part of running any business that depends on a supply chain or resources. One thing all these use cases have in common is training many models on different segments of data. Training, tuning, and deploying thousands of machine learning models in parallel using distributed computing can be a challenging task! Typical time series modeling software is not distributed by itself. This blog shares my tips for getting started converting your forecasting workloads to distributed computing.
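The "one model per data segment" pattern the post distributes with Ray has a simple shape. As an illustrative stand-in (a stdlib thread pool plays the role of the workers here; with Ray, `train_segment` would be an `@ray.remote` task and `pool.map` would become `ray.get` over submitted tasks), it looks like this:

```python
from concurrent.futures import ThreadPoolExecutor

def train_segment(segment):
    """Fit one model per data segment. Toy 'model': forecast the next
    value as the mean of the series. All data below is made up."""
    name, series = segment
    forecast = sum(series) / len(series)
    return name, forecast

segments = {
    "store_1": [10, 12, 11, 13],
    "store_2": [5, 7, 6, 8],
}

# Each segment trains independently, so the work parallelizes trivially.
with ThreadPoolExecutor() as pool:
    models = dict(pool.map(train_segment, segments.items()))
```

Because the segments share nothing, scaling from a local pool to a Ray cluster changes only where the tasks run, not the structure of the code.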
Continuations by Albert Wenger : Thinking About AI: Part 3 - Existential Risk...
Now we are getting to the biggest and weirdest risk of AI: a superintelligence emerging and wiping out humanity in pursuit of its own goals. To a lot of people this seems like a totally absurd idea, held only by a tiny fringe of people who appear weird and borderline culty. It seems so far out there, and also so huge, that most people wind up dismissing it and/or forgetting about it shortly after hearing it. There is a big similarity here to the climate crisis, where the more extreme views are widely dismissed. In case you have not encountered the argument yet, let me give a very brief summary (Nick Bostrom has an entire book on the topic and Eliezer Yudkowsky has been blogging about it for two decades, so this will be super compressed by comparison): A superintelligence, when it emerges, will pursue its own set of goals.
Optimize AI/ML workloads for sustainability: Part 3, deployment and monitoring
We're celebrating Earth Day 2022 from 4/22 through 4/29 with posts that highlight how to build, maintain, and refine your workloads for sustainability. AWS estimates that inference (the process of using a trained machine learning [ML] algorithm to make a prediction) makes up 90 percent of the cost of an ML model. Given that with AWS you pay for what you use, we estimate that inference also generally accounts for most of the resource usage within an ML lifecycle. In Part 3, our final piece in the series, we show you how to reduce the environmental impact of your ML workload once your model is in production. If you missed the first parts of this series, in Part 1 we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing.
The Spatial Web is Coming -- Part 3
Enter The Spatial Web Foundation and VERSES Technologies, a next-gen AI company that is literally laying the foundation for the Spatial Web Protocol by establishing and defining an entirely new computing technology stack comprised of three tiers: Interface, Logic & Data. VERSES has created the Hyperspace Transaction Protocol (HSTP), using Hyperspace Modeling Language (HSML), as the foundation for a common networked terminal, to bring all the interface tier components together in order to facilitate an indexed and searchable Spatial Web Browser of every person, place or thing, both real and digital. As Dan Mapes of VERSES points out, "HTML lets you program a web page -- HSML lets you program a web space." The Logic Tier enables the parsing of this huge amount of new spatial & UX data through cognitive computing methods, powered by VERSES' flagship contextual computing AI Operating System, COSM. VERSES is blockchain agnostic, which means you can use multiple chains and even operate a hybrid data layer using both DLT technologies and the cloud.
ABC of Deep Learning (Part 3 of 5)
As noted previously, the gradient descent algorithm is an optimization technique used to find the weights and bias values that minimize a cost function. The backpropagation algorithm trains neural networks by using gradient descent to minimize the cost function, and it does so quickly and efficiently. Before explaining the backpropagation algorithm, it's crucial to describe the equation behind any artificial neural network: a neural network can be represented by a composition of multivariate functions.
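A minimal sketch of these ideas: a one-hidden-layer network $f(x) = w_2 \cdot \tanh(W_1 x + b_1) + b_2$ trained with hand-written backpropagation (the chain rule applied to the composed layer functions) and plain gradient descent. All sizes, data, and the toy target are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = X[:, 0] * X[:, 1]                      # toy regression target

W1 = 0.5 * rng.standard_normal((2, 8))     # first-layer weights
b1 = np.zeros(8)
w2 = 0.5 * rng.standard_normal(8)          # output-layer weights
b2 = 0.0
lr = 0.1                                   # gradient-descent step size

def loss():
    h = np.tanh(X @ W1 + b1)
    return 0.5 * np.mean((h @ w2 + b2 - y) ** 2)

loss0 = loss()                             # cost before training
for _ in range(500):
    # forward pass: evaluate the composition of functions
    h = np.tanh(X @ W1 + b1)               # hidden activations
    err = h @ w2 + b2 - y                  # dL/dpred for 0.5*MSE
    # backward pass: chain rule, output layer first
    gw2 = h.T @ err / len(X)
    gb2 = err.mean()
    dz = np.outer(err, w2) * (1.0 - h**2)  # tanh'(z) = 1 - tanh(z)^2
    gW1 = X.T @ dz / len(X)
    gb1 = dz.mean(axis=0)
    # gradient descent step on every parameter
    W1 -= lr * gW1; b1 -= lr * gb1
    w2 -= lr * gw2; b2 -= lr * gb2
```

The backward pass is nothing more than differentiating the composed functions from the output inward; after training, the cost is strictly lower than before.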