layer number
Appendix of Joint Data-T ask Generation for Auxiliary Learning Hong Chen
We provide the derivation of the upper implicit gradient in eq. We summarize the whole DTG-AuxL algorithm in Algorithm 1, where the lower and upper optimization updates are conducted alternatingly. We use the batch stochastic gradient optimization for both the lower and upper update. STL: It is a natural baseline where we only train on the primary task. Equal: It is a multi-task learning method, where we assign an equal weight of 1.0 to the loss of each MAXL can be only applied to the classification problem.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (4 more...)
Appendix of Joint Data-T ask Generation for Auxiliary Learning Hong Chen
We provide the derivation of the upper implicit gradient in eq. We summarize the whole DTG-AuxL algorithm in Algorithm 1, where the lower and upper optimization updates are conducted alternatingly. We use the batch stochastic gradient optimization for both the lower and upper update. STL: It is a natural baseline where we only train on the primary task. Equal: It is a multi-task learning method, where we assign an equal weight of 1.0 to the loss of each MAXL can be only applied to the classification problem.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (4 more...)
Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash
Jia, Fucheng, Wu, Zewen, Jiang, Shiqi, Jiang, Huiqiang, Zhang, Qianxi, Yang, Yuqing, Liu, Yunxin, Ren, Ju, Zhang, Deyu, Cao, Ting
Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight DRAM-flash swapping and incorporates three novel techniques: (1) Cross-layer active weights preloading. It uses the activations from the current layer to predict the active weights of several subsequent layers, enabling computation and data loading to overlap, as well as facilitating large I/O transfers. (2) Sparsity-aware self-distillation. It adjusts the active weights to align with the dense-model output distribution, compensating for approximations introduced by contextual sparsity. (3) Active weight DRAM-flash swapping pipeline. It orchestrates the DRAM space allocation among the hot weight cache, preloaded active weights, and computation-involved weights based on available memory. Results show ActiveFlow achieves the performance-cost Pareto frontier compared to existing efficiency optimization methods.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- (4 more...)
- Research Report > Promising Solution (0.54)
- Research Report > New Finding (0.34)
Interpretable inverse design of optical multilayer thin films based on extended neural adjoint and regression activation mapping
We propose an extended neural adjoint (ENA) framework, which meets six key criteria for artificial intelligence-assisted inverse design of optical multilayer thin films (OMTs): accuracy, efficiency, diversity, scalability, flexibility, and interpretability. To enhance the scalability of the existing neural adjoint method, we present a novel forward neural network architecture for OMTs and introduce a material loss function into the existing neural adjoint loss function, facilitating the exploration of material configurations of OMTs. Furthermore, we present the detailed formulation of the regression activation mapping for the presented forward neural network architecture (F-RAM), a feature visualization method aimed at improving interpretability. We validated the efficacy of the material loss by conducting an ablation study, where each component of the loss function is systematically removed and evaluated. The results indicated that the inclusion of the material loss significantly improves accuracy and diversity. To substantiate the performance of the ENA-based inverse design, we compared it against the residual network-based global optimization network (Res-GLOnet). The ENA yielded the OMT solutions of an inverse design with higher accuracy and better diversity compared to the Res-GLOnet. To demonstrate the interpretability, we applied F-RAM to diverse OMT structures with similar optical properties, obtained by the proposed ENA method. We showed that distributions of feature importance for various OMT structures exhibiting analogous optical properties are consistent, despite variations in material configurations, layer number, and thicknesses. Furthermore, we demonstrate the flexibility of the ENA method by restricting the initial layer of OMTs to SiO2 and 100 nm.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- (3 more...)
Exploiting Edited Large Language Models as General Scientific Optimizers
Lv, Qitan, Liu, Tianyu, Wang, Hong
Large language models (LLMs) have been widely adopted in mathematical optimization in scientific scenarios for their extensive knowledge and advanced reasoning capabilities. Existing methods mainly focus on utilizing LLMs to solve optimization problems in a prompt-based manner, which takes observational feedback as additional textual descriptions. However, due to LLM's \textbf{high sensitivity to the prompts} and \textbf{tendency to get lost in lengthy prompts}, these methods struggle to effectively utilize the {observational} feedback from each optimization step, which severely hinders the applications for real-world scenarios. To address these challenges, we propose a conceptually simple and general {bi-level} optimization method, namely \textbf{G}eneral \textbf{S}cientific \textbf{O}ptimizers (GSO). Specifically, GSO first utilizes inner-level simulators as experimental platforms to evaluate the current solution and provide observational feedback. Then, LLMs serve as knowledgeable and versatile scientists, generating new solutions by refining potential errors from the feedback as the outer-level optimization. Finally, simulations together with the expert knowledge in LLMs are jointly updated with bi-level interactions via model editing. Extensive experiments show that GSO consistently outperforms existing state-of-the-art methods using \textit{six} different LLM backbones on \textit{seven} different tasks, demonstrating the effectiveness and a wide range of applications.
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- (4 more...)
- Research Report > Promising Solution (1.00)
- Research Report > New Finding (0.68)
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Saynova, Denitsa, Hagström, Lovisa, Johansson, Moa, Johansson, Richard, Kuhlmann, Marco
Previous interpretations of language models (LMs) miss important distinctions in how these models process factual information. For example, given the query "Astrid Lindgren was born in" with the corresponding completion "Sweden", no difference is made between whether the prediction was based on having the exact knowledge of the birthplace of the Swedish author or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we investigate four different prediction scenarios for which the LM can be expected to show distinct behaviors. These scenarios correspond to different levels of model reliability and types of information being processed - some being less desirable for factual predictions. To facilitate precise interpretations of LMs for fact completion, we propose a model-specific recipe called PrISM for constructing datasets with examples of each scenario based on a set of diagnostic criteria. We apply a popular interpretability method, causal tracing (CT), to the four prediction scenarios and find that while CT produces different results for each scenario, aggregations over a set of mixed examples may only represent the results from the scenario with the strongest measured signal. In summary, we contribute tools for a more granular study of fact completion in language models and analyses that provide a more nuanced understanding of how LMs process fact-related queries.
- Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- North America > United States > Ohio (0.04)
- (17 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
Beyond Individual Facts: Investigating Categorical Knowledge Locality of Taxonomy and Meronomy Concepts in GPT Models
Burger, Christopher, Hu, Yifan, Le, Thai
The location of knowledge within Generative Pre-trained Transformer (GPT)-like models has seen extensive recent investigation. However, much of the work is focused towards determining locations of individual facts, with the end goal being the editing of facts that are outdated, erroneous, or otherwise harmful, without the time and expense of retraining the entire model. In this work, we investigate a broader view of knowledge location, that of concepts or clusters of related information, instead of disparate individual facts. To do this, we first curate a novel dataset, called DARC, that includes a total of 34 concepts of ~120K factual statements divided into two types of hierarchical categories, namely taxonomy and meronomy. Next, we utilize existing causal mediation analysis methods developed for determining regions of importance for individual facts and apply them to a series of related categories to provide detailed investigation into whether concepts are associated with distinct regions within these models. We find that related categories exhibit similar areas of importance in contrast to less similar categories. However, fine-grained localization of individual category subsets to specific regions is not apparent.
- North America > United States > Mississippi (0.04)
- North America > United States > Indiana (0.04)
- North America > Dominican Republic (0.04)
- Asia > China > Hong Kong (0.04)
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
Wendler, Chris, Veselovsky, Veniamin, Monea, Giovanni, West, Robert
We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that the abstract "concept space" lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Finland (0.04)
- (2 more...)
On the Role of Attention Masks and LayerNorm in Transformers
Wu, Xinyi, Ajorlou, Amir, Wang, Yifei, Jegelka, Stefanie, Jadbabaie, Ali
Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank one subspace still happens exponentially. However, through construction of nontrivial counterexamples, we then establish that with proper choice of value matrices, a general class of sequences may not converge to a rank one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria with any possible rank between one and full. Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention and suggests that self-attention with LayerNorm constitutes a much more expressive, versatile nonlinear dynamical system than what was originally thought.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > Middle East > Jordan (0.04)