Problem-Independent Architectures
Review for NeurIPS paper: Adapting Neural Architectures Between Domains
This paper focuses on the intersection of neural architecture search and domain adaptation. The proposal is to minimize the cross-domain generalization gap that generally exists in current neural architecture search (NAS) methods that rely on proxy tasks. The philosophy behind it sounds quite interesting to me. Namely, instead of directly searching on the target dataset, which incurs a high computational cost, the authors propose to improve the generalizability of neural architectures by leveraging a small portion of target samples via a domain adaptation technique. This philosophy leads to a novel algorithm design, AdaptNAS, that I have not seen before.
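To make the idea concrete, one generic way to fold domain adaptation into a differentiable NAS objective is sketched below; this is an illustrative formulation, not necessarily the exact AdaptNAS loss, with $d(\cdot,\cdot)$ standing for any domain-discrepancy measure and $\lambda$ a trade-off weight.

```latex
% Generic sketch: search on the source task plus a domain-alignment term
% between source and target feature distributions; lambda and d(.,.) are
% illustrative placeholders, not the paper's exact formulation.
\[
  \min_{\alpha}\;
    \mathcal{L}_{\mathrm{src}}\!\bigl(w^{*}(\alpha), \alpha\bigr)
    + \lambda\, d\!\bigl(P_{\mathrm{src}}, P_{\mathrm{tgt}};\, w^{*}(\alpha), \alpha\bigr)
  \quad \text{s.t.} \quad
  w^{*}(\alpha) = \arg\min_{w}\, \mathcal{L}_{\mathrm{train}}(w, \alpha).
\]
```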
Reviews: XNAS: Neural Architecture Search with Expert Advice
I think the empirical evidence for exponentiated gradient descent (with wipeout) in NAS is valuable for the community. I would suggest that the authors clearly state the limitations associated with the analysis. Summary: The proposed approach for NAS treats architecture choices, i.e., intra-cell node connections as in DARTS (Liu et al., 2018), as selection among "experts". The expert weights are found via EG descent (Kivinen & Warmuth, 1997), which somehow utilizes the standard "back-propagated loss gradient" (line 113) to perform the multiplicative weight updates. It is at this point that I had trouble following the theoretical analysis, given that the EG algorithm requires a loss function on the expert prediction to be specified in order to develop a regret bound.
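For reference, the update in question is a multiplicative (exponentiated-gradient) step over the candidate operations on an edge, followed by a wipeout of experts whose weight has collapsed. The sketch below is a minimal reconstruction of that general scheme; the wipeout threshold rule here is an illustrative assumption, not the exact XNAS criterion.

```python
import numpy as np

def eg_with_wipeout(weights, grads, lr=0.1, wipeout_threshold=1e-3):
    """Minimal sketch of an exponentiated-gradient (multiplicative) update
    over per-operation 'expert' weights, followed by a simple wipeout step.
    The threshold rule is illustrative, not the criterion used in XNAS."""
    # Multiplicative update: a larger back-propagated gradient shrinks the weight.
    weights = weights * np.exp(-lr * grads)
    weights /= weights.sum()                      # project back onto the simplex
    # Wipeout: permanently remove experts whose weight has collapsed.
    alive = weights >= wipeout_threshold
    weights = np.where(alive, weights, 0.0)
    weights /= weights.sum()
    return weights, alive

# Example: four candidate operations competing on one edge of a cell.
w = np.full(4, 0.25)
g = np.array([0.5, -0.2, 1.5, 0.1])              # per-expert loss gradients
w, alive = eg_with_wipeout(w, g)
```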
Reviews: XNAS: Neural Architecture Search with Expert Advice
The reviewers appreciated the good empirical results and theoretical analysis. This paper proposes to treat NAS problems as a selection problem among experts. Over time, it eliminates underperforming experts with a wipe-out step. As two of the reviewers pointed out, the theoretical analysis is interesting (and rare, in this type of paper), but it would be good to more explicitly spell out when and why the assumptions hold. Empirical performance seems good, but the authors should include error bars for at least the CIFAR-10 experiments and ideally the ImageNet ones as well.
Decomposing Interventional Causality into Synergistic, Redundant, and Unique Components
We introduce a novel framework for decomposing interventional causal effects into synergistic, redundant, and unique components, building on the intuition of Partial Information Decomposition (PID) and the principle of Möbius inversion. While recent work has explored a similar decomposition of an observational measure, we argue that a proper causal decomposition must be interventional in nature. We develop a mathematical approach that systematically quantifies how causal power is distributed among variables in a system, using a recently derived closed-form expression for the Möbius function of the redundancy lattice. The formalism is then illustrated by decomposing the causal power in logic gates, cellular automata, and chemical reaction networks. Our results reveal how the distribution of causal power can be context- and parameter-dependent. This decomposition provides new insights into complex systems by revealing how causal influences are shared and combined among multiple variables, with potential applications ranging from attribution of responsibility in legal or AI systems, to the analysis of biological networks or climate models.
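For readers less familiar with PID, the lattice machinery referenced here follows the standard Williams-Beer-style relation between cumulative redundancy and the atoms recovered by Möbius inversion; in generic notation (not necessarily the paper's symbols):

```latex
% Generic relation on the redundancy lattice: the cumulative redundancy
% I_cap at a node alpha is the sum of the atoms Pi at or below it, and
% Mobius inversion with the lattice's Mobius function mu recovers the atoms.
\[
  I_{\cap}(\alpha) \;=\; \sum_{\beta \preceq \alpha} \Pi(\beta)
  \quad\Longleftrightarrow\quad
  \Pi(\alpha) \;=\; \sum_{\beta \preceq \alpha} \mu(\beta, \alpha)\, I_{\cap}(\beta).
\]
```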
Arbitrarily Scalable Environment Generators via Neural Cellular Automata
We study the problem of generating arbitrarily large environments to improve the throughput of multi-robot systems. Prior work proposes Quality Diversity (QD) algorithms as an effective method for optimizing the environments of automated warehouses. However, these approaches optimize only relatively small environments, falling short when it comes to replicating real-world warehouse sizes. The challenge arises from the exponential increase in the search space as the environment size increases. Additionally, the previous methods have only been tested with up to 350 robots in simulations, while practical warehouses could host thousands of robots.
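The neural-cellular-automaton generator named in the title sidesteps this scaling issue because its update rule is purely local, so a rule trained on small grids can be rolled out on arbitrarily large ones. The toy, non-learned update step below only illustrates that size-agnostic property; it is an assumption-laden sketch, not the paper's trained generator.

```python
import numpy as np

def nca_step(grid, kernel, rule):
    """Sketch of a single cellular-automaton-style update: 'kernel' aggregates
    each cell's 3x3 neighbourhood and 'rule' maps the aggregate to the next
    state. Because the rule is purely local, the same generator applies to
    grids of any size."""
    h, w = grid.shape
    padded = np.pad(grid, 1, mode="wrap")
    agg = np.zeros_like(grid, dtype=float)
    # Fixed 3x3 perception kernel stands in for a learned perception convolution.
    for dy in range(3):
        for dx in range(3):
            agg += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return rule(agg)

# Toy usage: threshold rule producing a binary obstacle map of any size.
kernel = np.ones((3, 3)) / 9.0
rule = lambda a: (a > 0.5).astype(float)
env = nca_step(np.random.rand(64, 64), kernel, rule)   # works for 1024x1024 too
```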
How Powerful are Performance Predictors in Neural Architecture Search?
Early methods in the rapidly developing field of neural architecture search (NAS) required fully training thousands of neural networks. To reduce this extreme computational cost, dozens of techniques have since been proposed to predict the final performance of neural architectures. Despite the success of such performance prediction methods, it is not well understood how different families of techniques compare to one another, due to the lack of an agreed-upon evaluation metric and because different methods are optimized for different constraints on initialization time and query time. In this work, we give the first large-scale study of performance predictors by analyzing 31 techniques ranging from learning curve extrapolation, to weight-sharing, to supervised learning, to zero-cost proxies. We test a number of correlation- and rank-based performance measures in a variety of settings, as well as the ability of each technique to speed up predictor-based NAS frameworks.
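As an illustration of the rank-based measures such a study relies on, a minimal evaluation loop might look like the sketch below; the accuracies are synthetic placeholders standing in for a real NAS benchmark and a real predictor.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Synthetic stand-ins: 'true_acc' would come from fully trained architectures
# (e.g. a tabular NAS benchmark) and 'pred_acc' from a performance predictor.
rng = np.random.default_rng(0)
true_acc = rng.uniform(90.0, 95.0, size=200)
pred_acc = true_acc + rng.normal(0.0, 0.5, size=200)   # noisy predictions

tau, _ = kendalltau(true_acc, pred_acc)
rho, _ = spearmanr(true_acc, pred_acc)
print("Kendall tau :", tau)
print("Spearman rho:", rho)

# Top-k regret: accuracy lost by trusting the predictor's k highest-ranked models.
k = 10
top_pred = np.argsort(pred_acc)[-k:]
print("Top-%d regret: %.3f" % (k, true_acc.max() - true_acc[top_pred].max()))
```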
Unifying and Boosting Gradient-Based Training-Free Neural Architecture Search
Neural architecture search (NAS) has gained immense popularity owing to its ability to automate neural architecture design. A number of training-free metrics have recently been proposed to realize NAS without training, hence making NAS more scalable. Despite their competitive empirical performance, a unified theoretical understanding of these training-free metrics is lacking. As a consequence, (a) the relationships among these metrics are unclear, (b) there is no theoretical interpretation for their empirical performance, and (c) there may exist untapped potential in existing training-free NAS, which could be unveiled through a unified theoretical understanding. To this end, this paper presents a unified theoretical analysis of gradient-based training-free NAS, which allows us to (a) theoretically study their relationships, (b) theoretically guarantee their generalization performance, and (c) exploit our unified theoretical understanding to develop a novel framework named hybrid NAS (HNAS), which consistently boosts training-free NAS in a principled way.
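As one concrete member of the gradient-based, training-free family analyzed here, a simple gradient-norm proxy computed at initialization on a single minibatch can be sketched as below; this is a generic zero-cost metric in that spirit, not HNAS itself, and the candidate networks are toy placeholders.

```python
import torch
import torch.nn as nn

def grad_norm_score(model, inputs, targets, loss_fn=nn.CrossEntropyLoss()):
    """Sketch of a simple gradient-based training-free proxy: the summed
    parameter-gradient norm on one minibatch at initialization."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

# Toy usage: rank two candidate architectures on one random batch, no training.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
cand_a = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
cand_b = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
scores = {name: grad_norm_score(m, x, y) for name, m in [("a", cand_a), ("b", cand_b)]}
print(scores)
```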
BLOX: Macro Neural Architecture Search Benchmark and Algorithms
Neural architecture search (NAS) has been successfully used to design numerous high-performance neural networks. However, NAS is typically compute-intensive, so most existing approaches restrict the search to deciding the operations and topological structure of a single block only; the same block is then stacked repeatedly to form an end-to-end model. Although such an approach reduces the size of the search space, recent studies show that a macro search space, which allows blocks in a model to be different, can lead to better performance. To provide a systematic study of the performance of NAS algorithms on a macro search space, we release Blox – a benchmark that consists of 91k unique models trained on the CIFAR-100 dataset. The dataset also includes runtime measurements of all the models on a diverse set of hardware platforms. We perform extensive experiments to compare existing algorithms that are well studied on cell-based search spaces with the emerging blockwise approaches that aim to make NAS scalable to much larger macro search spaces.
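A quick back-of-the-envelope comparison shows why macro spaces grow so fast; the specific counts below are an assumption chosen only so that the macro total lands near the benchmark's 91k models, not a description of the actual Blox design.

```python
# Illustrative only: with K candidate block designs and B block positions,
# a cell-based space that repeats one block offers K choices, while a macro
# space that lets every block differ offers K**B.
K, B = 45, 3                   # assumed numbers for illustration
print("cell-based:", K)        # 45
print("macro     :", K ** B)   # 91125, on the order of 91k
```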
Learning Graph Cellular Automata
Cellular automata (CA) are a class of computational models that exhibit rich dynamics emerging from the local interaction of cells arranged in a regular lattice. In this work we focus on a generalised version of typical CA, called graph cellular automata (GCA), in which the lattice structure is replaced by an arbitrary graph. In particular, we extend previous work that used convolutional neural networks to learn the transition rule of conventional CA and we use graph neural networks to learn a variety of transition rules for GCA. First, we present a general-purpose architecture for learning GCA, and we show that it can represent any arbitrary GCA with finite and discrete state space. Then, we test our approach on three different tasks: 1) learning the transition rule of a GCA on a Voronoi tessellation; 2) imitating the behaviour of a group of flocking agents; 3) learning a rule that converges to a desired target state.
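A minimal message-passing transition of the kind such a model learns is sketched below; it is a generic graph-neural-network update (each node's next state depends on its own state and an aggregate of its neighbours'), not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GCAStep(nn.Module):
    """Sketch of a graph-cellular-automaton transition: each node's next state
    is an MLP of its own state and the sum of its neighbours' states."""
    def __init__(self, state_dim, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * state_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, state_dim))

    def forward(self, states, adj):
        # states: (N, state_dim); adj: (N, N) adjacency of an arbitrary graph.
        neighbour_sum = adj @ states
        return self.mlp(torch.cat([states, neighbour_sum], dim=-1))

# Toy rollout on a random graph: apply the same local rule for five steps.
N, D = 16, 4
adj = (torch.rand(N, N) < 0.2).float()
states = torch.rand(N, D)
step = GCAStep(D)
for _ in range(5):
    states = step(states, adj)
```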
Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
Traditional knowledge distillation (KD) methods manually design student architectures to compress large models given a pre-specified computational cost. This requires several trials to find a viable student, and the process must be repeated whenever the computational budget changes. We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model. Existing NAS methods train a single SuperLM consisting of millions of subnetworks with weight-sharing, resulting in interference between subnetworks of different sizes. Additionally, many of these works are task-specific, requiring task labels for SuperLM training.
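To illustrate the weight-sharing that underlies the interference mentioned above, the sketch below lets sub-networks of different widths slice one shared weight matrix; this reflects generic super-network practice, not the paper's specific SuperLM implementation.

```python
import torch
import torch.nn as nn

class SharedLinear(nn.Module):
    """Sketch of weight-sharing in a super-network: sub-networks of different
    widths reuse slices of one large weight matrix, which is also where
    interference between differently sized sub-networks can arise."""
    def __init__(self, max_in, max_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x, out_dim):
        # Select a sub-network by slicing the shared parameters.
        w = self.weight[:out_dim, : x.shape[-1]]
        b = self.bias[:out_dim]
        return x @ w.t() + b

# Toy usage: two student candidates of different widths share the same weights.
layer = SharedLinear(max_in=512, max_out=512)
x = torch.randn(2, 512)
small = layer(x, out_dim=128)   # a narrow student candidate
large = layer(x, out_dim=512)   # a wide student candidate
```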