
Should We Learn Most Likely Functions or Parameters? Tim G. J. Rudner

Neural Information Processing Systems

Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insofar as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model so that any setting of parameters maximizes the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure leads to pathological solutions when using neural networks, prove conditions under which it is well-behaved, and propose a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.
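The reparametrization point can be made concrete with a toy example. The sketch below is our own illustration, not the paper's method: a 1D linear model with a Gaussian prior, where rewriting the weight as w = v^3 shifts the parameter-space MAP estimate even though the set of realizable functions is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + 0.1 * rng.normal(size=20)

def log_lik(w):
    # Gaussian likelihood with noise std 0.1
    return -0.5 * np.sum((y - w * x) ** 2) / 0.1 ** 2

# Parameter-space MAP under the prior w ~ N(0, 1).
w_grid = np.linspace(-5, 5, 20001)
log_post_w = np.array([log_lik(w) for w in w_grid]) - 0.5 * w_grid ** 2
w_map = w_grid[np.argmax(log_post_w)]

# Reparametrize w = v**3. The induced density on v picks up the Jacobian
# |dw/dv| = 3 v^2, so the parameter-space maximizer moves even though the
# model expresses exactly the same set of functions as before.
v_grid = np.linspace(-2, 2, 20001)
log_post_v = (np.array([log_lik(v ** 3) for v in v_grid])
              - 0.5 * v_grid ** 6                 # prior on w at w = v^3
              + np.log(3 * v_grid ** 2 + 1e-12))  # change-of-variables term
v_map = v_grid[np.argmax(log_post_v)]

print(f"MAP in w-space:              w = {w_map:.3f}")
print(f"MAP in v-space, mapped to w: w = {v_map ** 3:.3f}")
```

The two printed values differ, which is exactly the ambiguity the abstract points at: "most likely parameters" depends on the parametrization, while the induced function does not.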


Chain of Thought Imitation with Procedure Cloning

Neural Information Processing Systems

Imitation learning aims to extract high-performance policies from logged demonstrations of expert behavior. It is common to frame imitation learning as a supervised learning problem in which one fits a function approximator to the input-output mapping exhibited by the logged demonstrations (input observations to output actions). While the framing of imitation learning as a supervised input-output learning problem allows for applicability in a wide variety of settings, it is also an overly simplistic view of the problem in situations where the expert demonstrations provide much richer insight into expert behavior. For example, applications such as path navigation, robot manipulation, and strategy games acquire expert demonstrations via planning, search, or some other multi-step algorithm, revealing not just the output action to be imitated but also the procedure for how to determine this action. While these intermediate computations may use tools not available to the agent during inference (e.g., environment simulators), they are nevertheless informative as a way to explain an expert's mapping of state to actions. To properly leverage expert procedure information without relying on the privileged tools the expert may have used to perform the procedure, we propose procedure cloning, which applies supervised sequence prediction to imitate the series of expert computations. This way, procedure cloning learns not only what to do (i.e., the output action), but how and why to do it (i.e., the procedure). Through empirical analysis on navigation, simulated robotic manipulation, and game-playing environments, we show that imitating the intermediate computations of an expert's behavior enables procedure cloning to learn policies exhibiting significant generalization to unseen environment configurations, including those configurations for which running the expert's procedure directly is infeasible.
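As a rough picture of what supervised sequence prediction over expert computations can look like, here is a minimal sketch. The architecture, vocabulary, and shapes are hypothetical stand-ins, not the paper's model: an autoregressive network is trained to emit the expert's intermediate computation tokens first and the action as the final token.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a shared vocabulary covers both intermediate
# computation tokens and final action tokens.
VOCAB, OBS_DIM, HID = 32, 16, 64

class ProcedureCloner(nn.Module):
    """Autoregressive model of p(x_1, ..., x_T, a | obs): the expert's
    intermediate computations x_1..x_T come first, the action a last."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.obs_proj = nn.Linear(OBS_DIM, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.head = nn.Linear(HID, VOCAB)

    def forward(self, obs, seq):
        h0 = torch.tanh(self.obs_proj(obs)).unsqueeze(0)  # condition on obs
        out, _ = self.rnn(self.embed(seq), h0)
        return self.head(out)                             # next-token logits

model = ProcedureCloner()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy batch of random stand-ins for logged (obs, procedure + action)
# sequences; seq[:, 0] acts as a start token, seq[:, -1] is the action.
obs = torch.randn(8, OBS_DIM)
seq = torch.randint(VOCAB, (8, 10))

logits = model(obs, seq[:, :-1])                          # teacher forcing
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
```

At inference the model would roll out the procedure tokens one by one and the final generated token would be read off as the action, so the policy explains its answer before giving it.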


Appendix

Neural Information Processing Systems

The format is the same as in Figures 2c and 2d. The peak correlation versus segment duration curve tended to approach an asymptotic value at long segment durations (see Figure 2d). For simplicity, we estimated this asymptotic value for each unit by measuring the peak cross-context correlation across lags at the longest segment duration tested (2.48 seconds), i.e., the rightmost values in the curves shown in Figure 2d. Convolutional layers have a maximum value of 1, as expected since they have a well-defined upper bound on their integration window. The LSTM layers also showed high maximum values (the median correlation across units was above 0.93 for all layers), indicating a mostly context-invariant response.
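For reference, the peak cross-context correlation used above can be computed along the following lines. This is a sketch under our own conventions for trace alignment; the function name and the example traces are illustrative.

```python
import numpy as np

def peak_cross_context_corr(resp_a, resp_b, max_lag):
    """Peak Pearson correlation across lags between one unit's responses to
    the same segments embedded in two different contexts.
    resp_a, resp_b: 1-D response traces aligned to the shared segments."""
    best = -1.0
    n = len(resp_a)
    for lag in range(-max_lag, max_lag + 1):
        a = resp_a[max(lag, 0): n + min(lag, 0)]
        b = resp_b[max(-lag, 0): n + min(-lag, 0)]
        if len(a) > 1 and a.std() > 0 and b.std() > 0:
            best = max(best, np.corrcoef(a, b)[0, 1])
    return best

# Asymptote estimate as described above: evaluate at the longest tested
# segment duration (2.48 s); resp_ctx1/resp_ctx2 are illustrative traces.
resp_ctx1 = np.random.default_rng(0).normal(size=200)
resp_ctx2 = resp_ctx1 + 0.1 * np.random.default_rng(1).normal(size=200)
print(peak_cross_context_corr(resp_ctx1, resp_ctx2, max_lag=20))
```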


Understanding Adaptive, Multiscale Temporal Integration In Deep Speech Recognition Systems Sam V. Norman-Haignere

Neural Information Processing Systems

Natural signals such as speech are hierarchically structured across many different timescales, spanning tens (e.g., phonemes) to hundreds (e.g., words) of milliseconds, each of which is highly variable and context-dependent. While deep neural networks (DNNs) excel at recognizing complex patterns from natural signals, relatively little is known about how DNNs flexibly integrate across multiple timescales. Here, we show how a recently developed method for studying temporal integration in biological neural systems - the temporal context invariance (TCI) paradigm - can be used to understand temporal integration in DNNs. The method is simple: we measure responses to a large number of stimulus segments presented in two different contexts and estimate the smallest segment duration needed to achieve a context invariant response. We applied our method to understand how the popular DeepSpeech2 model learns to integrate across time in speech.
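In code, the TCI estimate reduces to a simple threshold search over segment durations. A minimal sketch follows; the threshold value, function name, and example numbers are our own choices, not the paper's.

```python
import numpy as np

def smallest_invariant_duration(durations_s, peak_corrs, threshold=0.75):
    """Estimate the integration window: the shortest segment duration whose
    peak cross-context correlation already exceeds a criterion value.
    durations_s: tested segment durations in seconds, ascending.
    peak_corrs:  peak correlation across lags for each duration."""
    for d, r in zip(durations_s, peak_corrs):
        if r >= threshold:
            return d
    return float("inf")  # never context invariant within the tested range

# Illustrative numbers only:
durations = [0.02, 0.04, 0.08, 0.16, 0.31, 0.62, 1.24, 2.48]
corrs = [0.10, 0.20, 0.35, 0.55, 0.72, 0.86, 0.93, 0.95]
print(smallest_invariant_duration(durations, corrs))  # -> 0.62
```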


Appendices A

Neural Information Processing Systems

In this appendix we present the general version of Definition 3, allowing harm and benefit to be measured along specific causal paths. The path-specific counterfactual harm measures the harm caused by an action A = a compared to a default action A = ā when, rather than generating the counterfactual outcome by including all causal paths from A = ā to the outcome variables Y, we consider only the effect along certain paths g. This is somewhat analogous to the path-specific causal effect [81], as we are using the g-specific intervention A = ā on Y in the counterfactual world relative to the reference A = a (the factual action). Let G be the DAG associated with the model M, let g be the edge sub-graph of G containing the paths we include in the harm analysis, and let E = e be the joint state of the exogenous noise variables in M. The expected benefit is defined analogously. However, in (12) and (14) we condition on the state of all factual variables, assume no unobserved confounders, and take the reference action to be the factual action. In these examples we typically focus on causal models with an action A, an outcome Y, and a mediating outcome Z such that A → Z → Y. We refer to the path-specific harm restricted to A → Y as the 'direct harm', the path-specific harm restricted to A → Z → Y as the 'indirect harm', and the harm with no causal path excluded as the 'total harm'.

In this appendix we also discuss the omission problem and the pre-emption problem [43], as well as the preventing worse problem [82], and show how these can be resolved using our definition of counterfactual harm (Definition 3 and its path-specific variant, Definition 9). We also discuss some alternative definitions of harm.

Omission problem: Alice decides not to give Bob a set of golf clubs. Bob would be happy if Alice had given him the golf clubs. Therefore, according to the CCA, Alice's decision not to give Bob the clubs causes Bob harm. However, intuitively Alice has not harmed Bob, but merely failed to benefit him [43]. In our definition of harm, the natural default action is that Alice does not give Bob the clubs; i.e., the desired harm query is the harm caused by Alice's action compared to the baseline in which Alice does not give Bob the clubs.
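To make the direct/indirect/total distinction concrete, here is a Monte-Carlo sketch on a tiny structural causal model. The structural equations, the identification of utility with Y, and the expected-shortfall form of the harm are illustrative assumptions; see Definitions 3 and 9 in the paper for the exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny SCM with action A, mediator Z, outcome Y, exogenous noise (e_Z, e_Y):
#   Z = f_Z(A, e_Z),  Y = f_Y(A, Z, e_Y).
f_Z = lambda a, ez: a + ez
f_Y = lambda a, z, ey: 2 * a - z + ey

def path_specific_harm(a, a_default, n=100_000, direct=True, indirect=True):
    """Monte-Carlo stand-in for the path-specific harm: the expected shortfall
    of the factual outcome Y(a) relative to a counterfactual in which the
    default action a_default is propagated only along the selected paths,
    sharing the same exogenous noise (abduction step)."""
    ez, ey = rng.normal(size=n), rng.normal(size=n)
    y_fact = f_Y(a, f_Z(a, ez), ey)
    a_dir = a_default if direct else a    # edge A -> Y
    a_ind = a_default if indirect else a  # path A -> Z -> Y
    y_cf = f_Y(a_dir, f_Z(a_ind, ez), ey)
    return np.maximum(y_cf - y_fact, 0).mean()  # harm: counterfactual better

a, a0 = 1.0, 0.0
print("total harm:   ", path_specific_harm(a, a0))
print("direct harm:  ", path_specific_harm(a, a0, indirect=False))
print("indirect harm:", path_specific_harm(a, a0, direct=False))
```

In this toy model the action helps directly (the A → Y edge contributes +2) but hurts through the mediator (A → Z → Y contributes −1), so the indirect harm is positive while the direct and total harms are zero.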


Counterfactual harm Rory Beard DeepMind

Neural Information Processing Systems

To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. However, to date there is no statistical method for measuring harm and factoring it into algorithmic decisions. In this paper we propose the first formal definition of harm and benefit using causal models. We show that any factual definition of harm is incapable of identifying harmful actions in certain scenarios, and show that standard machine learning algorithms that cannot perform counterfactual reasoning are guaranteed to pursue harmful policies following distributional shifts. We use our definition of harm to devise a framework for harm-averse decision making using counterfactual objective functions. We demonstrate this framework on the problem of identifying optimal drug doses using a dose-response model learned from randomized controlled trial data. We find that the standard method of selecting doses using treatment effects results in unnecessarily harmful doses, while our counterfactual approach identifies doses that are significantly less harmful without sacrificing efficacy.
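To make the contrast with effect-maximizing dose selection concrete, here is a minimal sketch of a harm-averse objective. The dose-response curves, the λ weight, and the additive benefit-minus-harm form are illustrative assumptions, not the paper's fitted model or exact objective.

```python
import numpy as np

def harm_averse_dose(doses, benefit, harm, lam=2.0):
    """Pick the dose maximizing a counterfactual objective
    E[benefit] - lam * E[harm], rather than the treatment effect alone.
    benefit(d), harm(d): estimates from a learned dose-response model."""
    scores = [benefit(d) - lam * harm(d) for d in doses]
    return doses[int(np.argmax(scores))]

# Illustrative dose-response curves (assumptions, not the paper's model):
benefit = lambda d: 1 - np.exp(-d)            # efficacy saturates with dose
harm    = lambda d: np.maximum(d - 1, 0) ** 2 # overdose harm past d = 1

doses = np.linspace(0, 3, 301)
print("harm-averse dose:      ",
      harm_averse_dose(doses, benefit, harm))
print("effect-maximizing dose:",
      doses[int(np.argmax([benefit(d) for d in doses]))])
```

As in the paper's finding, the purely effect-maximizing rule pushes to the largest dose, while weighting harm pulls the chosen dose back with little loss in efficacy.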


A Implementation Details

Neural Information Processing Systems

A.1 Zero123-XL

A batch size of 2048 is used during training with a learning rate of 1e-4. Unlike the original paper [15], we performed a second-stage finetuning with a smaller learning rate of 5e-5 on a high-quality subset of Objaverse-XL selected using dataset metadata. The first stage was trained for 375K iterations and the second stage for 65K iterations. For the dataset scaling experiment, whose results are shown in Figure 6, datasets with fewer than 800K objects are randomly sampled subsets of Objaverse 1.0. We keep the rest of the settings consistent with the original paper [15]. Training the Zero123-XL model was done on 256 NVIDIA A100s over the course of 4 days, for a total of around 25K GPU hours. Rendering the objects took around 1 week on 48 NVIDIA T4 GPUs, totaling around 8K GPU hours. Both training and rendering were conducted on AWS.

We trained the PixelNeRF models on 8 NVIDIA A100s over the course of 2 days with a total batch size of 256 for 200 epochs. We used a constant learning rate of 1e-4. We added gradient clipping to the original implementation, as we found the loss would otherwise destabilize during training. We trained the model with 1 input view for the experiments reported in the main paper.
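For quick reference, the reported hyperparameters can be collected into a config sketch. The field names and structure here are our own; only the numbers come from the text above.

```python
# Two-stage Zero123-XL finetuning schedule as reported above.
zero123_xl_stages = [
    {"stage": 1, "iterations": 375_000, "lr": 1e-4, "batch_size": 2048,
     "data": "Objaverse-XL"},
    {"stage": 2, "iterations": 65_000, "lr": 5e-5, "batch_size": 2048,
     "data": "high-quality Objaverse-XL subset (selected via metadata)"},
]

# PixelNeRF training setup as reported above.
pixelnerf_cfg = {
    "gpus": 8,            # NVIDIA A100
    "batch_size": 256,
    "epochs": 200,
    "lr": 1e-4,           # constant schedule
    "grad_clip": True,    # added to stabilize the loss
    "input_views": 1,
}
```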


Objaverse-XL: A Universe of 10M+ 3D Objects

Neural Information Processing Systems

Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data.


Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs Dongruo Zhou

Neural Information Processing Systems

Recent studies have shown that episodic reinforcement learning (RL) is not more difficult than contextual bandits, even with a long planning horizon and unknown state transitions. However, these results are limited to either tabular Markov decision processes (MDPs) or computationally inefficient algorithms for linear mixture MDPs.

