
Supplementary material for Dynamic Causal Bayesian Optimisation

Neural Information Processing Systems

This supplementary material gives the proofs of Theorem 1 and Proposition 3.1 from the main text, together with additional experimental details for the experiments discussed in Section 4. The proof of Theorem 1 conditions on the set W of variables that are parents of Y, rewriting Eq. (6) by exploiting Eq. (8) and further expanding Eq. (11). In the experiments, notice how the location of the optimum changes significantly, both in terms of the optimal intervention set and the intervention value, when going from t = 0 to t = 1.


Dataset Distillation using Neural Feature Regression

Neural Information Processing Systems

Dataset distillation aims to learn a small synthetic dataset that preserves most of the information from the original dataset. Dataset distillation can be formulated as a bi-level meta-learning problem where the outer loop optimizes the meta-dataset and the inner loop trains a model on the distilled data. Meta-gradient computation is one of the key challenges in this formulation, as differentiating through the inner-loop learning procedure introduces significant computation and memory costs. In this paper, we address these challenges using neural Feature Regression with Pooling (FRePo), achieving state-of-the-art performance with an order of magnitude less memory and two orders of magnitude faster training than previous methods. The proposed algorithm is analogous to truncated backpropagation through time with a pool of models to alleviate various types of overfitting in dataset distillation. FRePo significantly outperforms previous methods on CIFAR100, Tiny ImageNet, and ImageNet-1K. Furthermore, we show that high-quality distilled data can greatly improve various downstream applications, such as continual learning and membership inference defense. Please check out our webpage at https://sites.google.com/view/frepo.
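
As a rough illustration of the feature-regression idea (a hedged sketch under simplifying assumptions, not the FRePo implementation; the function names, shapes, and regularizer below are made up for illustration), the inner loop can be replaced by a closed-form ridge regression from features of the distilled data to their labels, with the distillation loss measured on real data and backpropagated to the distilled images with autodiff:

```python
# Minimal sketch (illustrative only, not the FRePo code): the inner loop becomes a
# closed-form kernel ridge regression on neural features of the distilled data; the
# outer (distillation) loss is the regressor's error on a real batch. In practice the
# features come from a pool of partially trained networks.
import numpy as np

def feature_regression_loss(feat_syn, y_syn, feat_real, y_real, reg=1e-3):
    """feat_syn: (M, D) features of M distilled points; y_syn: (M, K) one-hot labels;
    feat_real: (N, D), y_real: (N, K). Returns mean squared error on the real batch."""
    K_ss = feat_syn @ feat_syn.T                         # (M, M) kernel on distilled data
    K_rs = feat_real @ feat_syn.T                        # (N, M) cross kernel
    ridge = reg * np.trace(K_ss) / len(K_ss) * np.eye(len(K_ss))
    alpha = np.linalg.solve(K_ss + ridge, y_syn)         # closed-form "inner loop"
    preds = K_rs @ alpha
    return np.mean((preds - y_real) ** 2)

feat_syn, y_syn = np.random.randn(10, 128), np.eye(10)                 # 10 distilled examples
feat_real, y_real = np.random.randn(256, 128), np.eye(10)[np.random.randint(0, 10, 256)]
print(feature_regression_loss(feat_syn, y_syn, feat_real, y_real))
```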



Self-Supervised Few-Shot Learning on Point Clouds

Neural Information Processing Systems

Visualization of ball covers. The cover-tree approach of using balls to group the points in a point cloud is visualized in Figure 1. The visualization shows balls, drawn as transparent spheres, at different scales and packing densities in a cover-tree. Fig. 1a shows the top level (root) of the cover-tree, which covers the point cloud with a single ball at level i. Figs. 1b and 1c show the balls at lower levels, with smaller radii, as the tree is descended. Thus, we learn local features using balls at various levels with different packing densities. A.1 3D Object Classification Training. This section provides the implementation details of our proposed self-supervised network.
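
For intuition, here is a minimal sketch (not the authors' code; the greedy construction below only loosely mimics cover-tree levels) of grouping a point cloud into balls of decreasing radius:

```python
# Minimal sketch: greedy ball covers of a point cloud at successively smaller radii.
import numpy as np

def ball_cover(points: np.ndarray, radius: float) -> list:
    """Greedily pick centers so every point lies within `radius` of some center;
    return the index set of each resulting ball."""
    remaining = np.arange(len(points))
    balls = []
    while remaining.size:
        center = points[remaining[0]]
        dists = np.linalg.norm(points[remaining] - center, axis=1)
        balls.append(remaining[dists <= radius])
        remaining = remaining[dists > radius]
    return balls

# Descending levels: each level halves the radius, giving denser, smaller balls.
points = np.random.rand(1024, 3)
root_radius = np.linalg.norm(points.max(0) - points.min(0))   # one ball covers everything
for level, r in enumerate([root_radius, root_radius / 2, root_radius / 4]):
    print(f"level {level}: {len(ball_cover(points, r))} balls of radius {r:.2f}")
```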


Multiview Human Body Reconstruction from Uncalibrated Cameras

Neural Information Processing Systems

We present a new method to reconstruct 3D human body pose and shape by fusing visual features from multiview images captured by uncalibrated cameras. Existing multiview approaches often use spatial camera calibration (intrinsic and extrinsic parameters) to geometrically align and fuse visual features. Despite remarkable performance, the requirement of camera calibration restricts their applicability to real-world scenarios, e.g., reconstruction from social videos with wide-baseline cameras. We address this challenge by leveraging the commonly observed human body as a semantic calibration target, which eliminates the requirement of camera calibration. Specifically, we map per-pixel image features to a canonical body surface coordinate system, agnostic to views and poses, using dense keypoints (correspondences). This feature mapping allows us to semantically, instead of geometrically, align and fuse visual features from multiview images. We learn a self-attention mechanism to reason about the confidence of visual features across and within views. With the fused visual features, a regressor is learned to predict the parameters of a body model. We demonstrate that our calibration-free multiview fusion method reliably reconstructs 3D body pose and shape, outperforming state-of-the-art single-view methods with post-hoc multiview fusion, particularly in the presence of non-trivial occlusion, and showing comparable accuracy to multiview methods that require calibration.
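
As a toy illustration of the view-fusion step (a sketch under assumptions, not the paper's architecture: the confidence scores here are placeholders, whereas the paper predicts them with a learned self-attention mechanism):

```python
# Minimal sketch: confidence-weighted fusion of per-view features that live on a shared
# canonical body-surface coordinate grid.
import numpy as np

def fuse_views(feats: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """feats: (V, S, C) features for V views on S surface coordinates;
    scores: (V, S) per-view confidence logits. Returns (S, C) fused features."""
    w = np.exp(scores - scores.max(axis=0, keepdims=True))   # softmax over views
    w = w / w.sum(axis=0, keepdims=True)
    return (w[..., None] * feats).sum(axis=0)

V, S, C = 4, 256, 64                       # 4 views, 256 surface cells, 64-dim features
feats = np.random.randn(V, S, C)
scores = np.random.randn(V, S)             # in the paper, produced by self-attention
fused = fuse_views(feats, scores)          # (256, 64), fed to the body-model regressor
```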


42cd63cb189c30ed03e42ce2c069566c-AuthorFeedback.pdf

Neural Information Processing Systems

We sincerely thank all reviewers for their constructive comments. We hope this work sheds some light on a better understanding of parameter sharing in NAS. We sincerely appreciate your recognition of our technical contributions (Line 181). Meanwhile, as you pointed out, different optimization strategies for APS would be interesting to explore in the future.


DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Neural Information Processing Systems

Solving mathematical problems requires advanced reasoning abilities and presents notable challenges for large language models. Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial to learning complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates more trials to difficult queries during the synthesis phase, enabling more extensive training on difficult samples. Utilizing DART, we have created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Remarkably, our synthesis process relies solely on a 7B-sized open-weight model, without reliance on the commonly used proprietary GPT-4. We fine-tune various base models, ranging from 7B to 70B in size, on our datasets, resulting in a series of strong models called DART-Math. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-Math significantly outperforms vanilla rejection tuning and is superior or comparable to prior methods, despite using much smaller datasets and no proprietary models.
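
A minimal sketch of the difficulty-aware allocation idea (the helper names and the specific allocation rule below are illustrative assumptions, not the DART recipe):

```python
# Minimal sketch: rejection sampling that gives harder queries more synthesis trials.
import random

def synthesize(query: str) -> str:
    """Placeholder for sampling one candidate solution from an open-weight model."""
    return f"candidate solution for: {query}"

def is_correct(query: str, answer: str) -> bool:
    """Placeholder checker (e.g., compare the final answer to the reference)."""
    return random.random() > 0.7

def dart_collect(queries_with_difficulty, base_trials=4, max_trials=64):
    dataset = []
    for query, difficulty in queries_with_difficulty:     # difficulty in [0, 1]
        # Harder (rarely solved) queries receive proportionally more trials.
        trials = min(max_trials, int(base_trials / max(1.0 - difficulty, 1e-2)))
        correct = [a for a in (synthesize(query) for _ in range(trials))
                   if is_correct(query, a)]
        dataset.extend((query, a) for a in correct)
    return dataset

data = dart_collect([("easy integral", 0.1), ("competition-style geometry", 0.95)])
```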


A Spatial Conditioning Without Bubble Artifacts

Neural Information Processing Systems

Let us begin by recalling how SPADE works and studying where its defects come from. The normalization statistics are calculated via averages over examples and all spatial dimensions. To clarify, the subtraction and division in (3) are broadcast over the non-channel dimensions, and the pointwise multiplication and addition are broadcast over examples. SPADE layers are remarkably similar to the Adaptive Instance Normalization (AdaIN) layers used in StyleGAN to condition on z. Finally, the conditioning of the generator's output y = g(z) (StyleGAN is an unconditional generative model) is done via AdaIN layers conditioned on s(z).
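
A minimal sketch of the modulation described above (illustrative only, not the paper's code: gamma and beta are per-channel vectors as in AdaIN, whereas SPADE would instead predict spatially varying maps from the conditioning input):

```python
# Minimal sketch: normalization statistics averaged over examples and spatial dimensions,
# followed by an AdaIN/SPADE-style affine modulation, with the broadcasting noted above.
import numpy as np

def modulated_norm(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,). Returns the modulated activations."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)      # averages over examples and space
    sigma = x.std(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / (sigma + eps)                # broadcast over non-channel dims
    # Pointwise scale and shift, broadcast over examples.
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

x = np.random.randn(8, 64, 32, 32)
gamma, beta = np.random.randn(64), np.random.randn(64)   # predicted from s(z) in AdaIN
y = modulated_norm(x, gamma, beta)
```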


Approximate Gaussian process inference for the drift function in stochastic differential equations

Neural Information Processing Systems

We introduce a nonparametric approach for estimating drift functions in systems of stochastic differential equations from sparse observations of the state vector. Using a Gaussian process prior over the drift as a function of the state vector, we develop an approximate EM algorithm to deal with the unobserved, latent dynamics between observations. The posterior over states is approximated by a piecewise linearized process of the Ornstein-Uhlenbeck type and the MAP estimation of the drift is facilitated by a sparse Gaussian process regression.
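
For intuition only, a heavily simplified sketch: with densely observed states (unlike the sparse-observation setting the paper actually addresses with its approximate EM algorithm), Euler-Maruyama increments give noisy drift targets that a GP regression can smooth. All hyperparameters below are arbitrary assumptions.

```python
# Minimal sketch: estimate a 1D drift function from dense observations of an SDE path
# by GP regression on the increments. The paper's sparse-data EM machinery is not shown.
import numpy as np

def rbf(a, b, ls=0.5, var=1.0):
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

dt, n, sigma = 0.01, 2000, 0.5
x = np.zeros(n)
for t in range(n - 1):                       # simulate dX = (X - X^3) dt + sigma dW
    x[t + 1] = x[t] + (x[t] - x[t] ** 3) * dt + sigma * np.sqrt(dt) * np.random.randn()

targets = np.diff(x) / dt                    # noisy drift observations at x[:-1]
grid = np.linspace(x.min(), x.max(), 100)
K = rbf(x[:-1], x[:-1]) + (sigma ** 2 / dt) * np.eye(n - 1)      # increment noise variance
drift_mean = rbf(grid, x[:-1]) @ np.linalg.solve(K, targets)     # GP posterior mean on grid
```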


From Stochastic Mixability to Fast Rates

Neural Information Processing Systems

Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems in which the data are generated according to some unknown distribution P; ERM returns a hypothesis f chosen from a fixed class F with small loss l. In the parametric setting, depending upon (l, F, P), ERM can have slow (1/√n) or fast (1/n) rates of convergence of the excess risk as a function of the sample size n. There exist several results that give sufficient conditions for fast rates in terms of joint properties of l, F, and P, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss l (there being no role there for F or P). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of (l, F, P), and in so doing provides new insight into the fast-rates phenomenon.
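
For reference, the central condition can be written as follows (a standard formulation of stochastic mixability, paraphrased here rather than quoted from the paper):

```latex
% Stochastic mixability (standard formulation, paraphrased). Let f^* \in \mathcal{F}
% minimize the expected loss under P. Then (\ell, \mathcal{F}, P) is
% \eta-stochastically mixable, for some \eta > 0, if for all f \in \mathcal{F}
\[
  \mathbb{E}_{Z \sim P}\!\left[ \exp\!\big( -\eta \big( \ell(f, Z) - \ell(f^*, Z) \big) \big) \right] \le 1 .
\]
% Classical mixability is recovered as a special case, and under this condition the
% paper derives O(1/n) (fast) excess-risk rates for ERM.
```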