theta
b125999bde7e80910cbdbd323087df8f-Supplemental-Conference.pdf
Foreachprompt, wecompare 6 pairs of models: Quark versus other baselines, as shown in Table 2. These agreement scores are moderate as result of subjectivity involved in ratings of text quality. PPLM (Plug and Play Language Model) uses one or more classifiers to control attributes of model generations. Figure 8: Screenshot of the mechanical turk interfaced used to gather human judgments for the sentimentevaluation. Unlikelihood represents a GPT-2 model fine-tuned with unlikelihoodobjective(Eqn.5)[79].
A Trichotomy for Transductive Online Learning
This setting is similar to standard online learning, except that the adversary fixes a sequence of instances $x_1,\dots,x_n$ to be labeled at the start of the game, and this sequence is known to the learner. Qualitatively, we prove a \emph{trichotomy}, stating that the minimal number of mistakes made by the learner as $n$ grows can take only one of precisely three possible values: $n$, $\Theta\left(\log (n)\right)$, or $\Theta(1)$. Furthermore, this behavior is determined by a combination of the VC dimension and the Littlestone dimension. Quantitatively, we show a variety of bounds relating the number of mistakes to well-known combinatorial dimensions. In particular, we improve the known lower bound on the constant in the $\Theta(1)$ case from $\Omega\left(\sqrt{\log(d)}\right)$ to $\Omega(\log(d))$ where $d$ is the Littlestone dimension. Finally, we extend our results to cover multiclass classification and the agnostic setting.
The Semi-Random Satisfaction of Voting Axioms
We initiate the work towards a comprehensive picture of the worst average-case satisfaction of voting axioms in semi-random models, to provide a finer and more realistic foundation for comparing voting rules. We adopt the semi-random model and formulation in [Xia 2020], where an adversary chooses arbitrarily correlated ``ground truth'' preferences for the agents, on top of which random noises are added. We focus on characterizing the semi-random satisfaction of two well-studied voting axioms: Condorcet criterion and participation. We prove that for any fixed number of alternatives, when the number of voters $n$ is sufficiently large, the semi-random satisfaction of the Condorcet criterion under a wide range of voting rules is $1$, $1-\exp(-\Theta(n))$, $\Theta(n^{-0.5})$,
Stability and Deviation Optimal Risk Bounds with Convergence Rate O(1/n)
The sharpest known high probability generalization bounds for uniformly stable algorithms (Feldman, Vondrak, NeurIPS 2018, COLT, 2019), (Bousquet, Klochkov, Zhivotovskiy, COLT, 2020) contain a generally inevitable sampling error term of order $\Theta(1/\sqrt{n})$. When applied to excess risk bounds, this leads to suboptimal results in several standard stochastic convex optimization problems. We show that if the so-called Bernstein condition is satisfied, the term $\Theta(1/\sqrt{n})$ can be avoided, and high probability excess risk bounds of order up to $O(1/n)$ are possible via uniform stability. Using this result, we show a high probability excess risk bound with the rate $O(\log n/n)$ for strongly convex and Lipschitz losses valid for \emph{any} empirical risk minimization method.
Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space
Models of many real-life applications, such as queueing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$, and defined on a countably-infinite state-space $\mathcal X=\mathbb{Z}_+^d$, with finite action space $\mathcal A$, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queueing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.61)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.39)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.39)
Beyond the Birkhoff Polytope: Convex Relaxations for Vector Permutation Problems
The Birkhoff polytope (the convex hull of the set of permutation matrices), which is represented using $\Theta(n^2)$ variables and constraints, is frequently invoked in formulating relaxations of optimization problems over permutations. Using a recent construction of Goemans (2010), we show that when optimizing over the convex hull of the permutation vectors (the permutahedron), we can reduce the number of variables and constraints to $\Theta(n \log n)$ in theory and $\Theta(n \log^2 n)$ in practice. We modify the recent convex formulation of the 2-SUM problem introduced by Fogel et al. (2013) to use this polytope, and demonstrate how we can attain results of similar quality in significantly less computational time for large $n$. To our knowledge, this is the first usage of Goemans' compact formulation of the permutahedron in a convex optimization problem. We also introduce a simpler regularization scheme for this convex formulation of the 2-SUM problem that yields good empirical results.
A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise
We study the problem of PAC learning \gamma -margin halfspaces in the presence of Massart noise. Without computational considerations, the sample complexity of this learning problem is known to be \widetilde{\Theta}(1/(\gamma 2 \epsilon)) . Prior computationally efficient algorithms for the problem incur sample complexity \tilde{O}(1/(\gamma 4 \epsilon 3)) and achieve 0-1 error of \eta \epsilon, where \eta 1/2 is the upper bound on the noise rate.Recent work gave evidence of an information-computation tradeoff, suggesting that a quadratic dependence on 1/\epsilon is required for computationally efficient algorithms. Our main result is a computationally efficient learner with sample complexity \widetilde{\Theta}(1/(\gamma 2 \epsilon 2)), nearly matching this lower bound. In addition, our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses.
Reviews: On Neuronal Capacity
The main result of the paper is giving tight bounds on the number of n-variable boolean functions that can be expressed as degree d-polynomial threshold functions for any fixed d (d growing very mildly with n) also works. To me the rest of the results while interesting seem to be mostly easy applications of the key result to other more fashionable neural network models. However, the tightness of the main result does not translate to tight bounds when other neuroidal models are considered, because of the kind of non-linearities or weight constraints involved. The main result is highly non-trivial, the proof quite lengthy though elegant, and resolves a 25 year old open problem. Although the proof uses a lot of heavy mathematics, the key contribution seems to be generalizing random matrix theory to a random tensor theory -- the key result being that a large number of stochastically independent random vectors in low-dimension (and hence clearly not linearly indpendent) still yield a high degree of linear independence when tensorized.
Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs
Johnson, Daniel D., Tarlow, Daniel, Duvenaud, David, Maddison, Chris J.
Identifying how much a model ${\widehat{p}}_{\theta}(Y|X)$ knows about the stochastic real-world process $p(Y|X)$ it was trained on is important to ensure it avoids producing incorrect or "hallucinated" answers or taking unsafe actions. But this is difficult for generative models because probabilistic predictions do not distinguish between per-response noise (aleatoric uncertainty) and lack of knowledge about the process (epistemic uncertainty), and existing epistemic uncertainty quantification techniques tend to be overconfident when the model underfits. We propose a general strategy for teaching a model to both approximate $p(Y|X)$ and also estimate the remaining gaps between ${\widehat{p}}_{\theta}(Y|X)$ and $p(Y|X)$: train it to predict pairs of independent responses drawn from the true conditional distribution, allow it to "cheat" by observing one response while predicting the other, then measure how much it cheats. Remarkably, we prove that being good at cheating (i.e. cheating whenever it improves your prediction) is equivalent to being second-order calibrated, a principled extension of ordinary calibration that allows us to construct provably-correct frequentist confidence intervals for $p(Y|X)$ and detect incorrect responses with high probability. We demonstrate empirically that our approach accurately estimates how much models don't know across ambiguous image classification, (synthetic) language modeling, and partially-observable navigation tasks, outperforming existing techniques.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Portugal > Porto > Porto (0.04)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)