 logz


Information from coincidences

arXiv.org Machine Learning

We prove a single algebraic mixed coincidence identity that unifies a broad swath of information-theoretic variational results. For any family of priors $\{\pi_i\}$ and real exponents $\{\alpha_i\}$, the log of the mixed count $E_{x\sim\nu}\!\left[\prod_{i=1}^W \pi_i^{\alpha_i}(x)\right]$ is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. The identity yields a unified derivation of classical cornerstones of information theory: concentration of empirical distributions (Sanov-type decompositions and Gibbs conditioning), hypothesis-testing error exponents (Chernoff information and its multi-way analogue), change-of-measure inequalities (Donsker-Varadhan and PAC-Bayes), and laws governing rare-pattern coincidences (Erdős-Rényi run-length, iterative guesswork, rate-distortion, and birthday thresholds). Each is recovered as a specialization of the same algebraic equality. It strictly generalizes the classical Rényi entropy and divergence variational formulas (one and two priors respectively) to a $W$-prior simplex, and holds for unnormalized and continuum-indexed priors. Among its consequences are an exact multi-prior PAC-Bayes penalty that subtracts an explicit "coincidence bonus" from the usual single-prior posterior penalty, and the asymptotic MAP error exponent for $W$-ary hypothesis testing as an edge-restricted simplex optimum. We demonstrate the calculus at scale on two large alphabets encoding richly modeled sequential languages: on language-model next-token predictives where we recover contrastive decoding, and on human genomic regulatory sequence where it separates correlated from diverse prior families along a sliding-window trace.
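To make the mixed count concrete, the following is a minimal sketch on a three-symbol alphabet, not the paper's code. The function names, the example distributions, and the particular choices of base measure and exponents are assumptions made for illustration. It checks only two textbook special cases: with one prior and base measure equal to that prior, the log count is minus the order-2 Rényi entropy, and with two priors and exponents $(\alpha, -\alpha)$ under base measure $q$ it equals $(\alpha - 1)$ times the Rényi divergence of order $\alpha$.

```python
import math

def log_mixed_count(nu, priors, alphas):
    """log E_{x~nu}[ prod_i priors[i][x] ** alphas[i] ] on a finite alphabet."""
    total = 0.0
    for x, weight in enumerate(nu):
        if weight == 0.0:
            continue  # symbols outside the support of nu contribute nothing
        total += weight * math.prod(p[x] ** a for p, a in zip(priors, alphas))
    return math.log(total)

def renyi_entropy(p, order):
    """Renyi entropy of the given order (order != 1), in nats."""
    return math.log(sum(px ** order for px in p if px > 0)) / (1.0 - order)

def renyi_divergence(p, q, order):
    """Renyi divergence D_order(p || q) (order != 1), in nats; assumes q > 0."""
    s = sum(px ** order * qx ** (1.0 - order) for px, qx in zip(p, q))
    return math.log(s) / (order - 1.0)

# Hypothetical example distributions (illustration only).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
alpha = 0.7

# One prior, nu = p, exponent 1: log sum_x p(x)^2 = -H_2(p).
one_prior = log_mixed_count(p, [p], [1.0])

# Two priors, nu = q, exponents (alpha, -alpha):
# log sum_x p(x)^alpha q(x)^(1 - alpha) = (alpha - 1) * D_alpha(p || q).
two_prior = log_mixed_count(q, [p, q], [alpha, -alpha])
```

The same `log_mixed_count` call accepts any number of priors, which is the sense in which the one- and two-prior quantities sit inside a larger family.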


Near-Optimality of Contrastive Divergence Algorithms

Neural Information Processing Systems

We perform a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1/3})$ rate to the true parameter of the data distribution, we show, under some regularity assumptions, that CD can achieve the parametric rate $O(n^{-1/2})$. Our analysis provides results for various data batching schemes, including the fully online and minibatch ones. We additionally show that CD can be near-optimal, in the sense that its asymptotic variance is close to the Cramér-Rao lower bound.


Sparse Variational Inference: Bayesian Coresets from Scratch

Neural Information Processing Systems

This perspective leads to a novel construction via greedy optimization, and also provides a unifying information-geometric view of present and past methods. The proposed Riemannian coreset construction algorithm is fully automated, requiring no problem-specific inputs aside from the probabilistic model and dataset.




Near-Optimality of Contrastive Divergence Algorithms

arXiv.org Machine Learning

We perform a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1/3})$ rate to the true parameter of the data distribution, we show, under some regularity assumptions, that CD can achieve the parametric rate $O(n^{-1/2})$. Our analysis provides results for various data batching schemes, including the fully online and minibatch ones. We additionally show that CD can be near-optimal, in the sense that its asymptotic variance is close to the Cramér-Rao lower bound.


Dynamic Importance Sampling for Anytime Bounds of the Partition Function

Neural Information Processing Systems

Computing the partition function is a key inference task in many graphical models. In this paper, we propose a dynamic importance sampling scheme that provides anytime finite-sample bounds for the partition function. Our algorithm balances the advantages of the three major inference strategies, heuristic search, variational bounds, and Monte Carlo methods, blending sampling with search to refine a variationally defined proposal. Our algorithm combines and generalizes recent work on anytime search [16] and probabilistic bounds [15] of the partition function. By using an intelligently chosen weighted average over the samples, we construct an unbiased estimator of the partition function with strong finite-sample confidence intervals that inherit both the rapid early improvement rate of sampling and the long-term benefits of an improved proposal from search. This gives significantly improved anytime behavior, and more flexible trade-offs between memory, time, and solution quality. We demonstrate the effectiveness of our approach empirically on real-world problem instances taken from recent UAI competitions.
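As background for the unbiased estimator mentioned above, here is a minimal sketch of plain importance sampling for a partition function on a toy five-state model. It is not the paper's dynamic scheme: the proposal is fixed rather than refined by search, samples are averaged uniformly rather than with chosen weights, and no confidence bounds are computed. The weight table, proposal, and function names are hypothetical.

```python
import random

# Hypothetical unnormalized weights f(x) on states 0..4 (illustration only).
WEIGHTS = [1.0, 4.0, 2.0, 0.5, 2.5]
STATES = list(range(len(WEIGHTS)))

# Hypothetical fixed proposal q(x); it must be positive wherever f(x) > 0.
PROPOSAL = [0.1, 0.4, 0.2, 0.1, 0.2]

# Exact partition function Z = sum_x f(x), available only because the toy is tiny.
EXACT_Z = sum(WEIGHTS)

def importance_estimate(num_samples, rng):
    """Average of f(x) / q(x) over draws x ~ q; its expectation is Z."""
    draws = rng.choices(STATES, weights=PROPOSAL, k=num_samples)
    return sum(WEIGHTS[x] / PROPOSAL[x] for x in draws) / num_samples

# Unbiasedness, checked by exact enumeration: sum_x q(x) * f(x) / q(x) = Z.
expected_value = sum(PROPOSAL[x] * (WEIGHTS[x] / PROPOSAL[x]) for x in STATES)

estimate = importance_estimate(20000, random.Random(0))
```

The closer the proposal is to the normalized target, the smaller the variance of each ratio `f(x) / q(x)`, which is why improving the proposal over time can pay off in the long run.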


Softmax Attention with Constant Cost per Token

arXiv.org Artificial Intelligence

We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Our modification linearizes attention with exponential kernel feature maps, whose corresponding feature function is infinite dimensional. We show that our modification is expressible as a composition of log-sums of exponentials, with a latent space of constant size, enabling application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
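The constant-cost claim can be illustrated with a small sketch, written under stated assumptions rather than taken from the authors' implementation. If the score between a query and a key is the log of the dot product of their elementwise exponentials, then softmax weights are proportional to that dot product, so causal attention can be computed from running sums of fixed size. The scaling constant is omitted and the sums are kept in ordinary (not log) space for readability; the abstract describes a log-sum-exp formulation instead. All function names are hypothetical.

```python
import math

def quadratic_attention(queries, keys, values):
    """Causal softmax attention with scores log(exp(q) . exp(k)), computed pairwise."""
    outputs = []
    dim = len(values[0])
    for t, q in enumerate(queries):
        scores = [
            math.log(sum(math.exp(qi) * math.exp(ki) for qi, ki in zip(q, keys[s])))
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max before exponentiating
        weights = [math.exp(sc - peak) for sc in scores]
        norm = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / norm for j in range(dim)]
        )
    return outputs

def recurrent_attention(queries, keys, values):
    """Same outputs from a running state whose size does not grow with sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv_state = [[0.0] * d_v for _ in range(d_k)]  # running sum of exp(k_s) v_s^T
    k_state = [0.0] * d_k                          # running sum of exp(k_s)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = [math.exp(ki) for ki in k]
        for i in range(d_k):
            k_state[i] += phi_k[i]
            for j in range(d_v):
                kv_state[i][j] += phi_k[i] * v[j]
        phi_q = [math.exp(qi) for qi in q]
        denom = sum(a * b for a, b in zip(phi_q, k_state))
        outputs.append(
            [sum(phi_q[i] * kv_state[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs

# Hypothetical toy inputs: three tokens, key dimension 2, value dimension 2.
Q = [[0.1, -0.2], [0.3, 0.5], [-0.4, 0.2]]
K = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

slow = quadratic_attention(Q, K, V)
fast = recurrent_attention(Q, K, V)
```

Per token, `recurrent_attention` touches only `kv_state` and `k_state`, so its time and memory per step depend on the key and value dimensions and not on how many tokens came before.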


Top 10 Emerging Artificial Intelligence Startups in Israel

#artificialintelligence

Artificial intelligence (AI) has become ubiquitous across industry verticals. From boardroom discussions to trending news topics, artificial intelligence has captured the attention of tech enthusiasts worldwide. With organizations cashing in on the benefits of its application, this tech discipline has managed to live up to its hype. While the tech war between the USA, China, the European Union and other prominent nations escalates, Israel too aims to lead the race. Some surveys have found that Israel ranks among the top 5 countries in the world for AI solutions.


Finding the Bug in the Haystack with Machine Learning: Logz.io Exceptions in Kibana

#artificialintelligence

Logz.io is releasing its AI-powered Exceptions, a revamped version of our Application Insights, fully embedded in your Kibana Discover experience, to boost your troubleshooting and help you find bugs in the log haystack. How many of your production issues stem from bugs in code you deployed? The introduction of agile software methodology and its "release early, release often" mentality has exacerbated the problem, with more frequent code releases at earlier stages. How do you hunt down these bugs in production? How do you ensure that your deployed code hasn't caused any issues?