Falahatgar, Moein
Maxing and Ranking with Few Assumptions
Falahatgar, Moein, Hao, Yi, Orlitsky, Alon, Pichapati, Venkatadheeraj, Ravindrakumar, Vaishakh
PAC maximum selection (maxing) and ranking of $n$ elements via random pairwise comparisons have diverse applications and have been studied under many models and assumptions. With just one simple, natural assumption, strong stochastic transitivity, we show that maxing can be performed with linearly many comparisons, yet ranking requires quadratically many. With no assumptions at all, we show that for the Borda-score metric, maximum selection can be performed with linearly many comparisons and ranking with $\mathcal{O}(n\log n)$ comparisons.
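For concreteness, the Python sketch below illustrates the Borda-score notion: an item's Borda score is its chance of beating a uniformly random opponent, and a naive selector that estimates each score from a fixed number of random comparisons uses linearly many comparisons in total. The `compare` callback, the `per_item` budget, and the estimate-then-argmax rule are illustrative assumptions, not the paper's algorithm or its PAC guarantee.

```python
import random

def borda_max(items, compare, per_item=50, rng=random):
    """Naive Borda-style maximum selection (illustrative, not the paper's algorithm).

    compare(i, j) should return True when i beats j in one (possibly noisy)
    pairwise comparison.  Each item's Borda score, i.e. its chance of beating
    a uniformly random opponent, is estimated from `per_item` random
    comparisons, and the item with the highest estimate is returned.
    Total comparisons used: len(items) * per_item, i.e. linear in n.
    """
    scores = {}
    for i in items:
        opponents = [j for j in items if j != i]
        wins = sum(compare(i, rng.choice(opponents)) for _ in range(per_item))
        scores[i] = wins / per_item
    return max(scores, key=scores.get)

# Example with a deterministic comparator (the larger number always wins):
# borda_max(list(range(10)), lambda i, j: i > j)  ->  9
```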
The power of absolute discounting: all-dimensional distribution estimation
Falahatgar, Moein, Ohannessian, Mesrob I., Orlitsky, Alon, Pichapati, Venkatadheeraj
Categorical models are a natural fit for many problems. When learning the distribution of categories from samples, high dimensionality may dilute the data. Minimax optimality is too pessimistic to remedy this issue. A serendipitously discovered estimator, absolute discounting, corrects empirical frequencies by subtracting a constant from observed categories, which it then redistributes among the unobserved. It outperforms classical estimators empirically, and has been used extensively in natural language modeling. In this paper, we rigorously explain the prowess of this estimator using less pessimistic notions. We show that (1) absolute discounting recovers classical minimax KL-risk rates, (2) it is adaptive to an effective dimension rather than the true dimension, and (3) it is strongly related to the Good-Turing estimator and inherits its competitive properties. We use power-law distributions as the cornerstone of these results.
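As a concrete illustration of the mechanism the abstract describes, the Python sketch below subtracts a fixed discount from every observed count and spreads the withheld mass uniformly over the unseen categories. The function name, the default discount of 0.5, and the array-of-counts interface are assumptions made for illustration, not the paper's estimator tuning.

```python
import numpy as np

def absolute_discounting(counts, delta=0.5):
    """Absolute-discounting estimate over a k-category alphabet.

    counts: length-k array of sample counts (zeros mark unseen categories).
    delta:  discount in (0, 1) subtracted from each observed count.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()                                  # sample size
    seen = counts > 0
    unseen = ~seen
    probs = np.zeros_like(counts)
    probs[seen] = (counts[seen] - delta) / n          # discounted empirical frequencies
    if unseen.any():                                  # redistribute the withheld mass uniformly
        probs[unseen] = delta * seen.sum() / (n * unseen.sum())
    return probs

# Example: counts over k = 5 categories, two of them observed.
# absolute_discounting([3, 1, 0, 0, 0])
# -> [0.625, 0.125, 0.0833..., 0.0833..., 0.0833...]   (sums to 1)
```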
Near-Optimal Smoothing of Structured Conditional Probability Matrices
Falahatgar, Moein, Ohannessian, Mesrob I., Orlitsky, Alon
Utilizing the structure of a probabilistic model can significantly increase its learning speed. Motivated by several recent applications, in particular bigram models in language processing, we consider learning low-rank conditional probability matrices under expected KL-risk. This choice makes smoothing, that is, the careful handling of low-probability elements, paramount. We derive an iterative algorithm that extends classical non-negative matrix factorization to naturally incorporate additive smoothing, and prove that it converges to the stationary points of a penalized empirical risk. We then derive sample-complexity bounds for the global minimizer of the penalized risk and show that it is within a small factor of the optimal sample complexity. This framework generalizes to more sophisticated smoothing techniques, including absolute discounting.
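The sketch below only shows the two ingredients the abstract combines: add-alpha (additive) smoothing of a bigram count matrix followed by classical multiplicative-update NMF under KL divergence, with rows renormalized into conditional distributions. The paper's algorithm couples smoothing and factorization through a penalized risk rather than this two-step baseline; `alpha`, the iteration count, and the initialization here are illustrative assumptions.

```python
import numpy as np

def smoothed_kl_nmf(C, rank, alpha=0.1, iters=200, seed=0):
    """Two-step baseline: add-alpha smoothing, then KL-divergence NMF.

    C is a matrix of bigram counts (rows: contexts, columns: next words).
    The counts are smoothed additively, factored into W @ H of the given
    rank with the classical Lee-Seung multiplicative updates for the KL
    objective, and the reconstruction is renormalized so that each row
    is a conditional probability distribution.
    """
    rng = np.random.default_rng(seed)
    V = np.asarray(C, dtype=float) + alpha            # additive (add-alpha) smoothing
    m, n = V.shape
    W = rng.random((m, rank)) + 1e-3
    H = rng.random((rank, n)) + 1e-3
    for _ in range(iters):
        WH = W @ H + 1e-12
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + 1e-12)
        WH = W @ H + 1e-12
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + 1e-12)
    P = W @ H
    return P / P.sum(axis=1, keepdims=True)           # row-stochastic estimate
```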