Goto

Collaborating Authors

 itemize


Entropy Rate Estimation for Markov Chains with Large State Space

Neural Information Processing Systems

Entropy estimation is one of the prototypical problems in distribution property testing. To consistently estimate the Shannon entropy of a distribution on $S$ elements with independent samples, the optimal sample complexity scales sublinearly with $S$ as $\Theta(\frac{S}{\log S})$ as shown by Valiant and Valiant \cite{Valiant--Valiant2011}. Extending the theory and algorithms for entropy estimation to dependent data, this paper considers the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations.


Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Neural Information Processing Systems

We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let $\opt < 1$ be the population loss of the best-fitting ReLU. We prove: \begin{itemize} \item Finding a ReLU with square-loss $\opt + \epsilon$ is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply --{\em unconditionally}-- that gradient descent cannot converge to the global minimum in polynomial time.


An Optimal Elimination Algorithm for Learning a Best Arm

Neural Information Processing Systems

We consider the classic problem of $(\epsilon,\delta)$-\texttt{PAC} learning a best arm where the goal is to identify with confidence $1-\delta$ an arm whose mean is an $\epsilon$-approximation to that of the highest mean arm in a multi-armed bandit setting. This problem is one of the most fundamental problems in statistics and learning theory, yet somewhat surprisingly its worst case sample complexity is not well understood. In this paper we propose a new approach for $(\epsilon,\delta)$-\texttt{PAC} learning a best arm. This approach leads to an algorithm whose sample complexity converges to \emph{exactly} the optimal sample complexity of $(\epsilon,\delta)$-learning the mean of $n$ arms separately and we complement this result with a conditional matching lower bound.


Entropy Rate Estimation for Markov Chains with Large State Space

Neural Information Processing Systems

Entropy estimation is one of the prototypical problems in distribution property testing. To consistently estimate the Shannon entropy of a distribution on $S$ elements with independent samples, the optimal sample complexity scales sublinearly with $S$ as $\Theta(\frac{S}{\log S})$ as shown by Valiant and Valiant \cite{Valiant--Valiant2011}. Extending the theory and algorithms for entropy estimation to dependent data, this paper considers the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations.


Review for NeurIPS paper: Deep Smoothing of the Implied Volatility Surface

Neural Information Processing Systems

Additional Feedback: \section*{Comments} \subsection*{General comments} From a theoretical standpoint, the two novelties that the authors claim to introduce -combining a prior model with a neural network and using soft constraints for the no arbitrage conditions, are of critical importance and make this manuscript worthy of a publication to me. From a practical standpoint, the implementation details -whether regarding the loss function, the neural network inputs or the training set, could use more work before producing a useful implied volatility surface superior to the ones produced by more classical techniques. A comparison with these techniques would have been most welcome. They should at least weight those observations by: 1/The bid/ask spread (no need to fit perfectly when the market is wide, and focus on ATM options which are usually tighter than the wings) and 2/ The vega' \frac{\partial p}{\partial \sigma} of each options, which penalizes errors on options which are highly sensitive to \sigma more than ones which are less sensitive to it. Using an equivalent condition in price space might make it easier to write a more realistic version.


An Optimal Elimination Algorithm for Learning a Best Arm

Neural Information Processing Systems

We consider the classic problem of (\epsilon,\delta) -\texttt{PAC} learning a best arm where the goal is to identify with confidence 1-\delta an arm whose mean is an \epsilon -approximation to that of the highest mean arm in a multi-armed bandit setting. This problem is one of the most fundamental problems in statistics and learning theory, yet somewhat surprisingly its worst case sample complexity is not well understood. In this paper we propose a new approach for (\epsilon,\delta) -\texttt{PAC} learning a best arm. This approach leads to an algorithm whose sample complexity converges to \emph{exactly} the optimal sample complexity of (\epsilon,\delta) -learning the mean of n arms separately and we complement this result with a conditional matching lower bound. Elimination algorithms is a broad class of (\epsilon,\delta) -\texttt{PAC} best arm learning algorithms that includes many algorithms in the literature.


Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Neural Information Processing Systems

We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let \opt 1 be the population loss of the best-fitting ReLU. We prove: \begin{itemize} \item Finding a ReLU with square-loss \opt \epsilon is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply --{\em unconditionally}-- that gradient descent cannot converge to the global minimum in polynomial time. The algorithm uses a novel reduction to noisy halfspace learning with respect to 0/1 loss.


Machine Learning Data Scientist at Itemize

#artificialintelligence

The Machine Learning Data Scientist will join a team of engineers innovating new techniques for applying Artificial Intelligence, Machine Learning, and data analytics techniques to transform unstructured images and files into financial data sets.


Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Neural Information Processing Systems

We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let $\opt 1$ be the population loss of the best-fitting ReLU. We prove: \begin{itemize} \item Finding a ReLU with square-loss $\opt \epsilon$ is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply --{\em unconditionally}-- that gradient descent cannot converge to the global minimum in polynomial time. The algorithm uses a novel reduction to noisy halfspace learning with respect to $0/1$ loss.


Entropy Rate Estimation for Markov Chains with Large State Space

Neural Information Processing Systems

Entropy estimation is one of the prototypical problems in distribution property testing. To consistently estimate the Shannon entropy of a distribution on $S$ elements with independent samples, the optimal sample complexity scales sublinearly with $S$ as $\Theta(\frac{S}{\log S})$ as shown by Valiant and Valiant \cite{Valiant--Valiant2011}. Extending the theory and algorithms for entropy estimation to dependent data, this paper considers the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations. In comparison, the empirical entropy rate requires at least $\Omega(S 2)$ samples to be consistent, even when the Markov chain is memoryless. In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.