Optimization
On the Statistical Efficiency of $\ell_{1,p}$ Multi-Task Learning of Gaussian Graphical Models
Honorio, Jean, Jaakkola, Tommi, Samaras, Dimitris
We analyze the sufficient number of samples for the correct recovery of the support union and edge signs. We also analyze the necessary number of samples for any conceivable method by providing information-theoretic lower bounds. We compare the statistical efficiency of multi-task learning versus that of single-task learning. For experiments, we use a block coordinate descent method that is provably convergent and generates a sequence of positive definite solutions. We provide experimental validation on synthetic data as well as on two publicly available real-world data sets, including functional magnetic resonance imaging and gene expression data.
Efficient Blind Compressed Sensing Using Sparsifying Transforms with Convergence Guarantees and Application to MRI
Ravishankar, Saiprasad, Bresler, Yoram
Natural signals and images are well-known to be approximately sparse in transform domains such as Wavelets and DCT. This property has been heavily exploited in various applications in image processing and medical imaging. Compressed sensing exploits the sparsity of images or image patches in a transform domain or synthesis dictionary to reconstruct images from undersampled measurements. In this work, we focus on blind compressed sensing, where the underlying sparsifying transform is a priori unknown, and propose a framework to simultaneously reconstruct the underlying image as well as the sparsifying transform from highly undersampled measurements. The proposed block coordinate descent type algorithms involve highly efficient optimal updates. Importantly, we prove that although the proposed blind compressed sensing formulations are highly nonconvex, our algorithms are globally convergent (i.e., they converge from any initialization) to the set of critical points of the objectives defining the formulations. These critical points are guaranteed to be at least partial global and partial local minimizers. The exact point(s) of convergence may depend on initialization. We illustrate the usefulness of the proposed framework for magnetic resonance image reconstruction from highly undersampled k-space measurements. As compared to previous methods involving the synthesis dictionary model, our approach is much faster, while also providing promising reconstruction quality.
Generalized conditional gradient: analysis of convergence and applications
Rakotomamonjy, Alain, Flamary, Rรฉmi, Courty, Nicolas
The objectives of this technical report is to provide additional results on the generalized conditional gradient methods introduced by Bredies et al. [BLM05]. Indeed , when the objective function is smooth, we provide a novel certificate of optimality and we show that the algorithm has a linear convergence rate. Applications of this algorithm are also discussed.
Similarity Learning for High-Dimensional Sparse Data
Liu, Kuan, Bellet, Aurรฉlien, Sha, Fei
In many applications, such as text processing, computer vision or biology, data is represented as very highdimensional but sparse vectors. The ability to compute meaningful similarity scores between these objects is crucial to many tasks, such as classification, clustering or ranking. However, handcrafting a relevant similarity measure for such data is challenging because it is usually the case that only a small, often unknown subset of features is actually relevant to the task at hand. For instance, in drug discovery, chemical compounds can be represented as sparse features describing their 3D properties, and only a few of them play an role in determining whether the compound will bind to a target receptor (Guyon et al., 2004). In text classification, where each document is represented as a sparse bag of words, only a small subset of the words is generally sufficient to discriminate among documents of different topics. A principled way to obtain a similarity measure tailored to the problem of interest is to learn it from data. This line of research, known as similarity and distance metric learning, has been successfully applied to many application domains (see Kulis, 2012; Bellet et al., 2013, for recent surveys). The basic idea is to learn the parameters of a similarity (or distance) function such that it satisfies proximity-based constraints, requiring for instance that some data instance x be more similar to y than to z according to the learned function.
Regularization vs. Relaxation: A conic optimization perspective of statistical variable selection
Dong, Hongbo, Chen, Kun, Linderoth, Jeff
Variable selection is a fundamental task in statistical data analysis. Sparsity-inducing regularization methods are a popular class of methods that simultaneously perform variable selection and model estimation. The central problem is a quadratic optimization problem with an l0-norm penalty. Exactly enforcing the l0-norm penalty is computationally intractable for larger scale problems, so dif- ferent sparsity-inducing penalty functions that approximate the l0-norm have been introduced. In this paper, we show that viewing the problem from a convex relaxation perspective offers new insights. In particular, we show that a popular sparsity-inducing concave penalty function known as the Minimax Concave Penalty (MCP), and the reverse Huber penalty derived in a recent work by Pilanci, Wainwright and Ghaoui, can both be derived as special cases of a lifted convex relaxation called the perspective relaxation. The optimal perspective relaxation is a related minimax problem that balances the overall convexity and tightness of approximation to the l0 norm. We show it can be solved by a semidefinite relaxation. Moreover, a probabilistic interpretation of the semidefinite relaxation reveals connections with the boolean quadric polytope in combinatorial optimization. Finally by reformulating the l0-norm pe- nalized problem as a two-level problem, with the inner level being a Max-Cut problem, our proposed semidefinite relaxation can be realized by replacing the inner level problem with its semidefinite relaxation studied by Goemans and Williamson. This interpretation suggests using the Goemans-Williamson rounding procedure to find approximate solutions to the l0-norm penalized problem. Numerical experiments demonstrate the tightness of our proposed semidefinite relaxation, and the effectiveness of finding approximate solutions by Goemans-Williamson rounding.
Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits
Wu, Huasen, Srikant, R., Liu, Xin, Jiang, Chong
We study contextual bandits with budget and time constraints, referred to as constrained contextual bandits.The time and budget constraints significantly complicate the exploration and exploitation tradeoff because they introduce complex coupling among contexts over time.Such coupling effects make it difficult to obtain oracle solutions that assume known statistics of bandits. To gain insight, we first study unit-cost systems with known context distribution. When the expected rewards are known, we develop an approximation of the oracle, referred to Adaptive-Linear-Programming (ALP), which achieves near-optimality and only requires the ordering of expected rewards. With these highly desirable features, we then combine ALP with the upper-confidence-bound (UCB) method in the general case where the expected rewards are unknown {\it a priori}. We show that the proposed UCB-ALP algorithm achieves logarithmic regret except for certain boundary cases. Further, we design algorithms and obtain similar regret analysis results for more general systems with unknown context distribution and heterogeneous costs. To the best of our knowledge, this is the first work that shows how to achieve logarithmic regret in constrained contextual bandits. Moreover, this work also sheds light on the study of computationally efficient algorithms for general constrained contextual bandits.
Piecewise-Linear Approximation for Feature Subset Selection in a Sequential Logit Model
Sato, Toshiki, Takano, Yuichi, Miyashiro, Ryuhei
This paper concerns a method of selecting a subset of features for a sequential logit model. Tanaka and Nakagawa (2014) proposed a mixed integer quadratic optimization formulation for solving the problem based on a quadratic approximation of the logistic loss function. However, since there is a significant gap between the logistic loss function and its quadratic approximation, their formulation may fail to find a good subset of features. To overcome this drawback, we apply a piecewise-linear approximation to the logistic loss function. Accordingly, we frame the feature subset selection problem of minimizing an information criterion as a mixed integer linear optimization problem. The computational results demonstrate that our piecewise-linear approximation approach found a better subset of features than the quadratic approximation approach.
An optimal randomized incremental gradient method
In this paper, we consider a class of finite-sum convex optimization problems whose objective function is given by the summation of $m$ ($\ge 1$) smooth components together with some other relatively simple terms. We first introduce a deterministic primal-dual gradient (PDG) method that can achieve the optimal black-box iteration complexity for solving these composite optimization problems using a primal-dual termination criterion. Our major contribution is to develop a randomized primal-dual gradient (RPDG) method, which needs to compute the gradient of only one randomly selected smooth component at each iteration, but can possibly achieve better complexity than PDG in terms of the total number of gradient evaluations. More specifically, we show that the total number of gradient evaluations performed by RPDG can be ${\cal O} (\sqrt{m})$ times smaller, both in expectation and with high probability, than those performed by deterministic optimal first-order methods under favorable situations. We also show that the complexity of the RPDG method is not improvable by developing a new lower complexity bound for a general class of randomized methods for solving large-scale finite-sum convex optimization problems. Moreover, through the development of PDG and RPDG, we introduce a novel game-theoretic interpretation for these optimal methods for convex optimization.
Accelerating Optimization via Adaptive Prediction
We present a powerful general framework for designing data-dependent optimization algorithms, building upon and unifying recent techniques in adaptive regularization, optimistic gradient predictions, and problem-dependent randomization. We first present a series of new regret guarantees that hold at any time and under very minimal assumptions, and then show how different relaxations recover existing algorithms, both basic as well as more recent sophisticated ones. Finally, we show how combining adaptivity, optimism, and problem-dependent randomization can guide the design of algorithms that benefit from more favorable guarantees than recent state-of-the-art methods.
Functional Frank-Wolfe Boosting for General Loss Functions
Wang, Chu, Wang, Yingfei, E, Weinan, Schapire, Robert
Boosting is a generic learning method for classification and regression. Yet, as the number of base hypotheses becomes larger, boosting can lead to a deterioration of test performance. Overfitting is an important and ubiquitous phenomenon, especially in regression settings. To avoid overfitting, we consider using $l_1$ regularization. We propose a novel Frank-Wolfe type boosting algorithm (FWBoost) applied to general loss functions. By using exponential loss, the FWBoost algorithm can be rewritten as a variant of AdaBoost for binary classification. FWBoost algorithms have exactly the same form as existing boosting methods, in terms of making calls to a base learning algorithm with different weights update. This direct connection between boosting and Frank-Wolfe yields a new algorithm that is as practical as existing boosting methods but with new guarantees and rates of convergence. Experimental results show that the test performance of FWBoost is not degraded with larger rounds in boosting, which is consistent with the theoretical analysis.