AITopics | Luo, Zhi-Quan

Collaborating Authors

Luo, Zhi-Quan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Adam Can Converge Without Any Modification On Update Rules

Zhang, Yushun, Chen, Congliang, Shi, Naichen, Sun, Ruoyu, Luo, Zhi-Quan

arXiv.org Artificial IntelligenceJan-13-2023

Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$; while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. Due to this observation, we conjecture that the empirical convergence can be theoretically justified, only if we change the order of picking the problem and hyperparameter. In this work, we confirm this conjecture. We prove that, when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}<1$, Adam converges to the neighborhood of critical points. The size of the neighborhood is propositional to the variance of stochastic gradients. Under an extra condition (strong growth condition), Adam converges to critical points. It is worth mentioning that our results cover a wide range of hyperparameters: as $\beta_2$ increases, our convergence result can cover any $\beta_1 \in [0,1)$ including $\beta_1=0.9$, which is the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge without any modification on its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded 2nd-order momentum. When $\beta_2$ is small, we further point out a large region of $(\beta_1,\beta_2)$ where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence when increasing $\beta_2$. These positive and negative results can provide suggestions on how to tune Adam hyperparameters.

artificial intelligence, machine learning, modification, (2 more...)

arXiv.org Artificial Intelligence

2208.09632

Genre: Research Report > New Finding (0.53)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Adversarial Rademacher Complexity of Deep Neural Networks

Xiao, Jiancong, Fan, Yanbo, Sun, Ruoyu, Luo, Zhi-Quan

arXiv.org Artificial IntelligenceNov-27-2022

Deep neural networks are vulnerable to adversarial attacks. Ideally, a robust model shall perform well on both the perturbed training data and the unseen perturbed test data. It is found empirically that fitting perturbed training data is not hard, but generalizing to perturbed test data is quite difficult. To better understand adversarial generalization, it is of great interest to study the adversarial Rademacher complexity (ARC) of deep neural networks. However, how to bound ARC in multi-layers cases is largely unclear due to the difficulty of analyzing adversarial loss in the definition of ARC. There have been two types of attempts of ARC. One is to provide the upper bound of ARC in linear and one-hidden layer cases. However, these approaches seem hard to extend to multi-layer cases. Another is to modify the adversarial loss and provide upper bounds of Rademacher complexity on such surrogate loss in multi-layer cases. However, such variants of Rademacher complexity are not guaranteed to be bounds for meaningful robust generalization gaps (RGG). In this paper, we provide a solution to this unsolved problem. Specifically, we provide the first bound of adversarial Rademacher complexity of deep neural networks. Our approach is based on covering numbers. We provide a method to handle the robustify function classes of DNNs such that we can calculate the covering numbers. Finally, we provide experiments to study the empirical implication of our bounds and provide an analysis of poor adversarial generalization.

artificial intelligence, machine learning, rademacher complexity, (13 more...)

arXiv.org Artificial Intelligence

2211.14966

Country: Asia > China (0.28)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (0.34)
Government (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Optimally Combining Classifiers for Semi-Supervised Learning

Wang, Zhiguo, Yang, Liusha, Yin, Feng, Lin, Ke, Shi, Qingjiang, Luo, Zhi-Quan

arXiv.org Machine LearningJun-7-2020

This paper considers semi-supervised learning for tabular data. It is widely known that Xgboost based on tree model works well on the heterogeneous features while transductive support vector machine can exploit the low density separation assumption. However, little work has been done to combine them together for the end-to-end semi-supervised learning. In this paper, we find these two methods have complementary properties and larger diversity, which motivates us to propose a new semi-supervised learning method that is able to adaptively combine the strengths of Xgboost and transductive support vector machine. Instead of the majority vote rule, an optimization problem in terms of ensemble weight is established, which helps to obtain more accurate pseudo labels for unlabeled data. The experimental results on the UCI data sets and real commercial data set demonstrate the superior classification performance of our method over the five state-of-the-art algorithms improving test accuracy by about $3\%-4\%$. The partial code can be found at https://github.com/hav-cam-mit/CTO.

artificial intelligence, classifier, machine learning, (16 more...)

arXiv.org Machine Learning

2006.04097

Country:

Asia > China > Guangdong Province (0.15)
North America > United States > Wisconsin (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.95)

Add feedback

On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

Hou, Ke, Zhou, Zirui, So, Anthony Man-Cho, Luo, Zhi-Quan

Neural Information Processing SystemsFeb-14-2020, 15:13:40 GMT

Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm-regularized problem, which may be of independent interest.

artificial intelligence, machine learning, trace norm regularization, (3 more...)

Neural Information Processing Systems

Genre: Research Report (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.56)

Add feedback

A General $\mathcal{O}(n^2)$ Hyper-Parameter Optimization for Gaussian Process Regression with Cross-Validation and Non-linearly Constrained ADMM

Xu, Linning, Yin, Feng, Zhang, Jiawei, Luo, Zhi-Quan, Cui, Shuguang

arXiv.org Machine LearningJun-6-2019

Hyper-parameter optimization remains as the core issue of Gaussian process (GP) for machine learning nowadays. The benchmark method using maximum likelihood (ML) estimation and gradient descent (GD) is impractical for processing big data due to its $O(n^3)$ complexity. Many sophisticated global or local approximation models, for instance, sparse GP, distributed GP, have been proposed to address such complexity issue. In this paper, we propose two novel and general-purpose GP hyper-parameter training schemes (GPCV-ADMM) by replacing ML with cross-validation (CV) as the fitting criterion and replacing GD with a non-linearly constrained alternating direction method of multipliers (ADMM) as the optimization method. The proposed schemes are of $O(n^2)$ complexity for any covariance matrix without special structure. We conduct various experiments based on both synthetic and real data sets, wherein the proposed schemes show excellent performance in terms of convergence, hyper-parameter estimation accuracy, and computational time in comparison with the traditional ML based routines given in the GPML toolbox.

artificial intelligence, bayesian inference, complexity, (18 more...)

arXiv.org Machine Learning

1906.02387

Country:

North America > United States > Massachusetts > Middlesex County (0.14)
Asia > China > Guangdong Province (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (0.61)
(2 more...)

Add feedback

Bilinear Factor Matrix Norm Minimization for Robust PCA: Algorithms and Applications

Shang, Fanhua, Cheng, James, Liu, Yuanyuan, Luo, Zhi-Quan, Lin, Zhouchen

arXiv.org Machine LearningOct-11-2018

The heavy-tailed distributions of corrupted outliers and singular values of all channels in low-level vision have proven effective priors for many applications such as background modeling, photometric stereo and image alignment. And they can be well modeled by a hyper-Laplacian. However, the use of such distributions generally leads to challenging non-convex, non-smooth and non-Lipschitz problems, and makes existing algorithms very slow for large-scale applications. Together with the analytic solutions to lp-norm minimization with two specific values of p, i.e., p=1/2 and p=2/3, we propose two novel bilinear factor matrix norm minimization models for robust principal component analysis. We first define the double nuclear norm and Frobenius/nuclear hybrid norm penalties, and then prove that they are in essence the Schatten-1/2 and 2/3 quasi-norms, respectively, which lead to much more tractable and scalable Lipschitz optimization problems. Our experimental analysis shows that both our methods yield more accurate solutions than original Schatten quasi-norm minimization, even when the number of observations is very limited. Finally, we apply our penalties to various low-level vision problems, e.g., text removal, moving object detection, image alignment and inpainting, and show that our methods usually outperform the state-of-the-art methods.

health & medicine, matrix, survey article, (19 more...)

arXiv.org Machine Learning

1810.05186

Country:

Asia > China (0.93)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.27)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Vision (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.87)

Add feedback

Computational Intractability of Dictionary Learning for Sparse Representation

Razaviyayn, Meisam, Tseng, Hung-Wei, Luo, Zhi-Quan

arXiv.org Machine LearningNov-5-2015

In this paper we consider the dictionary learning problem for sparse representation. We first show that this problem is NP-hard by polynomial time reduction of the densest cut problem. Then, using successive convex approximation strategies, we propose efficient dictionary learning schemes to solve several practical formulations of this problem to stationary points. Unlike many existing algorithms in the literature, such as K-SVD, our proposed dictionary learning scheme is theoretically guaranteed to converge to the set of stationary points under certain mild assumptions. For the image denoising application, the performance and the efficiency of the proposed dictionary learning scheme are comparable to that of K-SVD algorithm in simulation.

algorithm, artificial intelligence, optimization problem, (16 more...)

arXiv.org Machine Learning

1511.01776

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.50)

Industry: Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)

Add feedback

Parallel Direction Method of Multipliers

Wang, Huahua, Banerjee, Arindam, Luo, Zhi-Quan

Neural Information Processing SystemsDec-31-2014

We consider the problem of minimizing block-separable convex functions subject to linear constraints. While the Alternating Direction Method of Multipliers (ADMM) for two-block linear constraints has been intensively studied both theoretically and empirically, in spite of some preliminary work, effective generalizations of ADMM to multiple blocks is still unclear. In this paper, we propose a parallel randomized block coordinate method named Parallel Direction Method of Multipliers (PDMM) to solve the optimization problems with multi-block linear constraints. PDMM randomly updates some blocks in parallel, behaving like parallel randomized block coordinate descent. We establish the global convergence and the iteration complexity for PDMM with constant step size. We also show that PDMM can do randomized block coordinate descent on overlapping blocks. Experimental results show that PDMM performs better than state-of-the-arts methods in two applications, robust principal component analysis and overlapping group lasso.

artificial intelligence, optimization problem, pdmm, (15 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)

Add feedback

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Razaviyayn, Meisam, Hong, Mingyi, Luo, Zhi-Quan, Pang, Jong-Shi

Neural Information Processing SystemsDec-31-2014

Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With the recent advances in the developments of the multi-core parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration of the algorithm. In this work, we propose an inexact parallel BCD approach where at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules. We analyze the asymptotic and non-asymptotic convergence behavior of the algorithm for both convex and non-convex objective functions. The numerical experiments suggest that for a special case of Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.

algorithm, artificial intelligence, optimization problem, (13 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Technology: