AITopics | Wei, Chen-Yu

Collaborating Authors

Wei, Chen-Yu

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

Wei, Chen-Yu, Jafarnia-Jahromi, Mehdi, Luo, Haipeng, Jain, Rahul

arXiv.org Machine LearningJul-23-2020

We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).

algorithm, artificial intelligence, fuzzy logic, (17 more...)

arXiv.org Machine Learning

2007.11849

Country: North America > United States > California (0.14)

Genre:

Workflow (0.67)
Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

Jafarnia-Jahromi, Mehdi, Wei, Chen-Yu, Jain, Rahul, Luo, Haipeng

arXiv.org Machine LearningJun-8-2020

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt T)$ regret without the ergodic assumption, and matches the lower bound in terms of $T$ except for logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.

artificial intelligence, infinite-horizon average-reward mdp, reinforcement learning, (3 more...)

arXiv.org Machine Learning

2006.04354

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)

Add feedback

Taking a hint: How to leverage loss predictors in contextual bandits?

Wei, Chen-Yu, Luo, Haipeng, Agarwal, Alekh

arXiv.org Machine LearningMar-4-2020

We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $\mathcal{O}(\sqrt{T})$ for learning over $T$ rounds, when the total error of the predictor $\mathcal{E} \leq T$ is relatively small. We provide a complete answer to this question, including upper and lower bounds for various settings: adversarial versus stochastic environments, known versus unknown $\mathcal{E}$, and single versus multiple predictors. We show several surprising results, such as 1) the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}T^\frac{1}{4}\})$ when $\mathcal{E}$ is known, a sharp contrast to the standard and better bound $\mathcal{O}(\sqrt{\mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $\mathcal{E}$ is unknown, but as a remedy, $\mathcal{O}(\sqrt{\mathcal{E}}T^\frac{1}{3})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even if logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key action remapping technique for optimal regret with known $\mathcal{E}$, 2) implementing Catoni's robust mean estimator efficiently via an ERM oracle leading to an efficient algorithm in the stochastic setting with optimal regret, 3) constructing an underestimator for $\mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $\mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.

artificial intelligence, big data, predictor, (20 more...)

arXiv.org Machine Learning

2003.01922

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.66)

Add feedback

Online Reinforcement Learning in Stochastic Games

Wei, Chen-Yu, Hong, Yi-Te, Lu, Chi-Jen

Neural Information Processing SystemsFeb-14-2020, 16:43:08 GMT

We study online reinforcement learning in average-reward stochastic games (SGs). An SG models a two-player zero-sum game in a Markov environment, where state transitions and one-step payoffs are determined simultaneously by a learner and an adversary. We propose the \textsc{UCSG} algorithm that achieves a sublinear regret compared to the game value when competing with an arbitrary opponent. This result improves previous ones under the same setting. The regret bound has a dependency on the \textit{diameter}, which is an intrinsic value related to the mixing property of SGs.

artificial intelligence, machine learning, reinforcement learning, (5 more...)

Neural Information Processing Systems

Genre: Instructional Material > Online (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)

Add feedback

Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Wei, Chen-Yu, Jafarnia-Jahromi, Mehdi, Luo, Haipeng, Sharma, Hiteshi, Jain, Rahul

arXiv.org Artificial IntelligenceOct-15-2019

Model-free reinforcement learning is known to be memory and computation efficient and more amendable to large scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves $\mathcal{O}(T^{2/3})$ regret after $T$ steps, under the minimal assumption of weakly communicating MDPs. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to $\mathcal{O}(\sqrt{T})$, albeit with a stronger ergodic assumption. To the best of our knowledge, these are the first model-free algorithms with sub-linear regret (that is polynomial in all parameters) in the infinite-horizon average-reward setting.

artificial intelligence, null, survey article, (19 more...)

arXiv.org Artificial Intelligence

1910.07072

Country: North America > United States > California (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.84)

Add feedback

Analyzing the Variance of Policy Gradient Estimators for the Linear-Quadratic Regulator

Preiss, James A., Arnold, Sébastien M. R., Wei, Chen-Yu, Kloft, Marius

arXiv.org Machine LearningOct-2-2019

We study the variance of the REINFORCE policy gradient estimator in environments with continuous state and action spaces, linear dynamics, quadratic cost, and Gaussian noise. These simple environments allow us to derive bounds on the estimator variance in terms of the environment and noise parameters. We compare the predictions of our bounds to the empirical variance in simulation experiments.

artificial intelligence, null, optimization problem, (18 more...)

arXiv.org Machine Learning

1910.01249

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Add feedback

Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case

Beygelzimer, Alina, Pál, Dávid, Szörényi, Balázs, Thiruvenkatachari, Devanathan, Wei, Chen-Yu, Zhang, Chicheng

arXiv.org Machine LearningFeb-6-2019

We study the problem of efficient online multiclass linear classification with bandit feedback, where all examples belong to one of $K$ classes and lie in the $d$-dimensional Euclidean space. Previous works have left open the challenge of designing efficient algorithms with finite mistake bounds when the data is linearly separable by a margin $\gamma$. In this work, we take a first step towards this problem. We consider two notions of linear separability, \emph{strong} and \emph{weak}. 1. Under the strong linear separability condition, we design an efficient algorithm that achieves a near-optimal mistake bound of $O\left( K/\gamma^2 \right)$. 2. Under the more challenging weak linear separability condition, we design an efficient algorithm with a mistake bound of $\min (2^{\widetilde{O}(K \log^2 (1/\gamma))}, 2^{\widetilde{O}(\sqrt{1/\gamma} \log K)})$. Our algorithm is based on kernel Perceptron, which is inspired by the work of \citet{Klivans-Servedio-2008} on improperly learning intersection of halfspaces.

algorithm, artificial intelligence, neural network, (18 more...)

arXiv.org Machine Learning

1902.02244

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report (0.64)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.34)

Add feedback

A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

Chen, Yifang, Lee, Chung-Wei, Luo, Haipeng, Wei, Chen-Yu

arXiv.org Machine LearningFeb-5-2019

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret $\mathcal{O}(\min\{\sqrt{ST}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})$ for a contextual bandit problem with $T$ rounds, $S$ switches and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O}(\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al, 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce replay phases, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.

algorithm, artificial intelligence, big data, (18 more...)

arXiv.org Machine Learning

1902.0098

Country: North America > United States > California (0.14)

Genre: Research Report (0.70)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Improved Path-length Regret Bounds for Bandits

Bubeck, Sébastien, Li, Yuanzhi, Luo, Haipeng, Wei, Chen-Yu

arXiv.org Machine LearningJan-29-2019

We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit. We first show that the seemingly suboptimal path-length bound of (Wei and Luo, 2018) is in fact not improvable for adaptive adversary. Despite this negative result, we then develop two new algorithms, one that strictly improves over (Wei and Luo, 2018) with a smaller path-length measure, and the other which improves over (Wei and Luo, 2018) for oblivious adversary when the path-length is large. Our algorithms are based on the well-studied optimistic mirror descent framework, but importantly with several novel techniques, including new optimistic predictions, a slight bias towards recently selected arms, and the use of a hybrid regularizer similar to that of (Bubeck et al., 2018). Furthermore, we extend our results to linear bandit by showing a reduction to obtaining dynamic regret for a full-information problem, followed by a further reduction to convex body chasing. We propose a simple greedy chasing algorithm for squared 2-norm, leading to new dynamic regret results and as a consequence the first path-length regret for general linear bandit as well.

algorithm, artificial intelligence, big data, (19 more...)

arXiv.org Machine Learning

1901.10604

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.67)

Add feedback

Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

Zimmert, Julian, Luo, Haipeng, Wei, Chen-Yu

arXiv.org Machine LearningJan-25-2019

We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$. The leading problem-dependent constants of our bounds are not only optimal in some worst-case sense studied previously, but also optimal for two concrete instances of semi-bandit problems. Our algorithm and analysis extend the recent work of (Zimmert & Seldin, 2019) for the special case of multi-armed bandit, but importantly requires a novel hybrid regularizer designed specifically for semi-bandit. Experimental results on synthetic data show that our algorithm indeed performs well uniformly over different environments. We finally provide a preliminary extension of our results to the full bandit feedback.

algorithm, artificial intelligence, big data, (19 more...)

arXiv.org Machine Learning

1901.08779

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.89)

Add feedback