Park, Hongju
Graph Canonical Correlation Analysis
Park, Hongju, Bai, Shuyang, Ye, Zhenyao, Lee, Hwiyoung, Ma, Tianzhou, Chen, Shuo
CCA considers the following maximization problem: $\max_{a,b}\; a^\top X^\top Y b$ subject to $a^\top X^\top X a = 1$ and $b^\top Y^\top Y b = 1$, where the vectors $a$ and $b$ and the attained correlation are called the canonical vectors and the canonical correlation if they solve the above maximization. In classical canonical correlation analysis, the canonical vectors $a$ and $b$ have nonzero loadings for all $X$ and $Y$ variables. However, in a high-dimensional setting with $p, q \gg n$, the goal is to identify which subsets of $X$ are associated with which subsets of $Y$ and to estimate the strength of these associations, since the canonical correlation computed from the full dataset is inflated by estimation bias caused by overfitting. To ensure sparsity, shrinkage methods are commonly used. For example, Witten et al. (2009) propose sparse canonical correlation analysis (sCCA). The sCCA criterion can in general be expressed as $\max_{a,b}\; a^\top X^\top Y b$ subject to $a^\top X^\top X a \le 1$, $b^\top Y^\top Y b \le 1$, $P_1(a) \le k_1$, and $P_2(b) \le k_2$, where $P_1$ and $P_2$ are convex penalty functions for $a$ and $b$ with positive constants $k_1$ and $k_2$, respectively. A representative choice is the $\ell_1$ penalty, with $P_1(a) = \|a\|_1$ and $P_2(b) = \|b\|_1$. sCCA forces some loadings of the canonical vectors to zero and thus selects only subsets of correlated $X$ and $Y$ variables. However, sCCA methods may neither fully recover the correlated $X$ and $Y$ pairs nor capture multivariate-to-multivariate linkage patterns (see Figure 3), because the $\ell_1$ shrinkage tends to select only a small subset of the associated $X$ and $Y$ variables.
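For concreteness, the following is a minimal sketch of the sCCA criterion above, in the spirit of the penalized matrix decomposition of Witten et al. (2009): alternating soft-thresholding updates of a and b on centered data matrices X (n x p) and Y (n x q). Using fixed soft-threshold levels in place of the exact ℓ1-constraint search, and unit-norm scaling in place of the a'X'Xa constraint, are simplifications assumed here for brevity; this is not the graph-based method proposed in this work.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_cca(X, Y, k1=0.3, k2=0.3, n_iter=100):
    """Alternating sparse updates for max a' X'Y b with sparsity on a and b.
    Fixed threshold levels k1, k2 and unit-norm scaling are illustrative
    simplifications of the constrained criterion in Witten et al. (2009)."""
    C = X.T @ Y                       # cross-product matrix X'Y
    p, q = C.shape
    a = np.ones(p) / np.sqrt(p)       # initial canonical vectors
    b = np.ones(q) / np.sqrt(q)
    for _ in range(n_iter):
        a = soft_threshold(C @ b, k1)
        a /= np.linalg.norm(a) + 1e-12
        b = soft_threshold(C.T @ a, k2)
        b /= np.linalg.norm(b) + 1e-12
    rho = np.corrcoef(X @ a, Y @ b)[0, 1]   # correlation of the selected variates
    return a, b, rho
```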
Thompson Sampling in Partially Observable Contextual Bandits
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn the arms of highest reward given the contextual information, while the unknown reward parameters of each arm can only be learned by experimenting with that specific arm. Accordingly, a fundamental problem is that of balancing exploration (i.e., pulling different arms to learn their parameters) versus exploitation (i.e., pulling the best arms to gain reward). To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on the observed data, which are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish the following: (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) scaling of the regret with other quantities, including the dimension and the number of arms. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling. To establish the results, we introduce novel martingale techniques and concentration inequalities to address partially observed dependent random variables generated from unspecified distributions, and also leverage problem-dependent information to sharpen probabilistic bounds for time-varying suboptimality gaps. These techniques pave the way toward studying other decision-making problems with contextual information as well as partial observations.
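For intuition, here is a minimal, self-contained simulation of a Thompson sampling policy in a partially observed setting of this kind: the context of each round is unobserved, the policy only sees a noisy linear function of it, and each arm keeps Gaussian posterior (ridge-style) statistics over a parameter linking observations to rewards. The problem sizes, noise levels, and the choice to regress rewards directly on the observations are illustrative simplifications assumed here, not the exact algorithm or model analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (hypothetical, not taken from the paper).
d, n_arms, horizon = 5, 3, 2000
theta = rng.normal(size=(n_arms, d))      # true (unknown) arm parameters
A = rng.normal(size=(d, d))               # sensing matrix: observation = A @ context + noise

# Per-arm ridge/posterior statistics: B_i = lam*I + sum y y', f_i = sum reward * y
lam, sigma2 = 1.0, 1.0
B = np.stack([lam * np.eye(d) for _ in range(n_arms)])
f = np.zeros((n_arms, d))

regret = 0.0
for t in range(horizon):
    x = rng.normal(size=d)                # unobserved context
    y = A @ x + 0.5 * rng.normal(size=d)  # partial (noisy linear) observation

    # Thompson sampling: draw a parameter from each arm's Gaussian posterior,
    # then act greedily with respect to the sampled parameters.
    sampled_rewards = []
    for i in range(n_arms):
        B_inv = np.linalg.inv(B[i])
        theta_hat = B_inv @ f[i]
        theta_tilde = rng.multivariate_normal(theta_hat, sigma2 * B_inv)
        sampled_rewards.append(y @ theta_tilde)
    arm = int(np.argmax(sampled_rewards))

    # True expected rewards depend on the unobserved context x.
    means = theta @ x
    reward = means[arm] + 0.3 * rng.normal()
    regret += means.max() - means[arm]

    # Update the pulled arm's posterior with the observation-reward pair.
    B[arm] += np.outer(y, y)
    f[arm] += reward * y

print(f"cumulative regret after {horizon} rounds: {regret:.1f}")
```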
Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm is the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on assuming that the contexts are fully observed, while the study of the richer model of imperfectly observed contextual bandits remains immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the efficiency of Greedy policies is also provided.
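To make the Greedy rule concrete, the sketch below shows only the arm-selection step, reusing the per-arm ridge statistics B and f and the noisy observation y from the Thompson sampling sketch above; the names and shapes are assumptions for illustration. Greedy plugs in the point estimates as if they were the true values, with no posterior sampling and no optimism bonus. This is an illustrative reading of the policy, not the authors' exact pseudocode.

```python
import numpy as np

def greedy_arm(B, f, y):
    """Greedy selection: act as if the ridge estimates were the true parameters.
    B: (n_arms, d, d) regularized Gram matrices, f: (n_arms, d) response sums,
    y: (d,) current noisy observation of the context."""
    theta_hat = np.stack([np.linalg.solve(B[i], f[i]) for i in range(len(f))])
    return int(np.argmax(theta_hat @ y))   # no posterior sampling, no optimism
```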
Efficient Algorithms for Learning to Control Bandits with Unobserved Contexts
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual bandits are commonly used for sequential decision-making with finitely many control actions. In this setting, available context observations can be utilized in a tractable way, thanks to the linearity of the relationship between the reward and the context vectors. The arms provide rewards depending on the contexts that represent their individual characteristics. The range of real-world applications is notably extensive, including personalized recommendations for Mobile Context-Aware Recommender Systems and mobile-health interventions [1, 2, 3]. To obtain satisfactory performance in bandits, the exploration-exploitation trade-off must be addressed. The theoretical analysis of efficient policies for multi-armed bandits goes back to algorithms that decide based on Upper-Confidence-Bounds (UCB) [4]. UCB employs an optimistic approximation of the unknown reward, built from the history of observations, to allow an appropriate degree of exploration. Further theoretical results for UCB in contextual bandits, as well as in other settings, are available in the literature [5, 6, 7, 8, 9]. Posterior sampling is another ubiquitous reinforcement learning algorithm that effectively balances exploitation versus exploration.
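As background on the optimism principle mentioned above, the snippet below sketches a generic LinUCB-style selection rule for a linear contextual bandit with fully observed contexts: the ridge estimate built from the history is inflated by an exploration bonus controlled by a tuning parameter alpha. The interface and parameter names are assumptions for illustration; this is standard background, not the algorithm developed in this work.

```python
import numpy as np

def ucb_arm(B, f, contexts, alpha=1.0):
    """Optimistic (UCB) selection for a linear contextual bandit.
    B: (n_arms, d, d) regularized Gram matrices, f: (n_arms, d) response sums,
    contexts: (n_arms, d) observed context of each arm, alpha: confidence width."""
    scores = []
    for i, x in enumerate(contexts):
        B_inv = np.linalg.inv(B[i])
        theta_hat = B_inv @ f[i]                  # ridge estimate from the history
        bonus = alpha * np.sqrt(x @ B_inv @ x)    # exploration bonus (confidence width)
        scores.append(theta_hat @ x + bonus)      # optimistic reward estimate
    return int(np.argmax(scores))
```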
Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual multi-armed bandits are classical models in reinforcement learning for sequential decision-making associated with individual information. A widely used policy for bandits is Thompson Sampling, where samples from a data-driven probabilistic belief about the unknown parameters are used to select the control actions. For this computationally fast algorithm, performance analyses are available under full context observations. However, little is known for problems in which the contexts are not fully observed. We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits, and establish theoretical performance guarantees. Technically, we show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension. Further, we establish rates of learning the unknown parameters, and provide illustrative numerical analyses.