Miyaguchi, Kohei
Cumulative Stay-time Representation for Electronic Health Records in Medical Event Time Prediction
Katsuki, Takayuki, Miyaguchi, Kohei, Koseki, Akira, Iwamori, Toshiya, Yanagiya, Ryosuke, Suzuki, Atsushi
We address the problem of predicting when a disease will develop, i.e., the medical event time (MET), from a patient's electronic health record (EHR). The MET of non-communicable diseases like diabetes is highly correlated with cumulative health conditions, more specifically, with how much time the patient has spent with specific health conditions in the past. Common time-series representations extract such information from EHRs only indirectly because they focus on detailed dependencies between values in successive observations rather than on cumulative information. We propose a novel data representation for EHRs called the cumulative stay-time representation (CTR), which directly models such cumulative health conditions. We derive a trainable construction of CTR based on neural networks that has the flexibility to fit the target data and the scalability to handle high-dimensional EHRs. Numerical experiments using synthetic and real-world datasets demonstrate that CTR alone achieves high prediction performance and that it enhances the performance of existing models when combined with them.
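To make the underlying quantity concrete, here is a minimal sketch of a fixed-binning stay-time histogram for a single health indicator; the paper's CTR is a trainable, neural-network-based construction, so the state definition and bin edges below are purely illustrative assumptions.

    import numpy as np

    def cumulative_stay_time(timestamps, values, bin_edges):
        """Cumulative stay-time histogram (illustrative sketch, fixed bins).

        timestamps : (T,) observation times, sorted ascending
        values     : (T,) one health indicator (e.g., HbA1c)
        bin_edges  : edges discretizing the indicator into health states
        Assumes the indicator is piecewise constant between observations
        and returns the total time spent in each state.
        """
        durations = np.diff(timestamps)               # time between observations
        states = np.digitize(values[:-1], bin_edges)  # state held during each interval
        ctr = np.zeros(len(bin_edges) + 1)
        np.add.at(ctr, states, durations)             # accumulate stay time per state
        return ctr

    # Days spent in each (hypothetical) HbA1c range: <5.7, 5.7-6.5, >6.5
    t = np.array([0.0, 30.0, 90.0, 180.0])
    hba1c = np.array([5.5, 6.1, 6.8, 7.2])
    print(cumulative_stay_time(t, hba1c, bin_edges=[5.7, 6.5]))  # [30. 60. 90.]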
Biases in In Silico Evaluation of Molecular Optimization Methods and Bias-Reduced Evaluation Methodology
Kajino, Hiroshi, Miyaguchi, Kohei, Osogami, Takayuki
Molecular optimization aims to discover novel molecules with improved properties, a task often formulated as a reinforcement learning problem by modeling the construction of a molecule as a Markov decision process. The performance of such agents is measured by the quality of the generated molecules. In the machine learning community, most molecular optimization methods have been validated in silico, i.e., in computer simulation. Since most of the generated molecules are novel, their properties are unknown, and we have to resort to a predictor to estimate them. However, little attention has been paid to how reliable such estimates are, which makes the existing performance estimates less reliable. In this paper, we study the statistical performance of such performance estimators to enhance our understanding of the evaluation protocol, and we discuss several directions to improve it.
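As a minimal sketch of the in-silico practice described above (all names here are hypothetical stand-ins, not the paper's code): molecules sampled from the agent are scored by a trained property predictor, and the mean predicted score is reported as the performance, which is exactly where predictor error enters the estimate.

    import numpy as np

    def estimate_performance(generator, predictor, n_samples=1000, seed=0):
        """In-silico performance estimate (sketch; names are hypothetical).

        generator : callable(rng) -> a generated molecule
        predictor : callable(molecule) -> predicted property value
        Returns the sample mean of predicted properties; since the true
        properties of novel molecules are unknown, any predictor error
        propagates directly into this estimate.
        """
        rng = np.random.default_rng(seed)
        scores = np.array([predictor(generator(rng)) for _ in range(n_samples)])
        return scores.mean()

    # Toy usage with dummy stand-ins for the generator and predictor
    toy_gen = lambda rng: rng.normal()   # "molecule" = a latent code
    toy_pred = lambda m: -m ** 2         # "property" = a score of the code
    print(estimate_performance(toy_gen, toy_pred))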
PAC-Bayesian Transportation Bound
Miyaguchi, Kohei
We present a new generalization error bound, the \emph{PAC-Bayesian transportation bound}, which unifies the PAC-Bayesian analysis and the generic chaining method through the lens of optimal transportation. The proposed bound is the first PAC-Bayesian framework that characterizes the cost of de-randomizing stochastic predictors under arbitrary Lipschitz loss functions. As an example, we give an upper bound on the de-randomization cost of spectrally normalized neural networks~(NNs) to evaluate how much randomness contributes to the generalization of NNs.
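For context, a classical PAC-Bayesian bound (the standard starting point of such analyses, not the transportation bound itself) states that for a prior $\pi$, any posterior $\rho$, and a loss bounded in $[0,1]$, with probability at least $1-\delta$ over $n$ i.i.d. samples,
\[
\mathbb{E}_{h \sim \rho}[L(h)] \;\le\; \mathbb{E}_{h \sim \rho}[\hat{L}_n(h)] + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \log\frac{2\sqrt{n}}{\delta}}{2n}},
\]
where $L$ and $\hat{L}_n$ denote the population and empirical risks. Bounds of this form control only the stochastic predictor $h \sim \rho$; the transportation bound additionally quantifies the cost of replacing $h \sim \rho$ with a single deterministic predictor under Lipschitz losses.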
Adaptive Minimax Regret against Smooth Logarithmic Losses over High-Dimensional $\ell_1$-Balls via Envelope Complexity
Miyaguchi, Kohei, Yamanishi, Kenji
We develop a new theoretical framework, the \emph{envelope complexity}, to analyze the minimax regret with logarithmic loss functions, and we derive a Bayesian predictor that adaptively achieves the minimax regret over high-dimensional $\ell_1$-balls within a factor of two. The prior, newly derived to achieve the minimax regret, is called the \emph{spike-and-tails~(ST) prior} after its shape. The resulting regret bound is simple: up to logarithmic factors, it is completely determined by the smoothness of the loss function and the radius of the balls, and it generalizes the form of existing regret/risk bounds. In a preliminary experiment, we confirm that the ST prior outperforms the conventional minimax-regret prior under non-high-dimensional asymptotics.
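For reference, the minimax regret under logarithmic loss analyzed by the framework has the standard form (the notation here is generic, not taken from the paper): for a model class $\{p_\theta : \theta \in \Theta\}$ with $\Theta$ an $\ell_1$-ball,
\[
\mathrm{REG}_n \;=\; \inf_{q} \sup_{x^n} \left[ \sup_{\theta \in \Theta} \log p_\theta(x^n) - \log q(x^n) \right],
\]
where the infimum ranges over probability distributions $q$ over the data. The envelope complexity upper-bounds this quantity, and the Bayes mixture under the ST prior attains the bound within a factor of two.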
High-dimensional Penalty Selection via Minimum Description Length Principle
Miyaguchi, Kohei, Yamanishi, Kenji
We tackle the problem of penalty selection for regularization on the basis of the minimum description length (MDL) principle. In particular, we consider the case where the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood (LNML) minimization approach is favorable because LNML quantifies the goodness of regularized models with any form of penalty function in view of the MDL principle and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: 1) the computation of the normalizing factor of LNML and 2) its minimization in high-dimensional spaces. In this paper, we present a novel regularization selection method (MDL-RS) in which a tight upper bound of LNML (uLNML) is minimized with a local convergence guarantee. Our main contribution is the derivation of uLNML, a uniform-gap upper bound of LNML in analytic form. This addresses both challenges in an approximate manner, since it allows us to accurately approximate LNML and then minimize the approximation efficiently. Experimental results show that MDL-RS improves the generalization performance of regularized estimates, especially when the model has redundant parameters.
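For reference, the LNML code length minimized by MDL-RS has the standard luckiness-NML form (generic notation assumed here): for data $x^n$, model $p(\cdot \mid \theta)$, and penalty function $w$,
\[
\mathrm{LNML}_w(x^n) \;=\; -\log \sup_{\theta} \bigl[p(x^n \mid \theta)\, e^{-w(\theta)}\bigr] \;+\; \log \int \sup_{\theta} \bigl[p(y^n \mid \theta)\, e^{-w(\theta)}\bigr]\, \mathrm{d}y^n,
\]
where the integral is the normalizing factor of challenge 1). uLNML replaces this intractable term with an analytic upper bound whose gap is uniform, so minimizing uLNML over penalty functions approximately minimizes LNML itself.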