Generalized Residual Ratio Thresholding

arXiv.org Machine Learning

Simultaneous orthogonal matching pursuit (SOMP) and block OMP (BOMP) are two widely used techniques for sparse support recovery in multiple measurement vector (MMV) and block sparse (BS) models respectively. For optimal performance, both SOMP and BOMP require \textit{a priori} knowledge of signal sparsity or noise variance. However, sparsity and noise variance are unavailable in most practical applications. This letter presents a novel technique called generalized residual ratio thresholding (GRRT) for operating SOMP and BOMP without \textit{a priori} knowledge of signal sparsity and noise variance, and derives finite-sample and finite signal-to-noise ratio (SNR) guarantees for exact support recovery. Numerical simulations indicate that GRRT performs similarly to BOMP and SOMP with \textit{a priori} knowledge of signal and noise statistics.
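For orientation, the following is a minimal sketch of textbook SOMP, not of GRRT itself, whose thresholding rule is not given in the abstract. The function name `somp` and its arguments are illustrative assumptions. Note that the sparsity level `k` must be supplied by the caller, which is exactly the prior knowledge the letter aims to remove.

```python
import numpy as np

def somp(A, Y, k):
    """Textbook SOMP: greedily select k columns of A that jointly explain all columns of Y.

    A: (m, n) measurement matrix, Y: (m, L) measurement vectors, k: assumed row sparsity.
    """
    residual = Y.copy()
    support = []
    for _ in range(k):
        # Correlate each column of A with the residual, aggregated over the L measurement vectors.
        scores = np.linalg.norm(A.T @ residual, axis=1)
        # Exclude columns that are already in the support.
        scores[np.array(support, dtype=int)] = -np.inf
        support.append(int(np.argmax(scores)))
        # Least-squares fit on the current support, then recompute the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], Y, rcond=None)
        residual = Y - A[:, support] @ coef
    return sorted(support)

# Toy usage: rows 1 and 4 of X are nonzero, and A is the identity for simplicity.
A = np.eye(6)
X = np.zeros((6, 3))
X[1] = [1.0, 2.0, 3.0]
X[4] = [-2.0, 0.5, 1.0]
recovered = somp(A, A @ X, k=2)
```

Running the loop for too few or too many iterations misses or over-selects support entries, which is why a stopping rule that needs neither the sparsity nor the noise variance is of practical interest.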


Comparison of Classification Methods for Very High-Dimensional Data in Sparse Random Projection Representation

arXiv.org Machine Learning

Machine learning is a mature scientific field with many theoretical results, established algorithms and processes that address various supervised and unsupervised problems using the provided data. In theoretical research, such data is generated in a convenient way, or various methods are compared on standard benchmark problems, where data samples are represented as dense real-valued vectors of fixed and relatively low length. Practical applications represented by such standard datasets can successfully be solved by one of a myriad of existing machine learning methods and their implementations. However, the greatest impact of machine learning is currently in the big data field, with problems that are easily stated in natural language ("Find malicious files", "Is that website safe to browse?") but are hard to encode numerically. Data samples in these problems have distinct features coming from a huge unordered set of possible features. The same approach can cover the frequent case of missing feature values [10, 28].


The Brier Score under Administrative Censoring: Problems and Solutions

arXiv.org Machine Learning

Box 1053 Blindern 0316 Oslo, Norway

Abstract

The Brier score is commonly used for evaluating probability predictions. In survival analysis, with right-censored observations of the event times, this score can be weighted by the inverse probability of censoring (IPCW) to retain its original interpretation. It is common practice to estimate the censoring distribution with the Kaplan-Meier estimator, even though it assumes that the censoring distribution is independent of the covariates. This paper discusses the general impact of the censoring estimates on the Brier score and shows that the estimation of the censoring distribution can be problematic. In particular, when the censoring times can be identified from the covariates, the IPCW score is no longer valid. For administratively censored data, where the potential censoring times are known for all individuals, we propose an alternative version of the Brier score. This administrative Brier score does not require estimation of the censoring distribution and is valid even if the censoring times can be identified from the covariates.

Keywords: survival analysis, time-to-event prediction, customer churn, inverse probability weighting, progressive type I censoring

1. Introduction

Recently, there has been an increasing interest in combining machine learning methodology with survival analysis for improved time-to-event prediction. Also worth mentioning is the Random Survival Forest (Ishwaran et al., 2008), which builds decision trees based on the log-rank test and estimates the cumulative hazards with the Nelson-Aalen estimator. Although these methods are available for right-censored event times, a substantial part of the machine learning community is not familiar with survival analysis and might find it reasonable to instead apply binary classifiers for time-to-event prediction.
In short, a binary classifier estimates the probability that an individual experiences the event by time t, and can be fitted by disregarding individuals censored before that time. Arguably, the two most common evaluation criteria for survival predictions are the inverse probability of censoring weighted (IPCW) Brier score (Graf et al., 1999; Gerds and Schumacher, 2006) and different versions of the concordance index (Harrell Jr et al., 1982; Antolini et al., 2005; Uno et al., 2011; Gerds et al., 2013).
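To make the IPCW Brier score concrete, here is a minimal sketch of the usual form attributed to Graf et al. (1999), with the censoring distribution estimated by Kaplan-Meier. The function names `km_censoring` and `ipcw_brier` are hypothetical, and the sketch evaluates the censoring estimate at the event time itself rather than at its left limit, which is a simplification. It illustrates the standard score only, not the administrative Brier score proposed in the paper.

```python
import numpy as np

def km_censoring(times, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censorings as the 'events'."""
    times = np.asarray(times, dtype=float)
    censored = 1 - np.asarray(events, dtype=int)
    grid = np.unique(times)
    at_risk = np.array([(times >= u).sum() for u in grid])
    n_cens = np.array([censored[times == u].sum() for u in grid])
    surv = np.cumprod(1.0 - n_cens / at_risk)

    def G(t):
        # Step function: value of the last grid point <= t, or 1 before the first one.
        idx = np.searchsorted(grid, t, side="right") - 1
        return np.where(idx >= 0, surv[np.clip(idx, 0, None)], 1.0)

    return G

def ipcw_brier(t, surv_pred, times, events, G):
    """IPCW Brier score at time t; surv_pred[i] is the predicted P(T_i > t | x_i)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv_pred = np.asarray(surv_pred, dtype=float)
    died = (times <= t) & (events == 1)   # event observed by t, weight 1 / G(T_i)
    alive = times > t                     # still at risk at t, weight 1 / G(t)
    score = np.zeros_like(surv_pred)      # individuals censored before t contribute 0
    score[died] = surv_pred[died] ** 2 / G(times[died])
    score[alive] = (1.0 - surv_pred[alive]) ** 2 / G(t)
    return score.mean()

# Toy usage: the second individual is censored at time 2.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 0, 1, 1]
G = km_censoring(times, events)
bs = ipcw_brier(2.5, [0.5, 0.5, 0.5, 0.5], times, events, G)
```

Because `G` is estimated without covariates, the weights are only appropriate when censoring is independent of the covariates, which is the assumption the abstract calls into question.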


Analytic expressions for the output evolution of a deep neural network

arXiv.org Machine Learning

Anastasia Borovykh December 19, 2019

Abstract

We present a novel methodology based on a Taylor expansion of the network output for obtaining analytical expressions for the expected value of the network weights and output under stochastic training. Using these analytical expressions, the effects of the hyperparameters and the noise variance of the optimization algorithm on the performance of the deep neural network are studied. In the early phases of training with a small noise coefficient, the output is equivalent to a linear model. In this case the network can generalize better because the noise prevents the output from fully converging on the training data; however, the noise does not result in any explicit regularization. In the later training stages, when higher order approximations are required, the impact of the noise becomes more significant, i.e. in a model which is nonlinear in the weights, noise can regularize the output function, resulting in better generalization, as witnessed by its influence on the weight Hessian, a commonly used metric for generalization capabilities.

Keywords: deep learning; Taylor expansion; stochastic gradient descent; regularization; generalization

1 Introduction

With the large number of applications which are nowadays in some way using deep learning, it is of significant value to gain insight into the output evolution of a deep neural network and the effects that the model architecture and optimization algorithm have on it. A deep neural network is a complex model due to the nonlinear dependencies and the large number of parameters in the model. Understanding the network output and its generalization capabilities, i.e. how well a model optimized on training data will be able to perform on unseen test data, is thus a complex task. One way of gaining insight into the network is by studying it in a large-parameter limit, a setting in which its dynamics become analytically tractable. Such limits have been considered in e.g. 
The generalization capabilities and the definition of various quantities that measure these have been studied extensively. Previous work has shown that the norm [3], [27], [19], the width of a minimum in weight space [11], [34], the input sensitivity [28] and a model's compressibility [2] can be related (either theoretically or in practice) to the model's complexity and thus its ability to perform well on unseen data. Furthermore, it has been noted that the generalization capabilities can be influenced by the optimization algorithm used to train the model, e.g. it can be used to bias the model into configurations that are more robust to noise and have lower model complexity, see e.g. Furthermore, it has been observed that certain parameters of stochastic gradient descent (SGD) can be used to control the generalization error and the data fit, see e.g.


Distributional Reinforcement Learning for Energy-Based Sequential Models

arXiv.org Machine Learning

Global Autoregressive Models (GAMs) are a recent proposal [Parshakova et al., CoNLL 2019] for exploiting global properties of sequences for data-efficient learning of seq2seq models. In the first phase of training, an Energy-Based model (EBM) over sequences is derived. This EBM has high representational power, but is unnormalized and cannot be directly exploited for sampling. To address this issue, [Parshakova et al., CoNLL 2019] proposes a distillation technique, which can only be applied under limited conditions. By relating this problem to Policy Gradient techniques in RL, but from a \emph{distributional} rather than an \emph{optimization} perspective, we propose a general approach applicable to any sequential EBM. Its effectiveness is illustrated on GAM-based experiments.


Provable Non-Convex Optimization and Algorithm Validation via Submodularity

arXiv.org Machine Learning

Submodularity is one of the most well-studied properties of problem classes in combinatorial optimization and many applications of machine learning and data mining, with strong implications for guaranteed optimization. In this thesis, we investigate the role of submodularity in provable non-convex optimization and validation of algorithms. A profound understanding of which classes of functions can be tractably optimized remains a central challenge for non-convex optimization. By advancing the notion of submodularity to continuous domains (termed "continuous submodularity"), we characterize a class of generally non-convex and non-concave functions -- continuous submodular functions -- and derive algorithms for approximately maximizing them with strong approximation guarantees. Meanwhile, continuous submodularity captures a wide spectrum of applications, ranging from revenue maximization with general marketing strategies and MAP inference for DPPs to mean field inference for probabilistic log-submodular models, which makes it valuable domain knowledge for optimizing this class of objectives. Validation of algorithms is an information-theoretic framework to investigate the robustness of algorithms to fluctuations in the input/observations and their generalization ability. We investigate various algorithms for one of the paradigmatic unconstrained submodular maximization problems: MaxCut. Due to the submodularity of the MaxCut objective, we are able to present efficient approaches to calculate the algorithmic information content of MaxCut algorithms. The results provide insights into the robustness of different algorithmic techniques for MaxCut.


Neural networks and kernel ridge regression for excited states dynamics of CH$_2$NH$_2^+$: From single-state to multi-state representations and multi-property machine learning models

arXiv.org Machine Learning

Excited-state dynamics simulations are a powerful tool to investigate photo-induced reactions of molecules and materials and provide complementary information to experiments. Since the applicability of these simulation techniques is limited by the costs of the underlying electronic structure calculations, we develop and assess different machine learning models for this task. The machine learning models are trained on \emph{ab initio} calculations for excited electronic states, using the methylenimmonium cation (CH$_2$NH$_2^+$) as a model system. For the prediction of excited-state properties, multiple outputs are desirable, which is straightforward with neural networks but less explored with kernel ridge regression. We overcome this challenge for kernel ridge regression in the case of energy predictions by encoding the electronic states explicitly in the inputs, in addition to the molecular representation. We also adopt this strategy for our neural networks for comparison. Such a state encoding not only enables kernel ridge regression with multiple outputs but also leads to more accurate machine learning models for state-specific properties. An important goal for excited-state machine learning models is their use in dynamics simulations, which need not only state-specific information but also couplings, i.e., properties involving pairs of states. Accordingly, we investigate the performance of different models for such coupling elements. Furthermore, we explore how combining all properties in a single neural network affects the accuracy. As an ultimate test for our machine learning models, we carry out excited-state dynamics simulations based on the predicted energies, forces and couplings and, thus, show the scope and possibilities of machine learning for the treatment of electronically excited states.
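The idea of encoding the electronic state in the input can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the state is encoded, so the one-hot label, the Gaussian kernel, the toy one-dimensional descriptor and all function names (`encode`, `gaussian_kernel`, `krr_fit`, `krr_predict`) are hypothetical choices, not the authors' setup.

```python
import numpy as np

def encode(descriptors, n_states):
    """One row per (geometry, state) pair: molecular descriptor followed by a one-hot state label."""
    one_hot = np.eye(n_states)
    rows = [np.concatenate([d, one_hot[s]]) for d in descriptors for s in range(n_states)]
    return np.array(rows)

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, sigma=1.0, lam=1e-8):
    """Solve (K + lam * I) alpha = y for the regression coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Toy data: 3 geometries with a 1-D descriptor and energies for 2 electronic states.
descriptors = np.array([[0.0], [1.0], [2.0]])
energies = np.array([[0.0, 1.0], [0.5, 1.5], [2.0, 3.0]])

X = encode(descriptors, n_states=2)
alpha = krr_fit(X, energies.ravel())   # ravel order matches the (geometry, state) row order
pred = krr_predict(X, alpha, X).reshape(energies.shape)
```

With this layout a single scalar-output model serves all states, since the state label is just another part of the input.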


Inverse Graph Learning over Optimization Networks

arXiv.org Machine Learning

Many inferential and learning tasks can be accomplished efficiently by means of distributed optimization algorithms where the network topology plays a critical role in driving the local interactions among neighboring agents. There is a large body of literature examining the effect of the graph structure on the performance of optimization strategies. In this article, we examine the inverse problem and consider the reverse question: How much information does observing the behavior at the nodes convey about the underlying network structure used for optimization? Over large-scale networks, the difficulty of addressing such inverse questions (or problems) is compounded by the fact that usually only a limited portion of nodes can be probed, giving rise to a second important question: Despite the presence of several unobserved nodes, are partial and local observations still sufficient to discover the graph linking the probed nodes? The article surveys recent advances on this inverse learning problem and related questions. Examples of applications are provided to illustrate how the interplay between graph learning and distributed optimization arises in practice, e.g., in cognitive engineered systems such as distributed detection, or in other real-world problems such as the mechanism of opinion formation over social networks and the mechanism of coordination in biological networks. A unifying framework for examining the reconstruction error will be described, which allows one to devise and examine various estimation strategies enabling successful graph learning. The relevance of specific network attributes, such as sparsity versus density of connections, and node degree concentration, is discussed in relation to the topology inference goal. It is shown how universal (i.e., data-driven) clustering algorithms can be exploited to solve the graph learning problem.


Tree pyramidal adaptive importance sampling

arXiv.org Machine Learning

This paper introduces Tree-Pyramidal Adaptive Importance Sampling (TP-AIS), a novel iterated sampling method that outperforms current state-of-the-art approaches. TP-AIS iteratively builds a proposal distribution parameterized by a tree pyramid, where each tree leaf spans a convex subspace and represents its importance density. After each new sample operation, a set of tree leaves is subdivided, improving the approximation of the proposal distribution to the target density. Unlike the rest of the methods in the literature, TP-AIS is parameter free and requires zero manual tuning to achieve its best performance. We evaluate the proposed method on randomized target probability density functions of varying complexity and also analyze its application to different dimensions. The results are compared to state-of-the-art iterative importance sampling approaches and other baseline MCMC approaches using Normalized Effective Sample Size (N-ESS), Jensen-Shannon Divergence to the target posterior, and time complexity.
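As a reference point for the N-ESS metric, the sketch below uses the common definition of effective sample size for importance weights, ESS = (sum of w)^2 / (sum of w^2), divided by the number of samples. The abstract does not spell out the exact definition used in the paper, so this form and the function name `normalized_ess` are assumptions.

```python
import numpy as np

def normalized_ess(weights):
    """Normalized effective sample size of importance weights, a value in (0, 1]."""
    w = np.asarray(weights, dtype=float)
    ess = w.sum() ** 2 / (w ** 2).sum()
    return ess / len(w)

# Equal weights: the proposal matches the target and every sample counts fully.
uniform = normalized_ess([1.0, 1.0, 1.0, 1.0])
# One dominant weight: effectively a single useful sample out of four.
degenerate = normalized_ess([1.0, 0.0, 0.0, 0.0])
```

Values close to 1 indicate that the proposal approximates the target well, while values close to 1/N indicate that a single sample dominates the estimate.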


Preventing Information Leakage with Neural Architecture Search

arXiv.org Machine Learning

Powered by machine learning services in the cloud, numerous learning-driven mobile applications are gaining popularity in the market. As deep learning tasks are mostly computation-intensive, it has become a trend to process raw data on devices and send the neural network features to the cloud, while the part of the neural network residing in the cloud completes the task and returns the final results. However, there is always the potential for unexpected leakage with the release of features, from which an adversary could infer a significant amount of information about the original data. To address this problem, we propose a privacy-preserving deep learning framework on top of the mobile cloud infrastructure: the trained deep neural network is tailored to prevent information leakage through features while maintaining highly accurate results. In essence, we learn the strategy to prevent leakage by modifying the trained deep neural network against a generic opponent, who infers unintended information from released features and auxiliary data, while preserving the accuracy of the model as much as possible.