The concept of free energy has its origins in 19th-century thermodynamics but has recently found its way into the behavioral and neural sciences, where it has been promoted for its wide applicability and has even been suggested as a fundamental principle for understanding intelligent behavior and brain function. We argue that there are essentially two different notions of free energy in current models of intelligent agency, both of which can be considered applications of Bayesian inference to the problem of action selection: one that appears when trading off accuracy and uncertainty based on a general maximum entropy principle, and one that formulates action selection in terms of minimizing an error measure quantifying deviations of beliefs and policies from given reference models. The first approach provides a normative rule for action selection in the face of model uncertainty or when information-processing capabilities are limited. The second approach directly formulates the action selection problem as an inference problem in the context of Bayesian brain theories, known in the literature as Active Inference. We elucidate the main ideas and discuss critical technical and conceptual issues revolving around these two notions of free energy, both of which claim to apply at all levels of decision-making, from the high-level deliberation of reasoning down to the low-level information-processing of perception.
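The variational free energy underlying the second (Active Inference) notion can be made concrete with the standard decomposition F(q) = E_q[-log p(o|s)] + KL(q(s) || p(s)), i.e. expected inaccuracy plus complexity. The toy calculation below (all numbers are our own invention for illustration) checks that the exact Bayesian posterior minimizes this free energy and attains the negative log evidence:

```python
import numpy as np

# Toy discrete world: hidden state s in {0, 1}, binary observation o.
p_s = np.array([0.5, 0.5])            # prior over hidden states
p_o_given_s = np.array([0.9, 0.2])    # p(o = 1 | s) for each state
o = 1                                 # the observed outcome

def free_energy(q):
    """Variational free energy F(q) = E_q[-log p(o|s)] + KL(q || p_s)."""
    lik = np.where(o == 1, p_o_given_s, 1 - p_o_given_s)
    accuracy = -np.sum(q * np.log(lik))        # expected surprise of the data
    complexity = np.sum(q * np.log(q / p_s))   # divergence from the prior
    return accuracy + complexity

# The exact posterior minimizes F, and F(posterior) = -log p(o).
post = p_s * p_o_given_s
post /= post.sum()

# Grid search over candidate beliefs q = [a, 1 - a] confirms the minimum.
qs = [np.array([a, 1 - a]) for a in np.linspace(0.01, 0.99, 99)]
best = min(qs, key=free_energy)
```

Here p(o = 1) = 0.5 * 0.9 + 0.5 * 0.2 = 0.55, so the minimal free energy equals -log 0.55.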
Design of experiments, random search, initialization of population-based methods, and sampling inside an epoch of an evolutionary algorithm all use a sample drawn according to some probability distribution to approximate the location of an optimum. Recent papers have shown that the optimal search distribution used for this sampling might be more peaked around its center than the prior distribution that models our uncertainty about the location of the optimum. We confirm this statement, provide explicit values for this reshaping of the search distribution as a function of the population size $\lambda$ and the dimension $d$, and validate our results experimentally.
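A minimal Monte Carlo sketch of the effect (a toy setting with parameter values chosen by us, not the paper's explicit formulas): with a standard Gaussian prior over the optimum's location in dimension d, drawing a population of lambda candidates from a rescaled, more peaked version of that prior yields, on average, a better best candidate than sampling from the prior itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, trials = 10, 10, 300  # dimension, population size, MC repetitions

def best_of_sample(scale):
    """Average squared distance of the best of `lam` candidates to the optimum,
    when candidates are drawn from the prior rescaled by `scale`."""
    total = 0.0
    for _ in range(trials):
        x_star = rng.standard_normal(d)        # optimum drawn from the N(0, I) prior
        cand = scale * rng.standard_normal((lam, d))
        total += np.min(np.sum((cand - x_star) ** 2, axis=1))
    return total / trials

# Shrink factors below 1.0 correspond to a search distribution more peaked
# than the prior; 1.0 corresponds to sampling from the prior itself.
scores = {s: best_of_sample(s) for s in (0.25, 0.5, 1.0)}
```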
We give safe screening rules to eliminate variables from regression with $\ell_0$ regularization or a cardinality constraint. These rules are based on guarantees that a feature may or may not be selected in an optimal solution. The screening rules can be computed from a convex relaxation solution in linear time, without solving the $\ell_0$ optimization problem; thus, they can be used in a preprocessing step to safely remove variables from consideration a priori. Numerical experiments on real and synthetic data indicate that, on average, 76% of the variables can be fixed to their optimal values, substantially reducing the computational burden of optimization. The proposed fast and effective screening rules therefore extend the scope of algorithms for $\ell_0$-regression to larger data sets.
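The paper's safe rules themselves are not reproduced here. As an illustrative stand-in for the general workflow (screen features cheaply from quantities tied to a convex relaxation before attacking the hard problem), the sketch below applies the basic, non-safe strong rule for the lasso, which discards feature j when |x_j^T y| < 2*lam - lam_max; all data and values are our own toy example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)               # unit-norm columns
beta_true = np.zeros(p)
beta_true[:3] = 2.0                          # three truly active features
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Basic strong rule for the lasso at penalty level lam:
# discard feature j if |x_j^T y| < 2*lam - lam_max.
corr = np.abs(X.T @ y)
lam_max = corr.max()                         # smallest penalty giving an all-zero solution
lam = 0.7 * lam_max
keep = corr >= 2 * lam - lam_max             # features that survive screening
```

In this toy instance the three truly active features survive while most null features are screened out before any solver is run.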
We develop a method for computing policies in Markov decision processes with risk-sensitive measures subject to temporal logic constraints. Specifically, we use a particular risk-sensitive measure from cumulative prospect theory, which has previously been adopted in psychology and economics. The nonlinear transformation of the probabilities and utility functions yields a nonlinear programming problem, which typically makes the computation of optimal policies challenging. We show that this nonlinear weighting function can be accurately approximated by the difference of two convex functions. This observation enables efficient policy computation using convex-concave programming. We demonstrate the effectiveness of the approach on several scenarios.
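Convex-concave programming itself can be illustrated on a one-dimensional difference-of-convex toy problem (chosen by us; it is not the paper's prospect-theoretic weighting function): minimize f(x) = g(x) - h(x) with g(x) = x^2 and h(x) = 2|x|, both convex, by repeatedly linearizing the concave part -h at the current iterate and solving the resulting convex subproblem:

```python
import numpy as np

# f(x) = x^2 - 2|x| has minima at x = +1 and x = -1 with value -1.
def ccp(x0, iters=20):
    """Convex-concave procedure: linearize -h(x) = -2|x| at the current
    iterate and solve the convex subproblem argmin_x x^2 - 2*s*x exactly."""
    x = x0
    for _ in range(iters):
        s = np.sign(x) if x != 0 else 1.0    # subgradient of h at x
        x = s                                # closed-form subproblem solution
    return x

x_min = ccp(x0=0.3)
```

The procedure converges to the local minimum on the side of the initial point, which is all the convex-concave procedure guarantees in general.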
Artificial intelligence has the potential to greatly simplify our lives – but not everyone is a data scientist, and not all data scientists are experts in machine learning. Enter AutoAI – a novel approach to designing, training, and optimizing machine learning models automatically. With AutoAI, anyone could soon build machine learning pipelines directly from raw data, without writing complex code or performing tedious tuning and optimization, and then automate complicated, labor-intensive tasks. Several IBM papers selected for the AAAI-20 conference in New York demonstrate the value of AutoAI and different approaches to it in great detail. Most AutoAI research currently focuses on three areas: automatically determining the best models for each step of the desired machine learning and data science pipeline (model selection), automatically finding the best architecture of a deep learning-based AI model, and automatically finding the best hyperparameters (parameters for the model training process) for AI models and algorithms.
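The third area, hyperparameter search, can be sketched in a few lines as a generic random-search loop (our own toy example on synthetic ridge regression; this is not IBM's AutoAI implementation): sample candidate hyperparameters at random and keep the one with the best validation score.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 80, 20
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.5 * rng.standard_normal(n)
Xtr, ytr, Xva, yva = X[:60], y[:60], X[60:], y[60:]

def ridge_fit(alpha):
    """Closed-form ridge regression weights on the training split."""
    return np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(p), Xtr.T @ ytr)

def val_mse(alpha):
    """Validation mean squared error for a given regularization strength."""
    return float(np.mean((Xva @ ridge_fit(alpha) - yva) ** 2))

# Random search: sample alphas log-uniformly, keep the best on validation data.
alphas = 10 ** rng.uniform(-3, 3, size=30)
best_alpha = min(alphas, key=val_mse)
```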
We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to re-pose such problems, via a suitable relaxation, as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem, e.g., a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in distribution space. Consistency of this approximation is established, and the numerical efficacy of the proposed algorithm is illustrated on simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.
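A minimal sketch of the relaxation on a toy instance of our own choosing: fitting f_theta(x) = tanh(theta * x) to data is non-convex in theta, but optimizing mixture weights over a fixed sample of parameter atoms is convex in the weights, here a plain linear least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(0)
# Target function to fit: y = tanh(2x). The model family f_theta(x) = tanh(theta*x)
# is non-convex in theta, but linear (hence convex) in mixture weights over
# a fixed set of sampled parameter atoms.
x = np.linspace(-2, 2, 200)
y = np.tanh(2 * x)

thetas = rng.uniform(-4, 4, size=50)          # atoms of the parameter distribution
Phi = np.tanh(np.outer(x, thetas))            # one column per atom

# Convex surrogate: least squares over the mixture weights.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = float(np.mean((Phi @ w - y) ** 2))
```

The mixture fit matches the target closely even though no single atom need equal the true parameter, illustrating that a distribution-space solution is at least as good as a solution in the original space.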
Recent works in high-dimensional model-predictive control and model-based reinforcement learning with learned dynamics and reward models have resorted to population-based optimization methods, such as the Cross-Entropy Method (CEM), for planning a sequence of actions. To decide on an action to take, CEM conducts a search for the action sequence with the highest return according to the dynamics model and reward. Action sequences are typically sampled at random from an unconditional Gaussian distribution and evaluated on the environment; this distribution is iteratively updated towards action sequences with higher returns. However, this planning method can be very inefficient, especially for high-dimensional action spaces. An alternative line of approaches optimizes action sequences directly via gradient descent but is prone to local optima. We propose a method that solves this planning problem by interleaving CEM and gradient descent steps when optimizing the action sequence. Our experiments show that the proposed hybrid approach converges faster, even for high-dimensional action spaces, avoids local minima, and performs as well as or better than CEM. Code accompanying the paper is available online.
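A minimal sketch of the interleaving idea, assuming a toy differentiable return with a known gradient (the actual method plans through learned dynamics and reward models): each iteration performs one CEM refit on elite samples followed by one gradient ascent step on the distribution mean.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 20                                       # flattened horizon x action-dim
a_star = rng.uniform(-1, 1, H)               # unknown best action sequence (toy)

def ret(a):
    """Toy differentiable return: negative squared distance to the optimum."""
    return -np.sum((a - a_star) ** 2, axis=-1)

def grad_ret(a):
    """Analytic gradient of the toy return."""
    return -2 * (a - a_star)

mu, sigma = np.zeros(H), np.ones(H)          # Gaussian over action sequences
for _ in range(30):
    # CEM step: sample a population, refit the Gaussian to the elites.
    pop = mu + sigma * rng.standard_normal((64, H))
    elites = pop[np.argsort(ret(pop))[-8:]]
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    # Gradient step: nudge the mean along the return gradient.
    mu = mu + 0.1 * grad_ret(mu)
```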
Ensemble-based methods are highly popular approaches that increase the accuracy of a decision by aggregating the opinions of individual voters. The common goal is to maximize accuracy; however, a natural limitation arises when incremental costs are also assigned to the individual voters. Consequently, we investigate creating ensembles under an additional constraint on the total cost of the members. This task can be formulated as a knapsack problem, where the energy is the ensemble accuracy formed by some aggregation rule. However, the generally applied aggregation rules lead to a nonseparable energy function, which rules out the common solution tools, such as dynamic programming. We introduce a novel stochastic approach that considers the energy as the joint probability function of the member accuracies. This type of knowledge can be efficiently incorporated into a stochastic search process as a stopping rule, since we have information on the expected accuracy or, alternatively, the probability of finding more accurate ensembles. Experimental analyses of the created ensembles of pattern classifiers and object detectors confirm the efficiency of our approach. Moreover, we propose a novel stochastic search strategy that fits the energy better than general approaches such as simulated annealing.
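A minimal sketch of cost-constrained ensemble search (toy numbers of our own; we assume independent voters so that majority-vote accuracy is a Poisson-binomial tail, and we use plain random subset search rather than the paper's stopping-rule strategy):

```python
import numpy as np

rng = np.random.default_rng(0)
acc = rng.uniform(0.6, 0.9, 10)              # individual voter accuracies
cost = rng.uniform(1.0, 3.0, 10)             # individual voter costs
budget = 6.0                                 # knapsack-style total cost limit

def majority_accuracy(idx):
    """Exact majority-vote accuracy of independent voters (Poisson binomial)."""
    probs = np.array([1.0])                  # distribution over #correct votes
    for p in acc[idx]:
        probs = np.convolve(probs, [1 - p, p])
    k = len(idx)
    return probs[k // 2 + 1 :].sum()         # strictly more than half correct

best_set, best_acc = None, 0.0
for _ in range(500):                         # simple stochastic subset search
    k = rng.choice([1, 3, 5])                # odd ensemble sizes avoid vote ties
    idx = rng.choice(10, size=k, replace=False)
    if cost[idx].sum() <= budget:
        a = majority_accuracy(idx)
        if a > best_acc:
            best_set, best_acc = idx, a
```

Note how the energy (majority accuracy) is nonseparable across members, which is exactly why a per-item dynamic program does not apply here.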
Information granules have been considered to be the fundamental constructs of Granular Computing (GrC). As a useful unsupervised learning technique, Fuzzy C-Means (FCM) is one of the most frequently used methods to construct information granules. The FCM-based granulation-degranulation mechanism plays a pivotal role in GrC. In this paper, to enhance the quality of the degranulation (reconstruction) process, we augment the FCM-based degranulation mechanism by introducing a vector of fuzzification factors (fuzzification factor vector) and setting up an adjustment mechanism to modify the prototypes and the partition matrix. The design is regarded as an optimization problem guided by a reconstruction criterion. In the proposed scheme, the initial partition matrix and prototypes are generated by FCM. A fuzzification factor vector then assigns an appropriate fuzzification factor to each cluster, forming an adjustment scheme that modifies the prototypes and the partition matrix. With the supervised learning mode of the granulation-degranulation process, we construct a composite objective function of the fuzzification factor vector, the prototypes, and the partition matrix. Subsequently, particle swarm optimization (PSO) is employed to optimize the fuzzification factor vector, refine the prototypes, and develop the optimal partition matrix. Finally, the reconstruction performance of the FCM algorithm is enhanced. We offer a thorough analysis of the developed scheme. In particular, we show that the classical FCM algorithm is a special case of the proposed scheme. Experiments completed on both synthetic and publicly available datasets show that the proposed approach outperforms the generic data reconstruction approach.
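A minimal sketch of the underlying granulation-degranulation loop (plain FCM with a single global fuzzifier m on toy data of our own; the paper's per-cluster fuzzification factor vector and PSO refinement are not included): cluster the data, then reconstruct each sample from its memberships and the prototypes, and measure the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
c, m, iters = 2, 2.0, 50                     # clusters, fuzzifier, FCM iterations

V = X[[0, -1]].copy()                        # prototype initialization
for _ in range(iters):
    # Granulation: squared distances -> membership (partition) matrix U.
    d2 = ((X[:, None, :] - V[None]) ** 2).sum(-1) + 1e-12
    inv = 1.0 / d2 ** (1 / (m - 1))
    U = inv / inv.sum(axis=1, keepdims=True)
    # Prototype update (weighted means).
    V = (U ** m).T @ X / (U ** m).sum(axis=0)[:, None]

# Degranulation: reconstruct each sample from its memberships and prototypes.
X_hat = (U ** m) @ V / (U ** m).sum(axis=1, keepdims=True)
recon_err = float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```

The reconstruction criterion that guides the paper's optimization is exactly this kind of error between X and its degranulated version X_hat.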
Local learning methods are a popular class of machine learning algorithms. The basic idea for the entire class is to choose some non-local model family, to train many such models on small sections of neighboring data, and then to "stitch" the resulting models together in some way. Due to the limits of constraining a training dataset to a small neighborhood, research on locally-learned models has largely been restricted to simple model families. Also, since simple model families have no complex structure by design, this has limited the use of the individual local models to predictive tasks. We hypothesize that, using a sufficiently complex local model family, various properties of the individual local models, such as their learned parameters, can be used as features for further learning. This dissertation improves upon the current state of research and works toward establishing this hypothesis by investigating algorithms for localization of more complex model families and by studying their applications beyond prediction as a feature extraction mechanism. We summarize this generic technique of using local models as a feature extraction step with the term "local model feature transformations." In this document, we extend the local modeling paradigm to Gaussian processes, orthogonal quadric models, and word embedding models, and extend the existing theory for localized linear classifiers. We then demonstrate applications of local model feature transformations to epileptic event classification from EEG readings, activity monitoring via chest accelerometry, 3D surface reconstruction, 3D point cloud segmentation, handwritten digit classification, and event detection from Twitter feeds.
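A minimal sketch of a local model feature transformation on a toy signal of our own (far simpler than the Gaussian-process and quadric families studied in the dissertation): fit a linear model on each local window and use the learned parameters, slope and intercept, as features for downstream learning:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
signal = np.sin(6 * t) + 0.05 * rng.standard_normal(100)

def local_linear_features(y, x, win=10):
    """Fit a linear model on each window; return its (slope, intercept) pair
    as the extracted feature vector for that window."""
    feats = []
    for i in range(0, len(y) - win + 1, win):
        slope, intercept = np.polyfit(x[i : i + win], y[i : i + win], deg=1)
        feats.append((slope, intercept))
    return np.array(feats)

F = local_linear_features(signal, t)   # one (slope, intercept) row per window
```

The rows of F could then feed any downstream classifier or detector, which is the "feature extraction step" the text describes.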