Uncertainty
From Credit Assignment to Entropy Regularization: Two New Algorithms for Neural Sequence Prediction
Dai, Zihang, Xie, Qizhe, Hovy, Eduard
In this work, we study the credit assignment problem in reward augmented maximum likelihood (RAML) learning, and establish a theoretical equivalence between the token-level counterpart of RAML and the entropy regularized reinforcement learning. Inspired by the connection, we propose two sequence prediction algorithms, one extending RAML with fine-grained credit assignment and the other improving Actor-Critic with a systematic entropy regularization. On two benchmark datasets, we show the proposed algorithms outperform RAML and Actor-Critic respectively, providing new alternatives to sequence prediction.
Learning Data Dependency with Communication Cost
Jang, Hyeryung, Song, HyungSeok, Yi, Yung
In this paper, we consider the problem of recovering a graph that represents the statistical data dependency among nodes for a set of data samples generated by nodes, which provides the basic structure to perform an inference task, such as MAP (maximum a posteriori). This problem is referred to as structure learning. When nodes are spatially separated in different locations, running an inference algorithm requires a non-negligible amount of message passing, incurring some communication cost. We inevitably have the trade-off between the accuracy of structure learning and the cost we need to pay to perform a given message-passing based inference task because the learnt edge structures of data dependency and physical connectivity graph are often highly different. In this paper, we formalize this trade-off in an optimization problem which outputs the data dependency graph that jointly considers learning accuracy and message-passing costs. We focus on a distributed MAP as the target inference task, and consider two different implementations, ASYNC-MAP and SYNC-MAP that have different message-passing mechanisms and thus different cost structures. In ASYNC- MAP, we propose a polynomial time learning algorithm that is optimal, motivated by the problem of finding a maximum weight spanning tree. In SYNC-MAP, we first prove that it is NP-hard and propose a greedy heuristic. For both implementations, we then quantify how the probability that the resulting data graphs from those learning algorithms differ from the ideal data graph decays as the number of data samples grows, using the large deviation principle, where the decaying rate is characterized by some topological structures of both original data dependency and physical connectivity graphs as well as the degree of the trade-off. We validate our theoretical findings through extensive simulations, which confirms that it has a good match.
Generalized Logical Operations among Conditional Events
Gilio, Angelo, Sanfilippo, Giuseppe
We generalize, by a progressive procedure, the notions of conjunction and disjunction of two conditional events to the case of $n$ conditional events. In our coherence-based approach, conjunctions and disjunctions are suitable conditional random quantities. We define the notion of negation, by verifying De Morgan's Laws. We also show that conjunction and disjunction satisfy the associative and commutative properties, and a monotonicity property. Then, we give some results on coherence of prevision assessments for some families of compounded conditionals; in particular we examine the Fr\'echet-Hoeffding bounds. Moreover, we study the reverse probabilistic inference from the conjunction $\mathcal{C}_{n+1}$ of $n+1$ conditional events to the family $\{\mathcal{C}_{n},E_{n+1}|H_{n+1}\}$. We consider the relation with the notion of quasi-conjunction and we examine in detail the coherence of the prevision assessments related with the conjunction of three conditional events. Based on conjunction, we also give a characterization of p-consistency and of p-entailment, with applications to several inference rules in probabilistic nonmonotonic reasoning. Finally, we examine some non p-valid inference rules; then, we illustrate by an example two methods which allow to suitably modify non p-valid inference rules in order to get inferences which are p-valid.
The Logical Essentials of Bayesian Reasoning
This chapter offers an accessible introduction to the channel-based approach to Bayesian probability theory. This framework rests on algebraic and logical foundations, inspired by the methodologies of programming language semantics. It offers a uniform, structured and expressive language for describing Bayesian phenomena in terms of familiar programming concepts, like channel, predicate transformation and state transformation. The introduction also covers inference in Bayesian networks, which will be modelled by a suitable calculus of string diagrams.
A Hybrid Q-Learning Sine-Cosine-based Strategy for Addressing the Combinatorial Test Suite Minimization Problem
Zamli, Kamal Z., Din, Fakhrud, Ahmed, Bestoun S., Bures, Miroslav
The sine-cosine algorithm (SCA) is a new population-based meta-heuristic algorithm. In addition to exploiting sine and cosine functions to perform local and global searches (hence the name sine-cosine), the SCA introduces several random and adaptive parameters to facilitate the search process. Although it shows promising results, the search process of the SCA is vulnerable to local minima/maxima due to the adoption of a fixed switch probability and the bounded magnitude of the sine and cosine functions (from -1 to 1). In this paper, we propose a new hybrid Q-learning sine-cosine- based strategy, called the Q-learning sine-cosine algorithm (QLSCA). Within the QLSCA, we eliminate the switching probability. Instead, we rely on the Q-learning algorithm (based on the penalty and reward mechanism) to dynamically identify the best operation during runtime. Additionally, we integrate two new operations (L\'evy flight motion and crossover) into the QLSCA to facilitate jumping out of local minima/maxima and enhance the solution diversity. To assess its performance, we adopt the QLSCA for the combinatorial test suite minimization problem. Experimental results reveal that the QLSCA is statistically superior with regard to test suite size reduction compared to recent state-of-the-art strategies, including the original SCA, the particle swarm test generator (PSTG), adaptive particle swarm optimization (APSO) and the cuckoo search strategy (CS) at the 95% confidence level. However, concerning the comparison with discrete particle swarm optimization (DPSO), there is no significant difference in performance at the 95% confidence level. On a positive note, the QLSCA statistically outperforms the DPSO in certain configurations at the 90% confidence level.
An adaptive self-organizing fuzzy logic controller in a serious game for motor impairment rehabilitation
Esfahlani, Shabnam Sadeghi, Cirstea, Silvia, Sanaei, Alireza, Wilson, George
Rehabilitation robotics combined with video game technology provides a means of assisting in the rehabilitation of patients with neuromuscular disorders by performing various facilitation movements. The current work presents ReHabGame, a serious game using a fusion of implemented technologies that can be easily used by patients and therapists to assess and enhance sensorimotor performance and also increase the activities in the daily lives of patients. The game allows a player to control avatar movements through a Kinect Xbox, Myo armband and rudder foot pedal, and involves a series of reach-grasp-collect tasks whose difficulty levels are learnt by a fuzzy interface. The orientation, angular velocity, head and spine tilts and other data generated by the player are monitored and saved, whilst the task completion is calculated by solving an inverse kinematics algorithm which orientates the upper limb joints of the avatar. The different values in upper body quantities of movement provide fuzzy input from which crisp output is determined and used to generate an appropriate subsequent rehabilitation game level. The system can thus provide personalised, autonomously-learnt rehabilitation programmes for patients with neuromuscular disorders with superior predictions to guide the development of improved clinical protocols compared to traditional theraputic activities.
Negative Log Likelihood Ratio Loss for Deep Neural Network Classification
Zhu, Donglai, Yao, Hengshuai, Jiang, Bei, Yu, Peng
Deep neural network (DNN) has achieved remarkable success in classification tasks such as image classification [1]. The network output can mimic the posterior probabilities of target classes for the input observation when the nonlinear activation function in the output layer is defined as a soft-max function [2]. The learning objective is to minimize the difference between the predicted distribution and the true datagenerating distribution. In information theory, the cross entropy between two probability distributions over a common event set of events measures the average number of bits needed to identify an event if coding follows a learned probability distribution rather than the true but unknow distribution [3]. Therefore, cross entropy is a reasonable loss function for the DNN-based classification. However, in practice the true data-generating probability distribution is unknown and replaced by the empirical probability distribution over a training set where each sample is drawn independently and identically distributed (i.i.d.) from the data space [4]. Under assumptions of uniform distributions of feature and label spaces, minimizing cross-entropy is equivalent to maximum likelihood, i.e., the learning problem aims to maximize likelihood of correct class for each of training samples [2]. Maximum likelihood is a generative training criterion by which the model learns the likelihood of correct class for the observation. The model makes predictions by using Bayes rules to calculate posterior probabilities of target classes for the observation and then select the most likely class.
High-dimensional Penalty Selection via Minimum Description Length Principle
Miyaguchi, Kohei, Yamanishi, Kenji
We tackle the problem of penalty selection of regularization on the basis of the minimum description length (MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood(LNML)-minimization approach is favorable, because LNML quantifies the goodness of regularized models with any forms of penalty functions in view of the minimum description length principle, and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: 1) the computation of the normalizing factor of LNML and 2) its minimization in high-dimensional spaces. In this paper, we present a novel regularization selection method (MDL-RS), in which a tight upper bound of LNML (uLNML) is minimized with local convergence guarantee. Our main contribution is the derivation of uLNML, which is a uniform-gap upper bound of LNML in an analytic expression. This solves the above challenges in an approximate manner because it allows us to accurately approximate LNML and then efficiently minimize it. The experimental results show that MDL-RS improves the generalization performance of regularized estimates specifically when the model has redundant parameters.
Weak Labeling for Crowd Learning
Beรฑaran-Muรฑoz, Iker, Hernรกndez-Gonzรกlez, Jerรณnimo, Pรฉrez, Aritz
Crowdsourcing has become very popular among the machine learning community as a way to obtain labels that allow a ground truth to be estimated for a given dataset. In most of the approaches that use crowdsourced labels, annotators are asked to provide, for each presented instance, a single class label. Such a request could be inefficient, that is, considering that the labelers may not be experts, that way to proceed could fail to take real advantage of the knowledge of the labelers. In this paper, the use of weak labeling for crowd learning is proposed, where the annotators may provide more than a single label per instance to try not to miss the real label. The main hypothesis is that, by allowing weak labeling, knowledge can be extracted from the labelers more efficiently by than in the standard crowd learning scenario. Empirical evidence which supports that hypothesis is presented.
Decentralized learning with budgeted network load using Gaussian copulas and classifier ensembles
Klein, John, Albardan, Mahmoud, Guedj, Benjamin, Colot, Olivier
We examine a network of learners which address the same classification task but must learn from different data sets. The learners can share a limited portion of their data sets so as to preserve the network load. We introduce DELCO (standing for Decentralized Ensemble Learning with COpulas), a new approach in which the shared data and the trained models are sent to a central machine that allows to build an ensemble of classifiers. The proposed method aggregates the base classifiers using a probabilistic model relying on Gaussian copulas. Experiments on logistic regressor ensembles demonstrate competing accuracy and increased robustness as compared to gold standard approaches. A companion python implementation can be downloaded at https://github.com/john-klein/DELCO