Markov Models
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
State-of-the-art sequence labeling systems traditionally require large amounts of task-specific knowledge in the form of handcrafted features and data pre-processing. In this paper, we introduce a novel neural network architecture that benefits from both word- and character-level representations automatically, by using a combination of bidirectional LSTM, CNN and CRF. Our system is truly end-to-end, requiring no feature engineering or data pre-processing, thus making it applicable to a wide range of sequence labeling tasks. We evaluate our system on two data sets for two sequence labeling tasks -- Penn Treebank WSJ corpus for part-of-speech (POS) tagging and CoNLL 2003 corpus for named entity recognition (NER). We obtain state-of-the-art performance on both datasets -- 97.55% accuracy for POS tagging and 91.21% F1 for NER.
1 Introduction
Linguistic sequence labeling, such as part-of-speech (POS) tagging and named entity recognition (NER), is one of the first stages in deep language understanding and its importance has been well recognized in the natural language processing community. Most traditional high-performance sequence labeling models are linear statistical models, including Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015), which rely heavily on handcrafted features and task-specific resources. For example, English POS taggers benefit from carefully designed word spelling features; orthographic features and external resources such as gazetteers are widely used in NER. However, such task-specific knowledge is costly to develop (Ma and Xia, 2014), making sequence labeling models difficult to adapt to new tasks or new domains. In the past few years, nonlinear neural networks that take distributed word representations, also known as word embeddings, as input have been broadly applied to NLP problems with great success.
Variational Tempering
Mandt, Stephan, McInerney, James, Abrol, Farhan, Ranganath, Rajesh, Blei, David
Variational inference (VI) combined with data subsampling enables approximate posterior inference over large data sets, but suffers from poor local optima. We first formulate a deterministic annealing approach for the generic class of conditionally conjugate exponential family models. This approach uses a decreasing temperature parameter which deterministically deforms the objective during the course of the optimization. A well-known drawback of this annealing approach is the choice of the cooling schedule. We therefore introduce variational tempering, a variational algorithm that introduces a temperature latent variable to the model. In contrast to related work in the Markov chain Monte Carlo literature, this algorithm results in adaptive annealing schedules. Lastly, we develop local variational tempering, which assigns a latent temperature to each data point; this allows for dynamic annealing that varies across data. Compared to traditional VI, all proposed approaches find improved predictive likelihoods on held-out data.
Let Me Hear Your Voice and I'll Tell You How You Feel
Creating mood sensing technology has become very popular in recent years. There is a wide range of companies trying to detect your emotions from what you write, the tone of your voice, or from the expressions on your face. All of these companies offer their technology online through cloud-based programming interfaces (APIs). As part of my offline emotion sensing hardware (Project Jammin), I have already built early prototypes of facial expression and speech content recognition for emotion detection. In this short article I describe the missing part, a voice tone analyzer.
Mastering Machine Learning With scikit-learn
If you are a software developer who wants to learn how machine learning models work and how to apply them effectively, this book is for you. Familiarity with machine learning fundamentals and Python will be helpful, but is not essential. This book examines machine learning models including logistic regression, decision trees, and support vector machines, and applies them to common problems such as categorizing documents and classifying images. It begins with the fundamentals of machine learning, introducing you to the supervised-unsupervised spectrum, the uses of training and test data, and evaluating models. You will learn how to use generalized linear models in regression problems, as well as solve problems with text and categorical features. You will be acquainted with the use of logistic regression, regularization, and the various loss functions that are used by generalized linear models.
Variational Bayesian Inference for Hidden Markov Models With Multivariate Gaussian Output Distributions
Gruhl, Christian, Sick, Bernhard
Hidden Markov Models (HMMs) are a standard technique in time series analysis or data mining. Given a (set of) time series sample data, they are typically trained by means of a special variant of an expectation maximization (EM) algorithm, the Baum-Welch algorithm. HMMs are used for gesture recognition, machine tool monitoring, or speech recognition, for instance. Second-order techniques are used to find values for parameters of probabilistic models from sample data. The parameters are regarded as random variables, and distributions are defined over these variables. The type of these second-order distributions depends on the type of the underlying probabilistic models. Typically, so-called conjugate distributions are chosen, e.g., a Gaussian-Wishart distribution for an underlying Gaussian for which mean and covariance matrix have to be determined. Second-order techniques have some advantages over conventional approaches, e.g.,
Particle Metropolis-adjusted Langevin algorithms
Nemeth, Christopher, Sherlock, Chris, Fearnhead, Paul
Markov chain Monte Carlo algorithms are a popular and well-studied methodology that can be used to draw samples from posterior distributions. Over the past few years these algorithms have been extended to tackle problems where the model likelihood is intractable (Beaumont, 2003). Andrieu and Roberts (2009) showed that within the Metropolis-Hastings algorithm, if the likelihood is replaced with an unbiased estimate, then the sampler still targets the correct stationary distribution. Andrieu et al. (2010) extended this work further to create a class of Markov chain algorithms that use sequential Monte Carlo methods, also known as particle filters. Current implementations of pseudo-marginal and particle Markov chain Monte Carlo use random-walk proposals to update the parameters (e.g., Golightly and Wilkinson, 2011; Knape and de Valpine, 2012) and shall be referred to herein as particle random-walk Metropolis algorithms. Random walk-based algorithms propose a new value from some symmetric density centred on the current value.
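To make the random-walk proposal concrete, here is a minimal Python sketch of a plain random-walk Metropolis sampler. It is not the particle variant discussed in the abstract: the log-density is evaluated exactly instead of being estimated with a particle filter. The standard normal target, the Gaussian proposal, and the names `log_target`, `random_walk_metropolis`, and `step` are all assumptions chosen for illustration.

```python
import math
import random


def log_target(x):
    # Assumed target for illustration: standard normal log-density,
    # up to an additive constant.
    return -0.5 * x * x


def random_walk_metropolis(n_samples, step=1.0, x0=0.0, seed=0):
    """Draw samples with a Gaussian random-walk proposal centred on the current value."""
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_samples):
        # Symmetric proposal: the proposal density cancels in the acceptance ratio.
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples


if __name__ == "__main__":
    draws, rate = random_walk_metropolis(5000)
    print("acceptance rate:", rate)
    print("sample mean:", sum(draws) / len(draws))
```

In a pseudo-marginal or particle version, `log_target` would be replaced by an unbiased estimate of the likelihood, and the estimate for the current state would be stored and reused instead of being recomputed.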
Structure Learning of Partitioned Markov Networks
Liu, Song, Suzuki, Taiji, Sugiyama, Masashi, Fukumizu, Kenji
We learn the structure of a Markov Network (MN) between two groups of random variables from joint observations. Since modelling and learning the full MN structure may be hard, learning the links between two groups directly may be a preferable option. We introduce a novel concept called the partitioned ratio whose factorization directly associates with the Markovian properties of random variables across two groups. A simple one-shot convex optimization procedure is proposed for learning the sparse factorizations of the partitioned ratio and it is theoretically guaranteed to recover the correct inter-group structure under mild conditions. The performance of the proposed method is experimentally compared with the state-of-the-art MN structure learning methods using ROC curves. Real applications on analyzing bipartisanship in the US Congress and pairwise DNA/time-series alignments are also reported.
Partition Functions from Rao-Blackwellized Tempered Sampling
Carlson, David, Stinson, Patrick, Pakman, Ari, Paninski, Liam
Partition functions of probability distributions are important quantities for model evaluation and comparisons. We present a new method to compute partition functions of complex and multimodal distributions. Such distributions are often sampled using simulated tempering, which augments the target space with an auxiliary inverse temperature variable. Our method exploits the multinomial probability law of the inverse temperatures, and provides estimates of the partition function in terms of a simple quotient of Rao-Blackwellized marginal inverse temperature probability estimates, which are updated while sampling. We show that the method has interesting connections with several alternative popular methods, and offers some significant advantages. In particular, we empirically find that the new method provides more accurate estimates than Annealed Importance Sampling when calculating partition functions of large Restricted Boltzmann Machines (RBM); moreover, the method is sufficiently accurate to track training and validation log-likelihoods during learning of RBMs, at minimal computational cost.
Making data science accessible - Markov Chains
A Markov chain is a random process with the property that the next state depends only on the current state. For example, if you repeatedly choose between red and blue, the process is Markovian if each choice depends at most on the choice made immediately before it, and not on any earlier choices (see diagram below). How can Markov Chains help us? To start with, we need to define some basic terminology. The changes of state within the system are called transitions, and the probabilities associated with the various state changes are called transition probabilities.
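The terminology above can be sketched in a few lines of Python. The two states follow the red/blue example, but the transition probabilities in `TRANSITIONS` are made up for illustration, and the helper names `next_state` and `simulate` are hypothetical.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next].
# Each row sums to 1. The numbers are illustrative only.
TRANSITIONS = {
    "red": {"red": 0.7, "blue": 0.3},
    "blue": {"red": 0.4, "blue": 0.6},
}


def next_state(current, rng):
    """Sample the next state using only the current state (the Markov property)."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Return the visited states, beginning with `start`, after `steps` transitions."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


if __name__ == "__main__":
    print(simulate("red", 10))
```

Note that `next_state` never looks at anything except `current`, which is exactly what makes the simulated process a Markov chain.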
Semiparametric energy-based probabilistic models
Probabilistic models can be defined by an energy function, where the probability of each state is proportional to the exponential of the state's negative energy. This paper considers a generalization of energy-based models in which the probability of a state is proportional to an arbitrary positive, strictly decreasing, and twice differentiable function of the state's energy. The precise shape of the nonlinear map from energies to unnormalized probabilities has to be learned from data together with the parameters of the energy function. As a case study we show that the above generalization of a fully visible Boltzmann machine yields an accurate model of neural activity of retinal ganglion cells. We attribute this success to the model's ability to easily capture distributions whose probabilities span a large dynamic range, a possible consequence of latent variables that globally couple the system. Similar features have recently been observed in many datasets, suggesting that our new method has wide applicability.
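As a small illustration of the first two sentences of this abstract, the following Python sketch turns a list of energies into probabilities, first with the standard exponential map and then with a different positive, strictly decreasing, smooth function. The function names and the particular choice of `softplus_decreasing` are assumptions for illustration. In the paper the shape of the map is learned from data, which this sketch does not attempt.

```python
import math


def normalize(weights):
    """Divide unnormalized probabilities by their sum (the partition function)."""
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_probs(energies):
    """Standard energy-based model: p(state) is proportional to exp(-energy)."""
    return normalize([math.exp(-e) for e in energies])


def softplus_decreasing(e):
    # Hypothetical example of a positive, strictly decreasing,
    # twice differentiable map from energy to unnormalized probability.
    return math.log1p(math.exp(-e))


def generalized_probs(energies, f):
    """Generalized model: p(state) is proportional to f(energy)."""
    return normalize([f(e) for e in energies])


if __name__ == "__main__":
    energies = [0.0, 1.0, 2.0]  # illustrative energies for three states
    print(boltzmann_probs(energies))
    print(generalized_probs(energies, softplus_decreasing))
```

In both cases lower-energy states receive higher probability. Only the rate at which probability falls off with energy changes.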