Regression
Logistic Regression and Maximum Entropy explained with examples and code
Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even predicting sports outcomes. Reading all of this, the theory[1] of Maximum Entropy Classification might look difficult. In my experience, the average developer does not believe they can design a proper Maximum Entropy / Logistic Regression Classifier from scratch. I strongly disagree: not only is the mathematics behind it relatively simple, it can also be implemented with a few lines of code.
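To make the "few lines of code" claim concrete, here is a minimal sketch of a binary logistic regression classifier trained with batch gradient descent. It is not the article's own implementation: the function names (`fit_logistic`, `predict`), the learning rate, the iteration count and the synthetic two-cluster data are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent on the mean negative log-likelihood.

    Returns the weights as [bias, w_1, ..., w_d].
    lr and n_iter are illustrative defaults, not tuned values.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            # Prediction error for this sample: p(y=1 | x) - y.
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(x, w):
    # Classify as 1 when the estimated probability is at least 0.5.
    p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
    return 1 if p >= 0.5 else 0

# Hypothetical toy data: two Gaussian clusters centred at (-2, -2) and (2, 2).
random.seed(0)
X = [[random.gauss(-2, 1), random.gauss(-2, 1)] for _ in range(50)]
X += [[random.gauss(2, 1), random.gauss(2, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

w = fit_logistic(X, y)
accuracy = sum(predict(x, w) == t for x, t in zip(X, y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

The whole model is the weight list `w`; extending the sketch to the multi-class (Maximum Entropy) case would replace the sigmoid with a softmax over one weight vector per class.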
Ordinal regression - Wikipedia, the free encyclopedia
In statistics, ordinal regression (also called "ordinal classification") is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem in between (metric) regression and classification.[1] Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference (on a scale from, say, 1–5 for "very poor" through "excellent"), as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning.[2][a] Ordinal regression can be performed using a generalized linear model (GLM) that fits both a coefficient vector and a set of thresholds to a dataset.
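As an illustration of how a coefficient vector and a set of thresholds work together, the sketch below computes class probabilities under an ordered-logit link, where P(y <= k | x) = sigmoid(theta_k - w·x). The ordered logit is only one common choice of GLM link, and the coefficients, thresholds and input used here are hypothetical values, not fitted ones.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probs(x, w, thresholds):
    """Class probabilities for an ordered-logit model.

    thresholds must be increasing: theta_1 < ... < theta_{K-1}.
    The cumulative probability of level k is sigmoid(theta_k - w.x);
    per-level probabilities are differences of consecutive cumulatives.
    """
    score = sum(wj * xj for wj, xj in zip(w, x))
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

# Hypothetical model for a 5-level scale ("very poor" .. "excellent").
w = [0.8, -0.5]
thresholds = [-1.0, 0.0, 1.0, 2.0]

probs = ordinal_probs([1.5, 0.2], w, thresholds)
# Levels are numbered from 1; pick the most probable one.
predicted_level = 1 + max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_level)
```

Because a single score `w·x` is compared against ordered thresholds, the model respects the ordering of the levels, which a plain multi-class classifier would ignore.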
Scaling_synthesized_data
In particular, I checked out the k-Nearest Neighbors (k-NN) and logistic regression algorithms and saw how scaling numerical data strongly influenced the performance of the former but not that of the latter, as measured, for example, by accuracy (see Glossary below or previous articles for definitions of scaling, k-NN and other relevant terms). The real take-home message here was that preprocessing doesn't occur in a vacuum, that is, you can preprocess the heck out of your data but the proof is in the pudding: how well does your model then perform? Scaling numerical data (that is, multiplying all instances of a variable by a constant in order to change that variable's range) has two related purposes: i) if your measurements are in meters and mine are in miles, then, if we both scale our data, they end up being the same, and ii) if two variables have vastly different ranges, the one with the larger range may dominate your predictive model, even though it may be less important to your target variable than the variable with the smaller range. What we saw is that the problem identified in ii) occurs with k-NN, which explicitly looks at how close data are to one another, but not in logistic regression, which, when being trained, will shrink the relevant coefficient to account for the lack of scaling. As the data we used in the previous articles was real-world data, all we could see was how the models performed before and after scaling.
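The k-NN half of point ii) can be shown on a tiny hand-made example. In the sketch below, the first feature determines the label and the second is a wide-ranged nuisance variable; the four training points, the query and the use of min-max scaling (one of several possible scaling schemes) are assumptions made for illustration, not data from the article.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def nearest_label(train, query):
    """Label of the single nearest neighbour (1-NN) under Euclidean distance."""
    return min(train, key=lambda item: euclidean(item[0], query))[1]

def make_min_max_scaler(rows):
    """Build a function mapping each feature to [0, 1] using the ranges in rows."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    def scale(x):
        return [(v - lo) / (hi - lo) for v, lo, hi in zip(x, lows, highs)]

    return scale

# Hypothetical data: feature 0 (range ~0..1) drives the label,
# feature 1 (range ~100..900) is unrelated to it.
train = [
    ([0.1, 900.0], 0),
    ([0.2, 100.0], 0),
    ([0.9, 150.0], 1),
    ([0.8, 850.0], 1),
]
query = [0.85, 120.0]

# Unscaled: distances are dominated by the wide-ranged feature 1.
raw_prediction = nearest_label(train, query)

# Scaled: both features contribute comparably to the distance.
scale = make_min_max_scaler([x for x, _ in train])
scaled_train = [(scale(x), label) for x, label in train]
scaled_prediction = nearest_label(scaled_train, scale(query))

print(raw_prediction, scaled_prediction)  # prints "0 1"
```

Without scaling, the query's nearest neighbour is the class-0 point that happens to share a similar value of the wide-ranged feature; after scaling, the informative feature decides and the prediction flips to class 1.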
Bayesian Optimization for Hyperparameter Tuning - Arimo
Bayesian Optimization helped us find a hyperparameter configuration that is better than the one found by Random Search for a neural network on the San Francisco Crimes dataset. People who are familiar with Machine Learning might want to fast forward to Section 3 for details. The code to reproduce the experiments can be found here. Hyperparameter tuning may be one of the most tricky, yet interesting, topics in Machine Learning. For most Machine Learning practitioners, mastering the art of tuning hyperparameters requires not only a solid background in Machine Learning algorithms, but also extensive experience working with real-world datasets.
Learning the kernel matrix via predictive low-rank approximations
Efficient and accurate low-rank approximations of multiple data sources are essential in the era of big data. The scaling of kernel-based learning algorithms to large datasets is limited by the O(n^2) computation and storage complexity of the full kernel matrix, which is required by most of the recent kernel learning algorithms. We present the Mklaren algorithm to approximate multiple kernel matrices and learn a regression model, which is entirely based on geometrical concepts. The algorithm does not require access to full kernel matrices yet it accounts for the correlations between all kernels. It uses Incomplete Cholesky decomposition, where pivot selection is based on least-angle regression in the combined, low-dimensional feature space. The algorithm has linear complexity in the number of data points and kernels. When the explicit feature space induced by the kernel can be constructed, a mapping from the dual to the primal Ridge regression weights is used for model interpretation. The Mklaren algorithm was tested on eight standard regression datasets. It outperforms contemporary kernel matrix approximation approaches when learning with multiple kernels. It identifies relevant kernels, achieving higher explained variance than other multiple kernel learning methods for the same number of iterations. Test accuracy equivalent to that obtained with full kernel matrices was achieved at significantly lower approximation ranks. A difference in run times of two orders of magnitude was observed when either the number of samples or kernels exceeds 3000.
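For readers unfamiliar with Incomplete Cholesky decomposition, the sketch below shows the generic single-kernel version with greedy pivoting on the largest residual diagonal entry. This is deliberately not Mklaren's pivot rule (which the abstract says is based on least-angle regression across several kernels); the RBF kernel, its `gamma` value and the sample points are assumptions for illustration. The point it demonstrates is that only one kernel column per step is evaluated, so the full n x n matrix is never formed.

```python
import math

def rbf(a, b, gamma=0.5):
    # Gaussian (RBF) kernel; gamma is an illustrative choice.
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))

def incomplete_cholesky(points, kernel, rank, tol=1e-10):
    """Low-rank factor G (n x rank) with K approximately equal to G G^T.

    Greedy pivoting: at each step pick the point with the largest
    residual diagonal entry of K - G G^T.
    """
    n = len(points)
    diag = [kernel(p, p) for p in points]  # residual diagonal
    G = [[0.0] * rank for _ in range(n)]
    pivots = []
    for j in range(rank):
        i = max(range(n), key=diag.__getitem__)
        if diag[i] <= tol:  # remaining residual is negligible
            break
        pivots.append(i)
        root = math.sqrt(diag[i])
        for r in range(n):
            # Only column i of the kernel matrix is evaluated here.
            dot = sum(G[r][t] * G[i][t] for t in range(j))
            G[r][j] = (kernel(points[r], points[i]) - dot) / root
        for r in range(n):
            diag[r] -= G[r][j] ** 2
    return G, pivots

def max_abs_error(points, kernel, G):
    # Builds the full kernel matrix entry by entry; for checking only.
    n = len(points)
    return max(
        abs(kernel(points[r], points[s]) - sum(a * b for a, b in zip(G[r], G[s])))
        for r in range(n)
        for s in range(n)
    )

# Hypothetical 1-D inputs forming three loose clusters.
points = [[0.0], [0.1], [0.2], [2.0], [2.1], [4.0]]

G1, _ = incomplete_cholesky(points, rbf, rank=1)
G3, pivots = incomplete_cholesky(points, rbf, rank=3)
err1 = max_abs_error(points, rbf, G1)
err3 = max_abs_error(points, rbf, G3)
print(pivots, err1, err3)
```

Each step costs O(n * rank) kernel and arithmetic operations, so the total cost is linear in the number of data points for a fixed rank, which is the property the abstract relies on.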
Identification of refugee influx patterns in Greece via model-theoretic analysis of daily arrivals
The refugee crisis is perhaps the single most challenging problem for Europe today. Hundreds of thousands of people have already traveled across dangerous sea passages from Turkish shores to Greek islands, resulting in thousands of dead and missing, despite the best rescue efforts from both sides. One of the main reasons is the total lack of any early warning-alerting system, which could provide some preparation time for the prompt and effective deployment of resources at the hot zones. This work is such an attempt for a systemic analysis of the refugee influx in Greece, aiming at (a) the statistical and signal-level characterization of the smuggling networks and (b) the formulation and preliminary assessment of such models for predictive purposes, i.e., as the basis of such an early warning-alerting protocol. To our knowledge, this is the first-ever attempt to design such a system, since this refugee crisis itself and its geographical properties are unique (intense event handling, little or no warning). The analysis employs a wide range of statistical, signal-based and matrix factorization (decomposition) techniques, including linear & linear-cosine regression, spectral analysis, ARMA, SVD, Probabilistic PCA, ICA, K-SVD for Dictionary Learning, as well as fractal dimension analysis. It is established that the behavioral patterns of the smuggling networks closely match (as expected) the regular burst and pause periods of store-and-forward networks in digital communications. There are also major periodic trends in the range of 6.2-6.5 days and strong correlations in lags of four or more days, with distinct preference in the Sunday-Monday 48-hour time frame. These results show that such models can be used successfully for short-term forecasting of the influx intensity, producing an invaluable operational asset for planners, decision-makers and first-responders.
Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality
Braverman, Mark, Garg, Ankit, Ma, Tengyu, Nguyen, Huy L., Woodruff, David P.
We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the $m$ machines receives $n$ data points from a $d$-dimensional Gaussian distribution with unknown mean $\theta$ which is promised to be $k$-sparse. The machines communicate by message passing and aim to estimate the mean $\theta$. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed \textit{sparse linear regression} problem: to achieve the statistical minimax error, the total communication is at least $\Omega(\min\{n,d\}m)$, where $n$ is the number of observations that each machine receives and $d$ is the ambient dimension. These lower bounds improve upon [Sha14,SD'14] by allowing a multi-round iterative communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a \textit{distributed data processing inequality}, as a generalization of usual data processing inequalities, which might be of independent interest and useful for other problems.
Regression, Logistic Regression and Maximum Entropy part 2 (code examples) – Ahmet Taspinar
In the previous blog post we saw the theory and mathematics behind the Maximum Entropy and Logistic Regression Classifiers. Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even sports. Reading all of this, the theory[1] of Maximum Entropy Classification might look difficult. In my experience, the average developer does not believe they can design a proper Maximum Entropy / Logistic Regression Classifier from scratch.
Tweet Acts: A Speech Act Classifier for Twitter
Vosoughi, Soroush (Massachusetts Institute of Technology) | Roy, Deb (Massachusetts Institute of Technology)
Speech acts are a way to conceptualize speech as action. This holds true for communication on any platform, including social media platforms such as Twitter. In this paper, we explored speech act recognition on Twitter by treating it as a multi-class classification problem. We created a taxonomy of six speech acts for Twitter and proposed a set of semantic and syntactic features. We trained and tested a logistic regression classifier using a data set of manually labelled tweets. Our method achieved state-of-the-art performance with an average F1 score of more than 0.70. We also explored classifiers with three different granularities (Twitter-wide, type-specific and topic-specific) in order to find the right balance between generalization and overfitting for our task.