Regression
Recurrent Transform Learning
Gupta, Megha, Majumdar, Angshul
The objective of this work is to improve the accuracy of building demand forecasting. This is a more challenging task than grid-level forecasting. For this purpose, we develop a new technique called recurrent transform learning (RTL). The first one (RTL) is unsupervised; it is used as a feature extraction tool whose output is fed into a regression model. Forecasting experiments have been carried out on three popular publicly available datasets. Both of our proposed techniques yield results superior to state-of-the-art methods such as long short-term memory networks, echo state networks and sparse coding regression. Index Terms: demand forecasting, dynamical model, load forecasting, transform learning. The importance of electrical load forecasting is well known. The issue has gained even more significance with the advent of smart grids, microgrids and smart buildings. An excellent review on this topic can be found in [1].
Learn classification algorithms using Python and scikit-learn
This tutorial is part of the Machine learning for developers learning path. In this tutorial, we describe the basics of solving a classification-based machine learning problem and give you a comparative study of some of the most popular current algorithms. In the open Notebook, click Run to run the cells one at a time. The rest of the tutorial follows the order of the Notebook. Classification applies when the feature to be predicted contains categories of values.
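As a minimal sketch of what such a classification workflow can look like in scikit-learn: the dataset (Iris), the model (logistic regression), and the split settings below are illustrative assumptions and are not taken from the tutorial's Notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset whose target is categorical (three flower species).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple baseline classifier; max_iter is raised so the solver converges.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict class labels for the held-out samples and score them.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for another estimator with the same `fit`/`predict` interface is one way to set up the kind of comparison the tutorial describes.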
Unsupervised Feature Selection based on Adaptive Similarity Learning and Subspace Clustering
Parsa, Mohsen Ghassemi, Zare, Hadi, Ghatee, Mehdi
Mohsen Ghassemi Parsa (a), Hadi Zare (a), Mehdi Ghatee (b). (a) Faculty of New Sciences and Technologies, University of Tehran, Iran; (b) Department of Computer Science, Amirkabir University of Technology, Iran. Abstract: Feature selection methods play an important role in the readability of data and in reducing the complexity of learning algorithms. In recent years, a variety of efforts have addressed feature selection from an unsupervised viewpoint, because labeling large datasets is laborious. In this paper, we propose a novel approach to unsupervised feature selection that starts from subspace clustering and preserves the similarities among the samples through representation learning of low-dimensional subspaces. A self-expressive model is employed to implicitly learn the cluster similarities in an adaptive manner. The proposed method not only maintains the sample similarities through subspace clustering, but also captures the discriminative information based on a regularized regression model. In line with the convergence analysis of the proposed method, the experimental results on benchmark datasets demonstrate the effectiveness of our approach compared with state-of-the-art methods.
Fenton-Wilkinson Order Statistics and German Tanks: A Case Study of an Orienteering Relay Race
Ordinal regression falls between discrete-valued classification and continuous-valued regression. Ordinal target variables can be associated with ranked random variables. These random variables are known as order statistics and they are closely related to ordinal regression. However, the challenge of using order statistics for ordinal regression prediction is finding a suitable parent distribution. In this work, we provide a case study of a real-world orienteering relay race by viewing it as a random process. For this process, we show that accurate order statistical ordinal regression predictions of final team rankings, or places, can be obtained by assuming a lognormal distribution of individual leg times. Moreover, we apply Fenton-Wilkinson approximations to intermediate changeover times alongside an estimator for the total number of teams as in the notorious German tank problem. The purpose of this work is, in part, to spark interest in studying the applicability of order statistics in ordinal regression problems.
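The two ingredients named in the abstract can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it uses the textbook minimum-variance unbiased estimator for the German tank problem and the standard moment-matching form of the Fenton-Wilkinson approximation for a sum of independent lognormal variables. The function names and the sample numbers are hypothetical; how the paper combines these pieces is not described in the abstract.

```python
import math


def german_tank_estimate(max_observed, sample_size):
    """Estimate a population size N from the largest of k distinct
    serial numbers drawn without replacement from 1..N: m * (1 + 1/k) - 1."""
    return max_observed * (1 + 1 / sample_size) - 1


def fenton_wilkinson(mus, sigmas):
    """Approximate a sum of independent lognormals LN(mu_i, sigma_i^2)
    by a single lognormal with the same mean and variance."""
    mean = sum(math.exp(m + s**2 / 2) for m, s in zip(mus, sigmas))
    var = sum(
        (math.exp(s**2) - 1) * math.exp(2 * m + s**2)
        for m, s in zip(mus, sigmas)
    )
    sigma_sq = math.log(1 + var / mean**2)
    mu = math.log(mean) - sigma_sq / 2
    return mu, math.sqrt(sigma_sq)


# Hypothetical example: highest observed number is 60 among 4 observations.
n_hat = german_tank_estimate(60, 4)

# Hypothetical example: three legs with identical lognormal parameters.
mu_sum, sigma_sum = fenton_wilkinson([3.5, 3.5, 3.5], [0.2, 0.2, 0.2])
print(n_hat, mu_sum, sigma_sum)
```

In a relay setting, a changeover time is a sum of leg times, which is why an approximation for sums of lognormals is useful when each leg is modeled as lognormal.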
Exact expressions for double descent and implicit regularization via surrogate random design
Dereziński, Michał, Liang, Feynman, Mahoney, Michael W.
Double descent refers to the phase transition that is exhibited by the generalization error of unregularized learning models when varying the ratio between the number of parameters and the number of training samples. The recent success of highly over-parameterized machine learning models such as deep neural networks has motivated a theoretical analysis of the double descent phenomenon in classical models such as linear regression which can also generalize well in the over-parameterized regime. We build on recent advances in Randomized Numerical Linear Algebra (RandNLA) to provide the first exact non-asymptotic expressions for double descent of the minimum norm linear estimator. Our approach involves constructing what we call a surrogate random design to replace the standard i.i.d. design of the training sample. This surrogate design admits exact expressions for the mean squared error of the estimator while preserving the key properties of the standard design. We also establish an exact implicit regularization result for over-parameterized training samples. In particular, we show that, for the surrogate design, the implicit bias of the unregularized minimum norm estimator precisely corresponds to solving a ridge-regularized least squares problem on the population distribution.
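For readers unfamiliar with the minimum norm linear estimator, the following NumPy sketch shows the standard finite-sample facts behind it: in the over-parameterized regime it interpolates the training data, and it is the limit of ridge regression as the penalty goes to zero. This does not reproduce the paper's surrogate design or its population-level ridge correspondence; the data, sizes, and the helper name `ridge` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50  # fewer samples than parameters (over-parameterized)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum norm estimator: among all w with X @ w == y, the one with the
# smallest Euclidean norm, given by the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y


def ridge(X, y, lam):
    """Ridge solution in its dual form: X^T (X X^T + lam I)^{-1} y."""
    n_samples = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n_samples), y)


# With a vanishing penalty, ridge approaches the minimum norm estimator.
w_ridge = ridge(X, y, 1e-8)

# Any other interpolating solution has a larger norm: add a null-space
# component of X to the minimum norm solution.
null_component = (np.eye(d) - np.linalg.pinv(X) @ X) @ rng.standard_normal(d)
w_other = w_min_norm + null_component

print(np.linalg.norm(w_min_norm), np.linalg.norm(w_other))
```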
PySpark for Data Science Workflows
Demonstrated experience in PySpark is one of the most desirable competencies that employers are looking for when building data science teams, because it enables these teams to own live data products. While I've previously blogged about PySpark, Parallelization, and UDFs, I wanted to provide a proper overview of this topic as a book chapter. I'm sharing this complete chapter, because I want to encourage the adoption of PySpark as a tool for data scientists. All code examples from this post are available here, and all prerequisites are covered in the sample chapters here. You might want to grab some snacks before diving in! Spark is a general-purpose computing framework that can scale to massive data volumes. It builds upon prior big data tools such as Hadoop and MapReduce, while providing significant improvements in the expressivity of the languages it supports. One of the core components of Spark is resilient distributed datasets (RDDs), which enable clusters of machines to perform workloads in a coordinated and fault-tolerant manner. In more recent versions of Spark, the DataFrame API provides an abstraction on top of RDDs that resembles the data frame structures in R and Pandas. PySpark is the Python interface to Spark, and it provides an API for working with large-scale datasets in a distributed computing environment. PySpark is an extremely valuable tool for data scientists, because it can streamline the process of translating prototype models into production-grade model workflows. At Zynga, our data science team owns a number of production-grade systems that provide useful signals to our game and marketing teams. By using PySpark, we've been able to reduce the amount of support we need from engineering teams to scale up models from concept to production.
Interpretability: Cracking open the black box – Part I
"Interpretability is the degree to which a human can understand the cause of a decision" – Miller, Tim [1]. Explainable AI (XAI) is a sub-field of AI which has been gaining ground in the recent past. And as a machine learning practitioner dealing with customers day in and day out, I can see why. I've been an analytics practitioner for more than 5 years and I swear, the hardest part of a machine learning project is not creating the perfect model which beats all the benchmarks. It's the part where you convince the customer why and how it works. Humans have always had a dichotomy when faced with the unknown.
Tropical Geometry and Piecewise-Linear Approximation of Curves and Surfaces on Weighted Lattices
Maragos, Petros, Theodosis, Emmanouil
Tropical Geometry and Mathematical Morphology share the same max-plus and min-plus semiring arithmetic and matrix algebra. In this chapter we summarize some of their main ideas and common (geometric and algebraic) structure, generalize and extend both of them using weighted lattices and a max-$\star$ algebra with an arbitrary binary operation $\star$ that distributes over max, and outline applications to geometry, machine learning, and optimization. Further, we generalize tropical geometrical objects using weighted lattices. Finally, we provide the optimal solution of max-$\star$ equations using morphological adjunctions that are projections on weighted lattices, and apply it to optimal piecewise-linear regression for fitting max-$\star$ tropical curves and surfaces to arbitrary data that constitute polygonal or polyhedral shape approximations. This also includes an efficient algorithm for solving the convex regression problem of data fitting with max-affine functions.
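To make the max-plus arithmetic and the max-affine function class concrete, here is a small NumPy sketch. It only evaluates these objects and does not implement the chapter's fitting algorithm or its max-$\star$ generalization; the function names and example values are illustrative assumptions.

```python
import numpy as np


def max_plus_matvec(A, x):
    """Max-plus matrix-vector product: result[i] = max_j (A[i, j] + x[j]).
    Addition plays the role of multiplication, max the role of summation."""
    return np.max(A + x[np.newaxis, :], axis=1)


def max_affine(slopes, intercepts, points):
    """Evaluate f(x) = max_k (slopes[k] . x + intercepts[k]) at each row of
    points. Such a function is convex and piecewise-linear."""
    return np.max(points @ slopes.T + intercepts, axis=-1)


A = np.array([[0.0, 1.0], [2.0, -1.0]])
x = np.array([3.0, 5.0])
y = max_plus_matvec(A, x)

# The absolute value |t| written as a max-affine function: max(-t, t).
slopes = np.array([[-1.0], [1.0]])
intercepts = np.array([0.0, 0.0])
points = np.array([[-2.0], [0.5]])
values = max_affine(slopes, intercepts, points)
print(y, values)
```

Fitting a max-affine function to data then amounts to choosing `slopes` and `intercepts` so that the resulting convex piecewise-linear surface approximates the data.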
Privacy-preserving data sharing via probabilistic modelling
Jälkö, Joonas, Lagerspetz, Eemil, Haukka, Jari, Tarkoma, Sasu, Kaski, Samuel, Honkela, Antti
Differential privacy allows quantifying privacy loss from computations on sensitive personal data. This loss grows with the number of accesses to the data, making it hard to open the use of such data while respecting privacy. To avoid this limitation, we propose privacy-preserving release of a synthetic version of a data set, which can be used for an unlimited number of analyses with any methods, without affecting the privacy guarantees. The synthetic data generation is based on differentially private learning of a generative probabilistic model which can capture the probability distribution of the original data. We demonstrate empirically that we can reliably reproduce statistical discoveries from the synthetic data. We expect the method to have broad use in sharing anonymized versions of key data sets for research.