Regression
The 10 Statistical Techniques Data Scientists Need to Master
Regardless of where you stand on the matter of Data Science sexiness, it's simply impossible to ignore the continuing importance of data, and our ability to analyze, organize, and contextualize it. Drawing on their vast stores of employment data and employee feedback, Glassdoor ranked Data Scientist #1 in their 25 Best Jobs in America list. So the role is here to stay, but unquestionably, the specifics of what a Data Scientist does will evolve. With technologies like Machine Learning becoming ever-more common place, and emerging fields like Deep Learning gaining significant traction amongst researchers and engineers -- and the companies that hire them -- Data Scientists continue to ride the crest of an incredible wave of innovation and technological progress. While having a strong coding ability is important, data science isn't all about software engineering (in fact, have a good familiarity with Python and you're good to go).
Machine Learning Algorithms: Which One to Choose for Your Problem - DZone AI
When I was beginning my journey in data science, I often faced the problem of choosing the most appropriate algorithm for my specific problem. If you're like me, when you open some article about machine learning algorithms, you see dozens of detailed descriptions. The paradox is that this doesn't make it easier to choose which one to use. In this article for Statsbot, I will try to explain basic concepts and give some intuition of using different kinds of machine learning algorithms for different tasks. At the end of the article, you'll find a structured overview of the main features of described algorithms.
Top 6 errors novice machine learning engineers make
In machine learning, there are many ways to build a product or solution and each way assumes something different. Many times, it's not obvious how to navigate and identify which assumptions are reasonable. People new to machine learning make mistakes, which in hindsight will often feel silly. I've created a list of the top mistakes that novice machine learning engineers make. Hopefully, you can learn from these common errors and create more robust solutions that bring real value.
Bitcoin Price Forecasting Using Model with Experts Opinions
One of the main goals in the Bitcoin analytics is price forecasting. There are many factors which influence the price dynamics. The most important factors are: the interaction between supply and demand, attractiveness for investors, financial and macroeconomics indicators, technical indicators such as difficulty, how many blocks were created recently, etc. A very important impact on the cryptocurrency price has trends in social networks and search engines. Using these factors, one can create a regression model with good fitting of bitcoin price on the historical data.
Contextual Regression: An Accurate and Conveniently Interpretable Nonlinear Model for Mining Discovery from Scientific Data
Machine learning algorithms such as linear regression, SVM and neural network have played an increasingly important role in the process of scientific discovery. However, none of them is both interpretable and accurate on nonlinear datasets. Here we present contextual regression, a method that joins these two desirable properties together using a hybrid architecture of neural network embedding and dot product layer. We demonstrate its high prediction accuracy and sensitivity through the task of predictive feature selection on a simulated dataset and the application of predicting open chromatin sites in the human genome. On the simulated data, our method achieved high fidelity recovery of feature contributions under random noise levels up to 200%. On the open chromatin dataset, the application of our method not only outperformed the state of the art method in terms of accuracy, but also unveiled two previously unfound open chromatin related histone marks. Our method can fill the blank of accurate and interpretable nonlinear modeling in scientific data mining tasks.
Regression Analysis: A Primer
Regression is arguably the workhorse of statistics. Despite its popularity, however, it may also be the most misunderstood. The answer might surprise you: There is no such thing as Regression. The Dependent Variable is something you want to predict or explain. In a Marketing Research context it might be Purchase Interest measured on a 0-10 rating scale.
Numeric Computation and Statistical Data Analysis on the Java Platform (Advanced Information and Knowledge Processing): Sergei V. Chekanov: 9783319285290: Amazon.com: Books
Numerical computation, knowledge discovery and statistical data analysis integrated with powerful 2D and 3D graphics for visualization are the key topics of this book. The Python code examples powered by the Java platform can easily be transformed to other programming languages, such as Java, Groovy, Ruby and BeanShell. This book equips the reader with a computational platform which, unlike other statistical programs, is not limited by a single programming language. The author focuses on practical programming aspects and covers a broad range of topics, from basic introduction to the Python language on the Java platform (Jython), to descriptive statistics, symbolic calculations, neural networks, non-linear regression analysis and many other data-mining topics. He discusses how to find regularities in real-world data, how to classify data, and how to process data for knowledge discoveries.
SGDLibrary: A MATLAB library for stochastic gradient descent algorithms
We consider the problem of finding the minimizer of a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ of the form $\min f(w) = \frac{1}{n}\sum_{i}f_i({w})$. This problem has been studied intensively in recent years in machine learning research field. One typical but promising approach for large-scale data is stochastic optimization algorithm. SGDLibrary is a flexible, extensible and efficient pure-Matlab library of a collection of stochastic optimization algorithms. The purpose of the library is to provide researchers and implementers a comprehensive evaluation environment of those algorithms on various machine learning problems.
Regularization via Mass Transportation
Shafieezadeh-Abadeh, Soroosh, Kuhn, Daniel, Esfahani, Peyman Mohajerin
The goal of regression and classification methods in supervised learning is to minimize the empirical risk, that is, the expectation of some loss function quantifying the prediction error under the empirical distribution. When facing scarce training data, overfitting is typically mitigated by adding regularization terms to the objective that penalize hypothesis complexity. In this paper we introduce new regularization techniques using ideas from distributionally robust optimization, and we give new probabilistic interpretations to existing techniques. Specifically, we propose to minimize the worst-case expected loss, where the worst case is taken over the ball of all (continuous or discrete) distributions that have a bounded transportation distance from the (discrete) empirical distribution. By choosing the radius of this ball judiciously, we can guarantee that the worst-case expected loss provides an upper confidence bound on the loss on test data, thus offering new generalization bounds. We prove that the resulting regularized learning problems are tractable and can be tractably kernelized for many popular loss functions. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.
Maximum Margin Interval Trees
Drouin, Alexandre, Hocking, Toby Dylan, Laviolette, François
Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.