 Regression


A Large-Scale Study of Programming Languages and Code Quality in GitHub

Communications of the ACM

What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (728 projects, 63 million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static versus dynamic typing and allowing versus disallowing type confusion on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. Most notably, it does appear that disallowing type confusion is modestly better than allowing it, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by process factors such as project size, team size, and commit size. However, we caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, for example, the preference of certain personality types for functional, static languages that disallow type confusion.

A variety of debates ensue during discussions of whether a given programming language is "the right tool for the job." While some of these debates may appear to be tinged with an almost religious fervor, most agree that programming language choice can impact both the coding process and the resulting artifact. Advocates of strong, static typing tend to believe that the static approach catches defects early; for them, an ounce of prevention is worth a pound of cure. Dynamic typing advocates argue, however, that conservative static type checking is wasteful of developer resources, and that it is better to rely on strong dynamic type checking to catch type errors as they arise. These debates, however, have largely been of the armchair variety, supported only by anecdotal evidence. This is perhaps not unreasonable; obtaining empirical evidence to support such claims is a challenging task given the number of other factors that influence software engineering outcomes, such as code quality, language properties, and usage domains.


Classification with Scikit-Learn

#artificialintelligence

For Python programmers, scikit-learn is one of the best libraries for building Machine Learning applications. It is ideal for beginners because it has a simple interface and is well documented, with many examples and tutorials. Besides supervised machine learning (classification and regression), it can also be used for clustering, dimensionality reduction, feature extraction and engineering, and pre-processing the data. The interface is consistent across all of these methods, so it is not only easy to use, but also easy to construct a large ensemble of classifiers/regression models and train them with the same commands. In this blog, let's look at how to build, train, evaluate and validate a classifier with scikit-learn, and in this way get familiar with the scikit-learn library.
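The build, train, evaluate and validate steps mentioned above might look roughly like the following. This is a minimal sketch, not the blog's own code: the choice of the bundled iris dataset, the `LogisticRegression` estimator, the 25% test split, and 5-fold cross-validation are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Assumed example data: the iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation (stratified by class).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Build and train: every scikit-learn estimator exposes fit() and predict().
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate on the held-out split.
test_accuracy = accuracy_score(y_test, clf.predict(X_test))

# Validate: 5-fold cross-validation on the full data with a fresh estimator.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(f"test accuracy: {test_accuracy:.3f}")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```

Because the interface is uniform, swapping `LogisticRegression` for another classifier should only require changing the line that constructs `clf`.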


On Stein's Identity and Near-Optimal Estimation in High-dimensional Index Models

arXiv.org Machine Learning

We consider estimating the parametric components of semi-parametric multiple index models in a high-dimensional non-Gaussian setting. Our estimators leverage the score function based second-order Stein's lemma and do not require the Gaussian or elliptical symmetry assumptions made in the literature. Moreover, to handle score functions and response variables that are heavy-tailed, our estimators are constructed via carefully thresholding their empirical counterparts. We show that our estimators achieve a near-optimal statistical rate of convergence in several settings. We supplement our theoretical results via simulation experiments that confirm the theory.


Building a Logistic Regression model from scratch

#artificialintelligence

Do you understand how logistic regression works? If your answer is yes, I have a challenge for you to solve. Here is an extremely simple logistic problem. Here is the catch: YOU CANNOT USE ANY PREDEFINED LOGISTIC FUNCTION! Here is a small survey which I did with professionals with 1-3 years of experience in the analytics industry (my sample size is 200).
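As a rough idea of what "from scratch" can mean here, the sketch below fits a one-feature logistic model by batch gradient descent on the log loss, using only the standard library. It is not the article's solution: the toy data (hours studied versus pass/fail), the learning rate, and the number of epochs are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: hours studied vs. pass (1) / fail (0).
# The classes overlap slightly, so the fitted weights stay finite.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 0.5 + b)   # predicted pass probability after 0.5 hours
p_high = sigmoid(w * 5.0 + b)  # predicted pass probability after 5 hours
print(f"w={w:.3f}, b={b:.3f}, p(0.5h)={p_low:.3f}, p(5h)={p_high:.3f}")
```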


Deep Learning Prerequisites: Linear Regression in Python

@machinelearnbot

This course teaches you about one popular technique used in machine learning, data science and statistics: linear regression. We cover the theory from the ground up: derivation of the solution, and applications to real-world problems. We show you how one might code their own linear regression module in Python. Linear regression is the simplest machine learning model you can learn, yet there is so much depth that you'll be returning to it for years to come. In the first section, I will show you how to use 1-D linear regression to prove that Moore's Law is true.
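To illustrate the kind of 1-D fit described above, here is a minimal sketch of closed-form least squares applied to exponential growth. The data are synthetic (a count that doubles exactly every two years), not real transistor counts, so the result says nothing about whether Moore's Law holds in practice; the function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Closed-form 1-D least squares: slope a and intercept b of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data (an assumption, not measurements): doubling every 2 years.
years = list(range(1970, 1991, 2))
counts = [2000 * 2 ** ((year - 1970) / 2) for year in years]

# Exponential growth is linear in log space, so regress log2(count) on year.
log_counts = [math.log2(c) for c in counts]
a, b = fit_line(years, log_counts)
doubling_time = 1.0 / a  # years per doubling
r2 = r_squared(years, log_counts, a, b)
print(f"doubling time: {doubling_time:.2f} years, R^2 = {r2:.4f}")
```

Taking the logarithm first is the key design choice: a straight line fitted to raw counts would be dominated by the last few points.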


Regression Models Coursera

@machinelearnbot

About this course: Linear models, as their name implies, relate an outcome to a set of predictors of interest using linear assumptions. Regression models, a subset of linear models, are the most important statistical analysis tool in a data scientist's toolkit. This course covers regression analysis, least squares and inference using regression models. Special cases of the regression model, ANOVA and ANCOVA, will be covered as well. Analysis of residuals and variability will be investigated.


Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (1): Feature engineering

@machinelearnbot

I will use {ordinal} clm() (and other cool R packages such as {text2vec} as well) here to develop a hybrid content-based, collaborative filtering, and (obviously) model-based approach to solve the recommendation problem on the MovieLens 100K dataset in R. All R code used in this project can be obtained from the respective GitHub repository; the chunks of code present in the body of the post illustrate the essential steps only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory of the Department of Computer Science and Engineering at the University of Minnesota. The first part of the study introduces the new approach and refers to the feature engineering steps that are performed by the OrdinalRecommenders_1.R script (found on GitHub). The second part, to be published soon, relies on the R code in OrdinalRecommenders_3.R and presents the model training, cross-validation, and analysis steps. The OrdinalRecommenders_2.R script encompasses some tireless for-looping in R (a bad habit indeed) across the dataset, only in order to place the information from the dataset in the format needed for the modeling phase.


Machine Learning for Data Analysis Coursera

@machinelearnbot

Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients are the ones most strongly associated with the response variable.
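The shrink-to-zero behaviour described above can be seen in a small sketch. It uses the penalized form of the lasso, minimizing (1/(2n))·||y − Xw||² + λ·||w||₁, which is equivalent to the constrained form for a suitable λ, and solves it by cyclic coordinate descent with soft-thresholding. The solver choice, the toy design matrix, and λ = 0.5 are assumptions for illustration, not taken from the course.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/(2n))*||y - Xw||^2 + lam*||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                zj += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (zj / n)
    return w


# Hypothetical toy data: three mutually orthogonal +/-1 features.
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, 1, 1, 1, -1, -1, -1, -1]
X = [list(row) for row in zip(x1, x2, x3)]

# The response depends strongly on x1, weakly on x2, and not at all on x3.
y = [3.0 * a + 0.1 * b for a, b in zip(x1, x2)]

w = lasso_cd(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("coefficients:", w)
print("selected feature indices:", selected)
```

With this data the weak and irrelevant predictors are dropped and the surviving coefficient is shrunk below its least-squares value of 3, which is the bias the lasso trades for a sparser model.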


Combining Lexical and Syntactic Features for Detecting Content-Dense Texts in News

Journal of Artificial Intelligence Research

Content-dense news reports important factual information about an event in a direct, succinct manner. Information-seeking applications such as information extraction, question answering and summarization normally assume all text they deal with is content-dense. Here we empirically test this assumption on news articles from the business, U.S. international relations, sports and science journalism domains. Our findings clearly indicate that about half of the news texts in our study are in fact not content-dense and motivate the development of a supervised content-density detector. We heuristically label a large training corpus for the task and train a two-layer classifying model based on lexical and unlexicalized syntactic features. On manually annotated data, we compare the performance of domain-specific classifiers, trained on data only from a given news domain, and a general classifier in which data from all four domains is pooled together. Our annotation and prediction experiments demonstrate that the concept of content density varies depending on the domain and that naive annotators provide judgements biased toward the stereotypical domain label. Domain-specific classifiers are more accurate for domains in which content-dense texts are typically fewer. Domain-independent classifiers better reproduce naive crowdsourced judgements. Classification prediction is high across all conditions, around 80%.


How to prevent adversarial attacks on AI systems

#artificialintelligence

Adversarial attacks are an increasingly worrisome threat to the performance of artificial intelligence applications. If an attacker can introduce nearly invisible alterations to image, video, speech, and other data for the purpose of fooling AI-powered classification tools, it will be difficult to trust this otherwise sophisticated technology to do its job effectively. Imagine how such attacks could undermine AI-powered autonomous vehicles' ability to recognize obstacles, content filters' effectiveness in blocking disturbing images, or access systems' ability to deter unauthorized entry. Some people argue that adversarial threats stem from "deep flaws" in the neural net technology that powers today's AI. After all, it's well-understood that many machine learning algorithms--even traditional logistic-regression classifiers--are vulnerable to adversarial attacks.
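The vulnerability of a plain logistic-regression classifier can be sketched with a gradient-sign perturbation (the idea behind the fast gradient sign method, which the article does not name; it is used here only as an illustration). The weights, input, and step size below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def gradient_sign_attack(w, x, y, eps):
    """Move every feature by eps in the direction that increases the log loss.

    For a logistic model the loss gradient w.r.t. x is (p - y) * w, so its sign
    is sign(w) when the true label is 0 and -sign(w) when it is 1.
    """
    direction = 1 if y == 0 else -1
    return [xi + eps * direction * sign(wi) for xi, wi in zip(x, w)]


# Hypothetical 1000-feature model with weights of alternating sign.
n = 1000
w = [0.5 if i % 2 == 0 else -0.5 for i in range(n)]
b = 0.0

# Hypothetical input with features on a 0..1 scale (like pixel intensities).
x = [0.51 if i % 2 == 0 else 0.49 for i in range(n)]

eps = 0.02  # each feature changes by at most 2% of its range
x_adv = gradient_sign_attack(w, x, y=1, eps=eps)

p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
print(f"clean p(class 1) = {p_clean:.4f}")
print(f"adversarial p(class 1) = {p_adv:.4f}")
print(f"largest per-feature change = {max_change:.3f}")
```

The per-feature change is small, but its effect on the logit adds up across all 1000 features, which is why high-dimensional linear models can be flipped by perturbations that look negligible feature by feature.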