Accuracy
A Popular Crime-Predicting Algorithms Performed Worse Than Mechanical Turks in One Study
The American criminal justice system couldn't get much less fair. Across the country, some 1.5 million people are locked up in state and federal prisons. More than 600,000 people, the vast majority of whom have yet to be convicted of a crime, sit behind bars in local jails. Black people make up 40 percent of those incarcerated, despite accounting for just 13 percent of the US population. With the size and cost of jails and prisons rising--not to mention the inherent injustice of the system--cities and states across the country have been lured by tech tools that promise to predict whether someone might commit a crime.
Top 100 Data science interview questions
Data science, also known as data-driven decision, is an interdisciplinery field about scientific methods, process and systems to extract knowledge from data in various forms, and take descision based on this knowledge. A data scientist should not only be evaluated only on his/her knowledge on mahine learning, but he/she should also have good expertise on statistics. I will try to start from very basics of data science and then slowly move to expert level. Supervised machine learning requires training labeled data. Unsupervised machine learning doesn't required labeled data.
McDiarmid Drift Detection Methods for Evolving Data Streams
Pesaranghader, Ali, Viktor, Herna, Paquet, Eric
Increasingly, Internet of Things (IoT) domains, such as sensor networks, smart cities, and social networks, generate vast amounts of data. Such data are not only unbounded and rapidly evolving. Rather, the content thereof dynamically evolves over time, often in unforeseen ways. These variations are due to so-called concept drifts, caused by changes in the underlying data generation mechanisms. In a classification setting, concept drift causes the previously learned models to become inaccurate, unsafe and even unusable. Accordingly, concept drifts need to be detected, and handled, as soon as possible. In medical applications and emergency response settings, for example, change in behaviours should be detected in near real-time, to avoid potential loss of life. To this end, we introduce the McDiarmid Drift Detection Method (MDDM), which utilizes McDiarmid's inequality in order to detect concept drift. The MDDM approach proceeds by sliding a window over prediction results, and associate window entries with weights. Higher weights are assigned to the most recent entries, in order to emphasize their importance. As instances are processed, the detection algorithm compares a weighted mean of elements inside the sliding window with the maximum weighted mean observed so far. A significant difference between the two weighted means, upper-bounded by the McDiarmid inequality, implies a concept drift. Our extensive experimentation against synthetic and real-world data streams show that our novel method outperforms the state-of-the-art. Specifically, MDDM yields shorter detection delays as well as lower false negative rates, while maintaining high classification accuracies.
Predicting Movie Genres Based on Plot Summaries
This project explores several Machine Learning methods to predict movie genres based on plot summaries. Naive Bayes, Word2Vec+XGBoost and Recurrent Neural Networks are used for text classification, while K-binary transformation, rank method and probabilistic classification with learned probability threshold are employed for the multi-label problem involved in the genre tagging task.Experiments with more than 250,000 movies show that employing the Gated Recurrent Units (GRU) neural networks for the probabilistic classification with learned probability threshold approach achieves the best result on the test set. The model attains a Jaccard Index of 50.0%, a F-score of 0.56, and a hit rate of 80.5%.
tau-FPL: Tolerance-Constrained Learning in Linear Time
Zhang, Ao, Li, Nan, Pu, Jian, Wang, Jun, Yan, Junchi, Zha, Hongyuan
Learning a classifier with control on the false-positive rate plays a critical role in many machine learning applications. Existing approaches either introduce prior knowledge dependent label cost or tune parameters based on traditional classifiers, which lack consistency in methodology because they do not strictly adhere to the false-positive rate constraint. In this paper, we propose a novel scoring-thresholding approach, tau-False Positive Learning (tau-FPL) to address this problem. We show the scoring problem which takes the false-positive rate tolerance into accounts can be efficiently solved in linear time, also an out-of-bootstrap thresholding method can transform the learned ranking function into a low false-positive classifier. Both theoretical analysis and experimental results show superior performance of the proposed tau-FPL over existing approaches.
Optimal Generalized Decision Trees via Integer Programming
Gunluk, Oktay, Kalagnanam, Jayant, Menickelly, Matt, Scheinberg, Katya
Decision trees have been a very popular class of predictive models for decades due to their interpretability and good performance on categorical features. However, they are not always robust and tend to overfit the data. Additionally, if allowed to grow large, they lose interpretability. In this paper, we present a novel mixed integer programming formulation to construct optimal decision trees of specified size. We take special structure of categorical features into account and allow combinatorial decisions (based on subsets of values of such a feature) at each node. We show that very good accuracy can be achieved with small trees using moderately-sized training sets. The optimization problems we solve are easily tractable with modern solvers.
How machine learning engineers can detect and debug algorithmic bias
Ben Lorica, O'Reilly's chief data scientist, has posted slides and notes from his talk at last December's Strata Data Conference in Singapore, "We need to build machine learning tools to augment machine learning engineers." Lorica describes a new job emerging in IT departments: "machine learning engineers," whose job is to adapt machine learning models for production environments. These new engineers run the risk of embedding algorithmic bias into their systems, which unfairly discriminate, create liability, and reduces the quality of the recommendations the systems produce. He presents a set of technical and procedural steps to take to minimize these risks, with links to the relevant papers and code. It's really required reading for anyone implementing a machine learning system in a production environment.
Introduction to Python Ensembles
Ensembles have rapidly become one of the hottest and most popular methods in applied machine learning. Virtually every winning Kaggle solution features them, and many data science pipelines have ensembles in them. Put simply, ensembles combine predictions from different models to generate a final prediction, and the more models we include the better it performs. Better still, because ensembles combine baseline predictions, they perform at least as well as the best baseline model. Ensembles give us a performance boost almost for free! An input array $X$ is fed through two preprocessing pipelines and then to a set of base learners $f {(i)}$. The ensemble combines all base learner predictions into a final prediction array $P$. In this post, we'll take you through the basics of ensembles -- what they are and why they work so well -- and provide a hands-on tutorial for building basic ensembles. To illustrate how ensembles work, we'll use a data set on U.S. political contributions.
MEBoost: Mixing Estimators with Boosting for Imbalanced Data Classification
Rayhan, Farshid, Ahmed, Sajid, Mahbub, Asif, Jani, Md. Rafsan, Shatabda, Swakkhar, Farid, Dewan Md., Rahman, Chowdhury Mofizur
Class imbalance problem has been a challenging research problem in the fields of machine learning and data mining as most real life datasets are imbalanced. Several existing machine learning algorithms try to maximize the accuracy classification by correctly identifying majority class samples while ignoring the minority class. However, the concept of the minority class instances usually represents a higher interest than the majority class. Recently, several cost sensitive methods, ensemble models and sampling techniques have been used in literature in order to classify imbalance datasets. In this paper, we propose MEBoost, a new boosting algorithm for imbalanced datasets. MEBoost mixes two different weak learners with boosting to improve the performance on imbalanced datasets. MEBoost is an alternative to the existing techniques such as SMOTEBoost, RUSBoost, Adaboost, etc. The performance of MEBoost has been evaluated on 12 benchmark imbalanced datasets with state of the art ensemble methods like SMOTEBoost, RUSBoost, Easy Ensemble, EUSBoost, DataBoost. Experimental results show significant improvement over the other methods and it can be concluded that MEBoost is an effective and promising algorithm to deal with imbalance datasets. The python version of the code is available here: https://github.com/farshidrayhanuiu/
We need to build machine learning tools to augment machine learning engineers
Check out the machine learning sessions at the Strata Data Conference in London, May 21-24, 2018. Hurry--best price ends February 23. In this post, I share slides and notes from a talk I gave in December 2017 at the Strata Data Conference in Singapore offering suggestions to companies that are actively deploying products infused with machine learning capabilities. Over the past few years, the data community has focused on infrastructure and platforms for data collection, including robust pipelines and highly scalable storage systems for analytics. According to a recent LinkedIn report, the top two emerging jobs are "machine learning engineer" and "data scientist."