Accuracy
Innovated scalable efficient estimation in ultra-large Gaussian graphical models
Large-scale precision matrix estimation is of fundamental importance yet challenging in many contemporary applications for recovering Gaussian graphical models. In this paper, we suggest a new approach of innovated scalable efficient estimation (ISEE) for estimating large precision matrices. Motivated by the innovated transformation, we convert the original problem into that of large covariance matrix estimation. The suggested method combines the strengths of recent advances in high-dimensional sparse modeling and large covariance matrix estimation. Compared to existing approaches, our method is scalable and can deal with much larger precision matrices with simple tuning. Under mild regularity conditions, we establish that this procedure can recover the underlying graphical structure with significant probability and provide efficient estimation of link strengths. Both computational and theoretical advantages of the procedure are evidenced through simulation and real data examples.
The AI system that can detect 85% of cyber attacks, with a little human help
MIT scientists have built a hybrid human/artificial intelligence (AI) machine that they claim can learn how to detect 85% of cyber attacks, roughly three times better than previous benchmarks, while reducing false positive rates by a factor of 5. Nitesh Chawla, professor of computer science at Notre Dame University, said in a statement from MIT that the machine "has the potential to become a line of defense against attacks such as fraud, service abuse and account takeover." Researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the machine-learning startup PatternEx demonstrated the platform, called AI2, in a paper titled "AI2: Training a big data machine to defend". As the researchers describe the current state of the art, today's security systems are typically driven by either humans (so-called "analyst-driven solutions") or by machine. The problem with security systems based on fixed rules is that they miss attacks that don't match those rules. Machine-learning approaches, as the name suggests, rely on an adaptive process that can trigger annoying numbers of false positives.
A Selection of Giant Radio Sources from NVSS
Results of the application of pattern recognition techniques to the problem of identifying Giant Radio Sources (GRS) from the data in the NVSS catalog are presented and issues affecting the process are explored. Decision-tree pattern recognition software was applied to training set source pairs developed from known NVSS large angular size radio galaxies. The full training set consisted of 51,195 source pairs, 48 of which were known GRS for which each lobe was primarily represented by a single catalog component. The source pairs had a maximum separation of 20 arc minutes and a minimum component area of 1.87 square arc minutes at the 1.4 mJy level. The importance of comparing resulting probability distributions of the training and application sets for cases of unknown class ratio is demonstrated. The probability of correctly ranking a randomly selected (GRS, non-GRS) pair from the best of the tested classifiers was determined to be 97.8 +/- 1.5%. The best classifiers were applied to the over 870,000 candidate pairs from the entire catalog. Images of higher ranked sources were visually screened and a table of over sixteen hundred candidates, including morphological annotation, is presented. These systems include doubles and triples, Wide-Angle Tail (WAT) and Narrow-Angle Tail (NAT), S- or Z-shaped systems, and core-jets and resolved cores. While some resolved lobe systems are recovered with this technique, generally it is expected that such systems would require a different approach.
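The "probability of correctly ranking a randomly selected (GRS, non-GRS) pair" quoted above is the quantity usually called the area under the ROC curve (AUC). The following is a minimal sketch of that pairwise definition, not the authors' software; the function name and the scores are made up for illustration.

```python
from itertools import product


def pairwise_ranking_probability(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a correct ranking, which makes the result equal to
    the area under the ROC curve.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score from each class")
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores, not taken from the NVSS study.
grs_scores = [0.9, 0.8, 0.4]
non_grs_scores = [0.7, 0.3, 0.2, 0.1]
auc = pairwise_ranking_probability(grs_scores, non_grs_scores)
print(f"pairwise ranking probability: {auc:.3f}")
```

The double loop is quadratic in the number of scores, so for a training set of the size described above a rank-based computation would be the more practical choice.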
Introduction to Machine Learning with scikit-learn - Machine Learning Mastery
The scikit-learn library is one of the most popular platforms for everyday machine learning and data science. The reason is that it is built upon Python, a fully featured programming language. But how do you get started with machine learning with scikit-learn? Kevin Markham is a data science trainer who created a series of 9 videos that show you exactly how to get started in machine learning with scikit-learn. In this post you will discover this series of videos and exactly what is covered, step-by-step, to help you decide if the material will be useful to you.
Reelin' and ROCin', Receiver Operating Characteristic by David Lettier
Imagine standing by a murky stream. You notice objects floating past you. Pulling out your notebook, you write down for each object how confident you are that it is a fish (between 0.0 and 1.0). Not all are fish; some are pieces of trash. Downstream, your friend scoops up each object and writes down what it actually is.
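The two notebooks in this story, one of confidence scores and one of true identities, are all that is needed to trace a ROC curve. Here is a small sketch of that construction, assuming a simple threshold sweep; the function name and the sample notebooks are invented for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold.

    scores: confidence that each object is a fish (0.0 to 1.0).
    labels: True if the object really was a fish.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one fish and one non-fish")
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, highest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical notebooks: your confidences and your friend's findings.
confidences = [0.9, 0.8, 0.6, 0.4, 0.2]
is_fish = [True, False, True, True, False]
curve = roc_points(confidences, is_fish)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each point answers the question "if everything scored at or above this threshold were called a fish, how many fish would be caught and how much trash would be mislabeled?"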
Datacratic MLDB
The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, résumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs, etc. In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.
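One common way to map expected utility onto ROC space can be sketched as follows. This is the textbook formulation rather than necessarily the one used in the article: the four utility values, the prevalence and both operating points are hypothetical, and all function names are illustrative.

```python
def expected_utility(tpr, fpr, prevalence, u_tp, u_fn, u_fp, u_tn):
    """Expected utility per item of a classifier operating at (fpr, tpr)."""
    positives = tpr * u_tp + (1.0 - tpr) * u_fn
    negatives = fpr * u_fp + (1.0 - fpr) * u_tn
    return prevalence * positives + (1.0 - prevalence) * negatives


def indifference_slope(prevalence, u_tp, u_fn, u_fp, u_tn):
    """Slope dTPR/dFPR of the lines of equal expected utility in ROC space."""
    return ((1.0 - prevalence) * (u_tn - u_fp)) / (prevalence * (u_tp - u_fn))


# Hypothetical business factors: 10% of items are positive, a missed
# positive is expensive, and a false alarm is a mild nuisance.
factors = dict(prevalence=0.1, u_tp=10.0, u_fn=-50.0, u_fp=-5.0, u_tn=0.0)

eu_a = expected_utility(tpr=0.9, fpr=0.30, **factors)  # high-recall model
eu_b = expected_utility(tpr=0.7, fpr=0.05, **factors)  # low-false-alarm model
slope = indifference_slope(**factors)
print(f"EU(A)={eu_a:.3f}  EU(B)={eu_b:.3f}  indifference slope={slope:.2f}")
```

Two operating points joined by a line steeper than the indifference slope favor the upper-right point, and a shallower line favors the lower-left one, so the same pair of models can rank differently under different business factors.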
Energy Disaggregation for Real-Time Building Flexibility Detection
Mocanu, Elena, Nguyen, Phuong H., Gibescu, Madeleine
Energy is a limited resource which has to be managed wisely, taking into account both supply-demand matching and capacity constraints in the distribution grid. One aspect of smart energy management at the building level is the problem of real-time detection of available flexible demand. In this paper we propose the use of energy disaggregation techniques to perform this task. Firstly, we investigate the use of existing classification methods to perform energy disaggregation. A comparison is performed between four classifiers, namely Naive Bayes, k-Nearest Neighbors, Support Vector Machine and AdaBoost. Secondly, we propose the use of a Restricted Boltzmann Machine to automatically perform feature extraction. The extracted features are then used as inputs to the four classifiers and consequently shown to improve their accuracy. The efficiency of our approach is demonstrated on a real database consisting of detailed appliance-level measurements with high temporal resolution, which has been used for energy disaggregation in previous studies, namely the REDD. The results show robustness and good generalization capabilities to newly presented buildings with at least 96% accuracy.
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
Mozetic, Igor, Grcar, Miha, Smailovic, Jasmina
What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered.
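The abstract does not say which annotator agreement measures were applied. As an illustration of the general idea only, here is a sketch of Cohen's kappa, one widely used chance-corrected measure for two annotators; the label sequences are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their
    # own observed class frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentiment labels from two annotators on six tweets.
annotator_1 = ["neg", "neg", "neu", "pos", "pos", "neu"]
annotator_2 = ["neg", "neu", "neu", "pos", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa treats the classes as unordered. Given the abstract's finding that humans perceive negative, neutral and positive as ordered, a measure that penalizes a negative/positive disagreement more than a negative/neutral one may be the better fit.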