Accuracy
Insight: Data-driven Energy Industry Is Ripe For Growth
Deloitte's examination of the incentive to integrate sensing, communications and analytics technologies in the oil and gas industry a couple of years ago noted that "increased data capture and analysis can likely save millions of dollars by eliminating as many as half of a company's unplanned well outages and boosting crude output by as much as 10% over a two-year period."
A Deep Belief Network Based Machine Learning System for Risky Host Detection
Feng, Wangyan, Wu, Shuning, Li, Xiaodan, Kunkle, Kevin
To assure the cyber security of an enterprise, a SIEM (Security Information and Event Management) system is typically in place to normalize security events from different preventive technologies and flag alerts. Analysts in the security operation center (SOC) investigate the alerts to decide whether they are truly malicious. However, the number of alerts is generally overwhelming, with the majority being false positives, and exceeds the SOC's capacity to handle them all. There is a great need to reduce the false positive rate as much as possible. While most previous research focused on network intrusion detection, we focus on risk detection and propose an intelligent Deep Belief Network machine learning system. The system leverages alert information, various security logs and analysts' investigation results in a real enterprise environment to flag hosts that have a high likelihood of being compromised. Text mining and graph-based methods are used to generate targets and create features for machine learning. In the experiment, the Deep Belief Network is compared with other machine learning algorithms, including a multi-layer neural network, random forest, support vector machine and logistic regression. Results on real enterprise data indicate that the Deep Belief Network machine learning system performs better than the other algorithms for our problem and is six times more effective than the current rule-based system. We also implement the whole system, from data collection, label creation and feature engineering to host score generation, in a real enterprise production environment.
Objective evaluation metrics for automatic classification of EEG events
Ziyabari, Saeedeh, Shah, Vinit, Golmohammadi, Meysam, Obeid, Iyad, Picone, Joseph
The evaluation of machine learning algorithms in biomedical fields for applications involving sequential data lacks standardization. Common quantitative scalar evaluation metrics such as sensitivity and specificity can often be misleading depending on the requirements of the application. Evaluation metrics must ultimately reflect the needs of users yet be sufficiently sensitive to guide algorithm development. Feedback from critical care clinicians who use automated event detection software in clinical applications has been overwhelmingly emphatic that a low false alarm rate, typically measured in units of the number of errors per 24 hours, is the single most important criterion for user acceptance. Though using a single metric is not often as insightful as examining performance over a range of operating conditions, there is a need for a single scalar figure of merit. In this paper, we discuss the deficiencies of existing metrics for a seizure detection task and propose several new metrics that offer a more balanced view of performance. We demonstrate these metrics on a seizure detection task based on the TUH EEG Corpus. We show that two promising metrics are a measure based on a concept borrowed from the spoken term detection literature, Actual Term-Weighted Value, and a new metric, Time-Aligned Event Scoring (TAES), that accounts for the temporal alignment of the hypothesis to the reference annotation. We also demonstrate that state of the art technology based on deep learning, though impressive in its performance, still needs significant improvement before it will meet very strict user acceptance guidelines.
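The scalar metrics named in this abstract can be sketched in a few lines. The following is a minimal illustration, assuming event-level counts of true/false positives and negatives are already available; the function names and the example counts are hypothetical, and this is not the paper's ATWV or TAES scoring.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of reference events that were detected."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of non-events that were correctly left unflagged."""
    return true_negatives / (true_negatives + false_positives)


def false_alarms_per_24h(false_positives: int, recording_seconds: float) -> float:
    """False alarm count normalized to a 24-hour recording duration."""
    if recording_seconds <= 0:
        raise ValueError("recording duration must be positive")
    return false_positives * SECONDS_PER_DAY / recording_seconds


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not results from the paper.
    print(sensitivity(true_positives=8, false_negatives=2))
    print(specificity(true_negatives=90, false_positives=10))
    print(false_alarms_per_24h(false_positives=6, recording_seconds=12 * 60 * 60))
```

Normalizing false alarms by recording duration rather than by the number of negatives is what lets the figure be compared across recordings of different lengths, which is presumably why clinicians quote it per 24 hours.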
Extrapolating Expected Accuracies for Large Multi-Class Problems
Zheng, Charles, Achanta, Rakesh, Benjamini, Yuval
Many machine learning tasks are interested in recognizing or identifying an individual instance within a large set of possible candidates. These problems are usually modeled as multi-class classification problems, with a large and possibly complex label set. Leading examples include detecting the speaker from his voice patterns (Togneri and Pullella, 2011), identifying the author from her written text (Stamatatos et al., 2014), or labeling the object category from its image (Duygulu et al., 2002, Deng et al., 2010, Oquab et al., 2014). In all these examples, the algorithm observes an input x, and uses the classifier function h to guess the label y from a large label set S. There are multiple practical challenges in developing classifiers for large label sets. Collecting high quality training data is perhaps the main obstacle, as the costs scale with the number of classes. It can be affordable to first collect data for a small set of classes, even if the long-term goal is to generalize to a larger set. Furthermore, classifier development can be accelerated by training first on fewer classes, as each training cycle may require substantially fewer resources. Indeed, due to interest in how small-set performance generalizes to larger sets, such comparisons can be found in the literature (Oquab et al., 2014, Griffin et al., 2007). A natural question is: how does changing the size of the label set affect the classification accuracy?
More Data Mining with R Udemy
In data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response. For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%).
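The ratio described above can be written out directly. This is a minimal sketch; the function names are hypothetical, and the only figures used are the 20% and 5% response rates from the example in the text.

```python
def lift(target_response_rate: float, average_response_rate: float) -> float:
    """Response rate within the targeted segment divided by the population average."""
    if average_response_rate <= 0:
        raise ValueError("average response rate must be positive")
    return target_response_rate / average_response_rate


def lift_from_counts(segment_responders: int, segment_size: int,
                     total_responders: int, population_size: int) -> float:
    """Same ratio, computed from raw responder counts instead of rates."""
    return lift(segment_responders / segment_size,
                total_responders / population_size)


if __name__ == "__main__":
    # The example from the text: a 20% segment response against a 5% average.
    print(round(lift(0.20, 0.05), 2))
```

A lift of 1.0 means the segment responds no better than a random choice from the population; values above 1.0 indicate the model is concentrating responders in the segment.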
[P] Melanoma detection model (http://melanoma.modelderm.com) • r/MachineLearning
I made a melanoma diagnosis model named "Model Melanoma" based on a deep learning algorithm (http://melanoma.modelderm.com). ResNet152 and VGG19 were used as the CNN models, and around 300,000 images (179 classes) were used as the training dataset. For reference, the skin cancer detection model published in Nature showed 0.96 with 225 melanocytic (58 malignant, 167 benign) test images from the Edinburgh Dermofit library. The web-based test platform provides the miss rate, or false negative rate (1 - sensitivity), in the diagnosis of melanoma. In addition, we made a diagnosis model for 176 skin diseases (http://modelderm.com).
Data Science with Python: Exploratory Analysis with Movie-Ratings and Fraud Detection with Credit-Card Transactions
The following problems are taken from the projects / assignments in the edX course Python for Data Science and the Coursera course Applied Machine Learning in Python (UMich). The IMDB Movie Dataset (MovieLens 20M) is used for the analysis. This dataset contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users and was released in 4/2015. The goal is to understand the trend in average ratings for different movie genres over the years (from 1995 to 2015) and the correlation between the trends for different genres (8 different genres are considered: Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical).
A Real-Time Autonomous Highway Accident Detection Model Based on Big Data Processing and Computational Intelligence
Ozbayoglu, A. Murat, Kucukayan, Gokhan, Dogdu, Erdogan
Due to the increasing urban population and the growing number of motor vehicles, traffic congestion is becoming a major problem of the 21st century. One of the main reasons behind traffic congestion is accidents, which can result not only in casualties and losses for the participants, but also in wasted and lost time for the others who are stuck behind the wheel. Early detection of an accident can save lives and allow roads to be reopened more quickly, which decreases wasted time and resources and increases efficiency. In this study, we propose a preliminary real-time autonomous accident-detection system based on computational intelligence techniques. Istanbul City traffic-flow data for the year 2015 from various sensor locations are populated using big data processing methodologies. The extracted features are then fed into a nearest neighbor model, a regression tree, and a feed-forward neural network model. For the output, the possibility of an occurrence of an accident is predicted. The results indicate that even though the number of false alarms dominates the real accident cases, the system can still provide useful information that can be used for status verification and early reaction to possible accidents.
AWS raises machine learning expectations for cloud security
Turning on machine-learning-based cloud security tools like Amazon Web Services' (AWS) new GuardDuty and Macie offerings might be a no-brainer for AWS customers. Doing so raises the bar for attackers, but will not protect you from sophisticated adversaries, experts say. The AWS Macie service, announced in August, trains on the content of users' Amazon S3 buckets and alerts customers when it detects suspicious activity, with a focus on PCI, HIPAA, and GDPR compliance. AWS GuardDuty, a complementary offering announced at the end of November, uses machine learning to analyze AWS CloudTrail, VPC Flow Logs, and AWS DNS logs. Like Macie, GuardDuty focuses on anomaly detection to alert customers to suspicious activity.