Accuracy
Preventive Leak Detection for High Pressure Gas Transmission Networks
Zhang, Rui (IBM, T.J. Watson Research Center) | Huang, Jefferson (Cornell University) | Kumar, Tarun (IBM, T.J. Watson Research Center)
Recent developments in SCADA (Supervisory Control and Data Acquisition) systems for physical infrastructure, such as high pressure gas pipeline systems and electric grids, have generated enormous amounts of time series data. This data brings great opportunities for advanced knowledge discovery and data mining methods to identify system failures faster and earlier than operation experts. This paper presents our effort in collaboration with a utility company to solve a grand challenge; namely, to use advanced data mining methods to detect leaks on a high pressure gas transmission system. Leak detection models with unsupervised learning tasks were developed analyzing billions of data records to identify leaks of different sizes and impacts, with very low false positive rates. In particular, our solution was able to identify small leaks leading to rupture events. The model also identified small leaks not identifiable with current detection systems. Such high-fidelity early identification enables operation personnel to take preventive measures against possible catastrophic events. We then formulate several generic detection methods with models derived from time series anomaly detection methods. We show that our leak detection models are superior to the SCADA alarm system, a mass balance model and other generic time series anomaly detection models in terms of both detection accuracy and computation time.
ATOL: A Framework for Automated Analysis and Categorization of the Darkweb Ecosystem
Ghosh, Shalini (SRI International) | Porras, Phillip (SRI International) | Yegneswaran, Vinod (SRI International) | Nitz, Ken (SRI International) | Das, Ariyam (University of California, Los Angeles)
We present a framework for automated analysis and categorization of .onion websites in the darkweb to facilitate analyst situational awareness of new content that emerges from this dynamic landscape. Over the last two years, our team has developed a large-scale darkweb crawling infrastructure called OnionCrawler that acquires new onion domains on a daily basis, and crawls and indexes millions of pages from these new and previously known .onion sites. It stores this data into a research repository designed to help better understand Tor’s hidden service ecosystem. The analysis component of our framework is called Automated Tool for Onion Labeling (ATOL), which introduces a two-stage thematic labeling strategy: (1) it learns descriptive and discriminative keywords for different categories, and (2) uses these terms to map onion site content to a set of thematic labels. We also present empirical results of ATOL and our ongoing experimentation with it, as we have gained experience applying it to the entirety of our darkweb repository, now over 70 million indexed pages. We find that ATOL can perform site-level thematic label assignment more accurately than keyword-based schemes developed by domain experts: we expand the analyst-provided keywords using an automatic keyword discovery algorithm, and get a 12% gain in accuracy by using a machine learning classification model. We also show how ATOL can discover categories on previously unlabeled onions and discuss applications of ATOL in supporting various analyses and investigations of the darkweb.
Collective Classification of Social Network Spam
Brophy, Jonathan (University of Oregon) | Lowd, Daniel (University of Oregon)
Unsolicited or unwanted messages are a byproduct of virtually every popular social media website. Spammers have become increasingly proficient at bypassing conventional spam filters, prompting a stronger effort to develop new methods that accurately detect spam while remaining robust against users who modify their behavior in order to avoid detection. This paper shows the usefulness of a relational model that works in conjunction with an independent model. First, an independent model is built using features that characterize individual comments and users, capturing the cases where spam is obvious. Second, a relational model is built, taking advantage of the interconnected nature of users and their comments. By feeding our initial predictions from the independent model into the relational model, we can start to propagate information about spammers and spam comments to jointly infer the labels of all spam comments at the same time. This allows us to capture the obfuscated spam comments missed by the independent model that are only found by looking at the relational structure of the social network. The results from our experiments demonstrate the viability of our method, and show that models utilizing the underlying structure of the social network are more effective at detecting spam than ones that do not.
Detection of Money Laundering Groups: Supervised Learning on Small Networks
Savage, David (RMIT University) | Wang, Qingmai (RMIT University) | Zhang, Xiuzhen (RMIT University) | Chou, Pauline (AUSTRAC) | Yu, Xinghuo (RMIT University)
Money laundering is a major global problem, enabling criminal organisations to hide their ill-gotten gains and to finance further operations. Prevention of money laundering is seen as a high priority by many governments; however, detection of money laundering without prior knowledge of predicate crimes remains a significant challenge. Previous detection systems have tended to focus on individuals, considering transaction histories and applying anomaly detection to identify suspicious behaviour. However, money laundering involves groups of collaborating individuals, and evidence of money laundering may only be apparent when the collective behaviour of these groups is considered. In this paper we describe a detection system that is capable of analysing group behaviour, using a combination of network analysis and supervised learning. This system is designed for real-world application and operates on networks consisting of millions of interacting parties. Evaluation of the system using real-world data indicates that suspicious activity is successfully detected. Importantly, the system exhibits a low rate of false positives, and is therefore suitable for use in a live intelligence environment.
Network-based methods for outcome prediction in the "sample space"
In this thesis we present the novel semi-supervised network-based algorithm P-Net, which is able to rank and classify patients with respect to a specific phenotype or clinical outcome under study. The distinctive characteristic of this method is that it builds a network of samples/patients, where the nodes represent the samples and the edges are functional or genetic relationships between individuals (e.g. similarity of expression profiles), to predict the phenotype under study. In other words, it constructs the network in the "sample space" and not in the "biomarker space" (where nodes represent biomolecules, e.g. genes or proteins, and edges represent functional or genetic relationships between nodes), as is usual in state-of-the-art methods. To assess the performance of P-Net, we apply it to three publicly available datasets from patients afflicted with a specific type of tumor (pancreatic cancer, melanoma and ovarian cancer), using the data and following the experimental set-up proposed in two recently published papers [Barter et al., 2014, Winter et al., 2012]. We show that network-based methods in the "sample space" can achieve results competitive with classical supervised inductive systems. Moreover, the graph representation of the samples can be easily visualized through networks and can be used to gain visual clues about the relationships between samples, taking into account the phenotype associated or predicted for each sample. To our knowledge this is one of the first works that proposes graph-based algorithms working in the "sample space" of the biomolecular profiles of the patients to predict their phenotype or outcome, thus contributing to a novel research line in the framework of Network Medicine.
How Are Precision and Recall Calculated?
Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong.
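The counting described above can be sketched in a few lines of Python. The 100 actual positives and 200 predictions come from the example; the number of correct picks (80) is a made-up assumption used only for illustration, and the function name is hypothetical.

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    """Return (precision, recall) from raw counts."""
    # Precision: what share of our picks were actually positive.
    precision = true_positives / predicted_positives
    # Recall: what share of the actual positives we managed to catch.
    recall = true_positives / actual_positives
    return precision, recall


# From the example: 100 positives among 10,000 cases, 200 cases picked.
ACTUAL_POSITIVES = 100
PREDICTED_POSITIVES = 200
# Assumption for illustration only: 80 of the 200 picks were right.
TRUE_POSITIVES = 80

precision, recall = precision_recall(
    TRUE_POSITIVES, PREDICTED_POSITIVES, ACTUAL_POSITIVES
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under that assumption, precision is 80/200 = 0.40 and recall is 80/100 = 0.80. Picking more cases tends to raise recall at the cost of precision.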
Recovering True Classifier Performance in Positive-Unlabeled Learning
Jain, Shantanu | White, Martha | Radivojac, Predrag
A common approach in positive-unlabeled learning is to train a classification model between labeled and unlabeled data. This strategy is in fact known to give an optimal classifier under mild conditions; however, it results in biased empirical estimates of the classifier performance. In this work, we show that the typically used performance measures such as the receiver operating characteristic curve, or the precision-recall curve obtained on such data can be corrected with the knowledge of class priors; i.e., the proportions of the positive and negative examples in the unlabeled data. We extend the results to a noisy setting where some of the examples labeled positive are in fact negative and show that the correction also requires the knowledge of the proportion of noisy examples in the labeled positives. Using state-of-the-art algorithms to estimate the positive class prior and the proportion of noise, we experimentally evaluate two correction approaches and demonstrate their efficacy on real-life data.
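To illustrate why class priors make such a correction possible, here is a minimal sketch for the noise-free case only. It is not the authors' implementation; the function names and the numbers are hypothetical. It assumes the unlabeled set is a mixture with a fraction alpha of positives, so the share of unlabeled examples predicted positive equals alpha * TPR + (1 - alpha) * FPR, which can be solved for the true FPR.

```python
def corrected_fpr(tpr, fpr_pu, alpha):
    """Recover the true false positive rate from a positive-unlabeled estimate.

    tpr    : fraction of labeled positives predicted positive
    fpr_pu : fraction of unlabeled examples predicted positive
    alpha  : assumed proportion of positives in the unlabeled data
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # fpr_pu = alpha * tpr + (1 - alpha) * fpr, solved for fpr.
    return (fpr_pu - alpha * tpr) / (1.0 - alpha)


def corrected_precision(tpr, fpr, alpha):
    """Precision on a population whose positive proportion is alpha."""
    predicted_positive = alpha * tpr + (1.0 - alpha) * fpr
    if predicted_positive == 0.0:
        raise ValueError("no examples are predicted positive")
    return alpha * tpr / predicted_positive


# Hypothetical numbers: 20% of the unlabeled data is positive, the classifier
# flags 80% of labeled positives and 24% of the unlabeled examples.
fpr = corrected_fpr(tpr=0.8, fpr_pu=0.24, alpha=0.2)
precision = corrected_precision(tpr=0.8, fpr=fpr, alpha=0.2)
print(f"corrected FPR={fpr:.3f} corrected precision={precision:.3f}")
```

Treating the unlabeled examples as negatives would report an FPR of 0.24 here, while the corrected value is lower because part of the flagged unlabeled examples are in fact positives. The noisy setting described in the abstract needs the additional noise proportion and is not covered by this sketch.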
Introduction to Machine Learning with Python
Machine learning has long powered many products we interact with daily–from "intelligent" assistants like Apple's Siri and Google Now, to recommendation engines like Amazon's that suggest new products to buy, to the ad ranking systems used by Google and Facebook. More recently, machine learning has entered the public consciousness because of advances in "deep learning"–these include AlphaGo's defeat of Go grandmaster Lee Sedol and impressive new products around image recognition and machine translation. In this series, we'll give an introduction to some powerful but generally applicable techniques in machine learning. These include deep learning but also more traditional methods that are often all the modern business needs. After reading the articles in the series, you should have the knowledge necessary to embark on concrete machine learning experiments in a variety of areas on your own.
WrestleMania 33 Matches: Predictions For The Card At WWE's Biggest 2017 PPV After The Royal Rumble
WrestleMania 33 is still more than two months away, but the pieces for the match card are falling into place. Following the Royal Rumble, there is a much better sense of which matches will headline WWE's biggest show of 2017. WrestleMania 32 featured 12 matches and six hours of wrestling, and the pay-per-view on April 2 in Orlando should be similar. Much of the card will be determined by what happens at Elimination Chamber on Feb. 12 and Fastlane on March 5. Rumors have been circulating for weeks regarding the potential WrestleMania 33 match card, and recent events indicate that some of the reports are correct. Plans are constantly changing in WWE, though it isn't hard to guess who certain wrestlers will face on the "grandest stage of them all."
When Algorithms Come for Our Children
Consider the tragedy of a child killed by neglect and abuse. Now consider the tragedy of a child taken from parents who would not have criminally abused her. Computer algorithms might soon help humans make such difficult decisions -- but only if we recognize the myriad ways in which they can go wrong. In countless cities across the nation, child welfare services make extremely tough calls every day. With limited resources and information, they must often rely on gut instinct in predicting who is most vulnerable.