Performance Analysis
Further Exploration of the Dendritic Cell Algorithm: Antigen Multiplier and Time Windows
Gu, Feng, Greensmith, Julie, Aickelin, Uwe
As an immune-inspired algorithm, the Dendritic Cell Algorithm (DCA), produces promising performances in the field of anomaly detection. This paper presents the application of the DCA to a standard data set, the KDD 99 data set. The results of different implementation versions of the DXA, including the antigen multiplier and moving time windows are reported. The real-valued Negative Selection Algorithm (NSA) using constant-sized detectors and the C4.5 decision tree algorithm are used, to conduct a baseline comparison. The results suggest that the DCA is applicable to KDD 99 data set, and the antigen multiplier and moving time windows have the same effect on the DCA for this particular data set. The real-valued NSA with constant-sized detectors is not applicable to the data set, and the C4.5 decision tree algorithm provides a benchmark of the classification performance for this data set.
Security Analysis of Online Centroid Anomaly Detection
Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). It is to be expected in such cases that learning algorithms will have to deal with manipulated data aimed at hampering decision making. Although some previous work addressed the handling of malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution we analyze the performance of a particular method -- online centroid anomaly detection -- in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, analysis of its efficiency and constraints. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly under different conditions: bounded and unbounded percentage of traffic, and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker's gain) if external constraints are properly used. Our experimental evaluation carried out on real HTTP and exploit traces confirms the tightness of our theoretical bounds and practicality of our protection mechanisms.
Detecting Botnets Through Log Correlation
Al-Hammadi, Yousof, Aickelin, Uwe
Abstract-- Botnets, which consist of thousands of compromised machines, can cause significant threats to other systems by launching Distributed Denial of Service (DDoS) attacks, keylogging, and backdoors. In response to these threats, new effective techniques are needed to detect the presence of botnets. In this paper, we have used an interception technique to monitor Windows Application Programming Interface (API) functions calls made by communication applications and store these calls with their arguments in log files. Our algorithm detects botnets based on monitoring abnormal activity by correlating the changes in log file sizes from different hosts. Recently, an explosive growth of coordinated attacks has been noticed [1][6].
ICD 10 Based Medical Expert System Using Fuzzy Temporal Logic
The expert opinion is necessary in medical decision making, since there are wide variations in clinical practices. Moreover, the growing need to assess and improve quality of health care has brought to light the possibility of developing and implementing clinical practice guidelines based on expert opinions. Even though the colleague's opinion helps in accessing information about real cases which is another important source of information, an important goal to reach when dealing with real medical cases is to have simultaneous access to the expert's opinion about the same indications of the real case being treated. The increase of the information volume in each medical field, due to the emergence of new discoveries, treatments, medicines and technologies, leads to a frequent need of consulting medical literature and in particular specialized revues and journals. Certainly, due to the huge volume of this information, a classified, targeted, access is necessary. In the field of medicine, Imprecision and Uncertainty play a large role in the process of diagnosis of disease that has most frequently been the focus of these applications. With the increased volume of information available to physicians from new medical technologies, the process of classifying different sets of symptoms under a single name and determining the appropriate therapeutic actions become increasingly difficult.
Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction--where the probability of class membership increases monotonically with the MF's value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic "20 Newsgroups" data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data.
Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction--where the probability of class membership increases monotonically with the MF's value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic "20 Newsgroups" data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data.
Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
Yang, Shuang-hong, Zha, Hongyuan, Hu, Bao-gang
We propose Dirichlet-Bernoulli Alignment (DBA), a generative model for corpora in which each pattern (e.g., a document) contains a set of instances (e.g., paragraphs in the document) and belongs to multiple classes. By casting predefined classes as latent Dirichlet variables (i.e., instance level labels), and modeling the multi-label of each pattern as Bernoulli variables conditioned on the weighted empirical average of topic assignments, DBA automatically aligns the latent topics discovered from data to human-defined classes. DBA is useful for both pattern classification and instance disambiguation, which are tested on text classification and named entity disambiguation for web search queries respectively.
Bootstrapping from Game Tree Search
Veness, Joel, Silver, David, Blair, Alan, Uther, William
In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuels checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.
Overlaying classifiers: a practical approach for optimal ranking
Clémençcon, Stéphan J., Vayatis, Nicolas
ROC curves are one of the most widely used displays to evaluate performance of scoring functions. In the paper, we propose a statistical method for directly optimizing the ROC curve. The target is known to be the regression function up to an increasing transformation and this boils down to recovering the level sets of the latter. We propose to use classifiers obtained by empirical risk minimization of a weighted classification error and then to construct a scoring rule by overlaying these classifiers. We show the consistency and rate of convergence to the optimal ROC curve of this procedure in terms of supremum norm and also, as a byproduct of the analysis, we derive an empirical estimate of the optimal ROC curve.
AUC optimization and the two-sample problem
Vayatis, Nicolas, Depecker, Marine, Clémençcon, Stéphan J.
The purpose of the paper is to explore the connection between multivariate homogeneity testsand AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that,in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose atwo-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually rankedaccording to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework.We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method.