Data Mining
Reports of the AAAI 2009 Spring Symposia
Bao, Jie (Rensselaer Polytechnic Institute) | Bojars, Uldis (National University of Ireland) | Choudhury, Tanzeem (Dartmouth College) | Ding, Li (Rensselaer Polytechnic Institute) | Greaves, Mark (Vulcan Inc.) | Kapoor, Ashish (Microsoft Research) | Louchart, Sandy (Heriot-Watt University) | Mehta, Manish (Georgia Institute of Technology) | Nebel, Bernhard (Albert-Ludwigs University Freiburg) | Nirenburg, Sergei (University of Maryland Baltimore County) | Oates, Tim (University of Maryland Baltimore County) | Roberts, David L. (Georgia Institute of Technology) | Sanfilippo, Antonio (Pacific Northwest National Laboratory) | Stojanovic, Nenad (University of Karlsruhe) | Stubbs, Kristen (iRobot Corporation) | Thomaz, Andrea L. (Georgia Institute of Technology) | Tsui, Katherine (University of Massachusetts Lowell) | Woelfl, Stefan (Albert-Ludwigs University Freiburg)
The Association for the Advancement of Artificial Intelligence, in cooperation with Stanford University's Department of Computer Science, was pleased to present the 2009 Spring Symposium Series, held Monday through Wednesday, March 23–25, 2009, at Stanford University. The titles of the nine symposia were Agents that Learn from Human Teachers, Benchmarking of Qualitative Spatial and Temporal Reasoning Systems, Experimental Design for Real-World Systems, Human Behavior Modeling, Intelligent Event Processing, Intelligent Narrative Technologies II, Learning by Reading and Learning to Read, Social Semantic Web: Where Web 2.0 Meets Web 3.0, and Technosocial Predictive Analytics. The goal of the Agents that Learn from Human Teachers symposium was to investigate how to enable software and robotic agents to learn from real-time interaction with an everyday human partner. The aim of the Benchmarking of Qualitative Spatial and Temporal Reasoning Systems symposium was to initiate the development of a problem repository in the field of qualitative spatial and temporal reasoning and to identify a graded set of challenges for future mid-term and long-term research. The Experimental Design symposium discussed the challenges of evaluating AI systems. The Human Behavior Modeling symposium explored reasoning methods for understanding various aspects of human behavior, especially in the context of designing intelligent systems that interact with humans. The Intelligent Event Processing symposium discussed the need for more AI-based approaches in event processing and defined a research agenda for the field, coined intelligent complex event processing (iCEP). The Intelligent Narrative Technologies II symposium discussed innovations, progress, and novel techniques in the research domain. The Learning by Reading and Learning to Read symposium explored two aspects of making natural language texts semantically accessible to, and processable by, machines. The Social Semantic Web symposium focused on the real-world grand challenges in this area. Finally, the Technosocial Predictive Analytics symposium explored new methods for anticipatory analytical thinking that provide decision advantage through the integration of human and physical models.
An Immune Inspired Approach to Anomaly Detection
Twycross, Jamie, Aickelin, Uwe
The immune system provides a rich metaphor for computer security: anomaly detection that works in nature should work for machines. However, early artificial immune system approaches to computer security had only limited success. Arguably, this was because these artificial systems were based on too simplistic a view of the immune system. We present here a second-generation artificial immune system for process anomaly detection. It improves on earlier systems by employing multiple artificial cell types that process information. After presenting detailed information about how to build such second-generation systems, we find that communication between cell types is key to performance. Through realistic testing and validation we show that second-generation artificial immune systems are capable of anomaly detection beyond generic system policies. The paper concludes with a discussion and an outline of the next steps in this exciting area of computer security.
SparseCodePicking: feature extraction in mass spectrometry using sparse coding algorithms
Alexandrov, Theodore, Steinhorst, Klaus, Keszoecze, Oliver, Schiffler, Stefan
Mass spectrometry (MS) is an important technique for chemical profiling that produces, for a given sample, a high-dimensional histogram-like spectrum. A crucial step of MS data processing is peak picking, which selects the peaks carrying information about the high-concentration molecules of interest in an MS investigation. We present a new peak picking procedure based on a sparse coding algorithm. Given a set of spectra from different classes, i.e., with different peak positions and heights, the procedure extracts peaks by means of unsupervised learning. Instead of the $l_1$-regularization penalty term used in the original sparse coding algorithm, we propose an elastic-net penalty term for better regularization. The evaluation is done by means of simulation. We show that, over a large region of parameter settings, the proposed peak picking method based on sparse coding features outperforms a mean-spectrum-based method. Moreover, we demonstrate the procedure by applying it to two real-life datasets.
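To make the idea concrete, here is a minimal sketch (not the authors' SparseCodePicking implementation): a single spectrum is encoded against a fixed dictionary of Gaussian peak shapes with an elastic-net penalty, and atoms receiving large coefficients are treated as picked peaks. The atom width, penalty weights, and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

mz = np.linspace(0, 100, 1000)                      # m/z axis
centers = np.arange(5, 100, 5.0)                    # candidate peak positions
D = np.exp(-0.5 * ((mz[:, None] - centers[None, :]) / 0.4) ** 2)  # Gaussian atoms

# synthetic spectrum: two true peaks plus noise
spectrum = 3.0 * D[:, 4] + 1.5 * D[:, 12] + 0.05 * np.random.randn(mz.size)

# elastic-net coding: least-squares fit with combined l1/l2 penalties,
# non-negative coefficients so each atom contributes a positive peak
coder = ElasticNet(alpha=0.01, l1_ratio=0.7, positive=True, fit_intercept=False)
coder.fit(D, spectrum)

picked = centers[coder.coef_ > 0.1]                 # threshold is an assumption
print("picked peak positions:", picked)
```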
Statistical Decision Making for Authentication and Intrusion Detection
Dimitrakakis, Christos, Mitrokotsa, Aikaterini
User authentication and intrusion detection differ from standard classification problems in that while we have data generated from legitimate users, impostor or intrusion data is scarce or non-existent. We review existing techniques for dealing with this problem and propose a novel alternative based on a principled statistical decision-making view point. We examine the technique on a toy problem and validate it on complex real-world data from an RFID based access control system. The results indicate that it can significantly outperform the classical world model approach. The method could be more generally useful in other decision-making scenarios where there is a lack of adversary data.
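As a simplified illustration of deciding with only legitimate-user data (not the paper's specific model), one can fit a density to the legitimate samples and threshold the log-likelihood of a new observation; the feature dimensionality, mixture size, and threshold below are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
legit = rng.normal(loc=0.0, scale=1.0, size=(500, 3))   # stand-in feature vectors

# density model of legitimate behaviour only; no impostor data is needed
model = GaussianMixture(n_components=2, random_state=0).fit(legit)

def decide(x, threshold=-6.0):
    """Accept if the sample is likely enough under the legitimate-user model."""
    return "accept" if model.score_samples(x.reshape(1, -1))[0] > threshold else "reject"

print(decide(legit[0]))                     # a legitimate-looking sample
print(decide(np.array([8.0, -7.0, 9.0])))   # an implausible sample
```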
Kronecker Graphs: An Approach to Modeling Networks
Leskovec, Jure, Chakrabarti, Deepayan, Kleinberg, Jon, Faloutsos, Christos, Ghahramani, Zoubin
How can we model networks with a mathematically tractable model that allows for rigorous analysis of network properties? Networks exhibit a long list of surprising properties: heavy tails for the degree distribution, small diameters, and densification and shrinking diameters over time. Most present network models either fail to match several of these properties, are complicated to analyze mathematically, or both. In this paper we propose a generative model for networks that is both mathematically tractable and capable of generating networks with the above-mentioned properties. Our main idea is to use the Kronecker product to generate graphs that we refer to as "Kronecker graphs". First, we prove that Kronecker graphs naturally obey common network properties. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present KronFit, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, KronFit takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on large real and synthetic networks show that KronFit finds accurate parameters that closely mimic the properties of target networks. Once fitted, the model parameters can be used to gain insights into the network structure, and the resulting synthetic graphs can be used for null models, anonymization, extrapolations, and graph summarization.
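A minimal sketch of the generation side of the model (naive O(N^2) edge sampling for illustration only; the scalable generator and KronFit are not shown): take Kronecker powers of a small initiator probability matrix and sample each edge independently. The 2x2 initiator values and the number of Kronecker powers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
initiator = np.array([[0.9, 0.5],
                      [0.5, 0.2]])     # 2x2 initiator probability matrix

# repeated Kronecker product: P becomes a 1024 x 1024 edge-probability matrix
P = initiator.copy()
for _ in range(9):
    P = np.kron(P, initiator)

adj = rng.random(P.shape) < P          # each edge appears independently w.p. P[i, j]
degrees = adj.sum(axis=1)
print("nodes:", P.shape[0], "edges:", int(adj.sum()), "max degree:", int(degrees.max()))
```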
Node discovery problem for a social network
Methods to solve a node discovery problem for a social network are presented. Covert nodes are nodes that are not directly observable. They transmit influence and affect the resulting collaborative activities among the persons in a social network, but do not appear in the surveillance logs that record the participants of those activities. Discovering the covert nodes means identifying the suspicious logs in which the covert nodes would appear if they became overt. The performance of the methods is demonstrated with test data generated from computationally synthesized networks and from a real organization.
A Data-Mining Approach to 3D Realistic Render Setup Assistance
Morcillo, Carlos Gonzalez (University of Castilla-La Mancha) | Lopez, Lorenzo Manuel Lopez (University of Castilla-La Mancha) | Sanchez, Jose Jesus Castro (University of Castilla-La Mancha) | Moser, Bernhard (Software Competence Center GmbH)
Realistic rendering is the process of generating a 2D image from an abstract description of a 3D scene, aiming to achieve the quality of a photograph. The quality of the generated image depends on the accuracy with which the employed render method simulates the behaviour of light particles travelling through the scene. In current practice, it is up to the user to choose optimal settings of the input parameters for these methods in terms of both time efficiency and image quality. This is an iterative trial-and-error process, even for expert users. This paper describes a novel approach based on techniques from the fields of data mining and genetic computing to assist the user in the selection of render parameters. Experimental results are presented which show the benefits of this approach.
Explicit probabilistic models for databases and networks
Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a non-trivial null model for the data, against which the found patterns can be contrasted. The most influential null models proposed so far are defined in terms of invariants of the null distribution. Such null models can be used by computation-intensive randomization approaches to estimate the statistical significance of data mining results. Here, we introduce a methodology for constructing non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow for the natural incorporation of prior information. Furthermore, they satisfy a number of desirable properties of previously introduced randomization approaches. Lastly, they have the additional benefit that they can be represented explicitly. We argue that our approach can be used for a variety of data types. However, for concreteness, we demonstrate it here for databases and networks in particular.
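As an illustrative sketch of one concrete MaxEnt instance (an assumption for exposition, not the paper's derivation): for a binary database, constraining only the expected row and column sums yields independent Bernoulli cells with p_ij = sigmoid(r_i + c_j), and the multipliers r, c can be fit by simple gradient ascent on the likelihood. The step size and iteration count below are arbitrary choices.

```python
import numpy as np

def maxent_margins(data, iters=2000, lr=0.1):
    """Fit the MaxEnt model whose expected row/column sums match the data."""
    n, m = data.shape
    row_sums, col_sums = data.sum(axis=1), data.sum(axis=0)
    r, c = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(r[:, None] + c[None, :])))   # expected cell values
        r += lr * (row_sums - p.sum(axis=1)) / m                # match row margins
        c += lr * (col_sums - p.sum(axis=0)) / n                # match column margins
    return p

rng = np.random.default_rng(0)
data = (rng.random((20, 15)) < 0.3).astype(int)
p = maxent_margins(data)
print("max row-margin error:", np.abs(p.sum(axis=1) - data.sum(axis=1)).max())
```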
Multiple Hypothesis Testing in Pattern Discovery
Hanhijärvi, Sami, Puolamäki, Kai, Garriga, Gemma C.
The problem of multiple hypothesis testing arises when more than one hypothesis is tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, assessing the significance of all frequent itemsets of a single dataset simultaneously entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I errors). Our contribution in this paper is to extend the multiple hypothesis testing framework so that it can be used with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.
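For reference, here is a sketch of a classical baseline rather than the paper's randomization-based method: Holm's step-down procedure, which controls the FWER in the strong sense for an arbitrary collection of p-values.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):   # step-down threshold
            reject[i] = True
        else:
            break                            # stop at the first non-rejection
    return reject

print(holm([0.001, 0.04, 0.03, 0.20]))       # [True, False, False, False]
```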
Node discovery in a networked organization
In this paper, I present a method to solve a node discovery problem in a networked organization. Covert nodes are nodes that are not directly observable. They affect social interactions, but do not appear in the surveillance logs that record the participants of the social interactions. Discovering the covert nodes is defined as identifying the suspicious logs in which the covert nodes would appear if they became overt. A mathematical model is developed for the maximum likelihood estimation of the network behind the social interactions and for the identification of the suspicious logs. Precision, recall, and F measure characteristics are demonstrated with a dataset generated from a real organization and with computationally synthesized datasets. The performance approaches the theoretical limit for covert nodes in networks of any topology and size when the ratio of the number of observations to the number of possible communication patterns is large.
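To fix terminology for the reported evaluation (the likelihood model itself is not sketched here), precision, recall, and F measure over a set of logs flagged as suspicious can be computed as follows; the log identifiers are hypothetical.

```python
def precision_recall_f(flagged, truth):
    """Compare flagged suspicious logs against the ground-truth covert-node logs."""
    flagged, truth = set(flagged), set(truth)
    tp = len(flagged & truth)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    f = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f

# hypothetical log identifiers
print(precision_recall_f(flagged=[3, 7, 11, 19], truth=[3, 11, 20]))
```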