Genre
Utility-Based Fraud Detection
Torgo, Luis (LIAAD - Inesc Porto LA) | Lopes, Elsa (LIAAD - Inesc Porto LA)
Fraud detection is a key activity with serious socio-economical impact. Inspection activities associated with this task are usually constrained by limited available resources. Data analysis methods can provide help in the task of deciding where to allocate these limited resources in order to optimise the outcome of the inspection activities. This paper presents a multi-strategy learning method to address the question of which cases to inspect first. The proposed methodology is based on the utility theory and provides a ranking ordered by decreasing expected outcome of inspecting the candidate cases. This outcome is a function not only of the probability of the case being fraudulent but also of the inspection costs and expected payoff if the case is confirmed as a fraud. The proposed methodology is general and can be useful on fraud detection activities with limited inspection resources. We experimentally evaluate our proposal on both an artificial domain and on a real world task.
Fast Anomaly Detection for Streaming Data
Tan, Swee Chuan (SIM University) | Ting, Kai Ming (Monash University) | Liu, Tony Fei (Monash University)
This paper introduces Streaming Half-Space-Trees (HS-Trees), a fast one-class anomaly detector for evolving data streams. It requires only normal data for training and works well when anomalous data are rare. The model features an ensemble of random HS-Trees, and the tree structure is constructed without any data. This makes the method highly efficient because it requires no model restructuring when adapting to evolving data streams. Our analysis shows that Streaming HS-Trees has constant amortised time complexity and constant memory requirement. When compared with a state-of-the-art method, our method performs favourably in terms of detection accuracy and runtime performance. Our experimental results also show that the detection performance of Streaming HS-Trees is not sensitive to its parameter settings.
Active Surveying: A Probabilistic Approach for Identifying Key Opinion Leaders
Sharara, Hossam (University of Maryland, College Park) | Getoor, Lise (University of Maryland, College Park) | Norton, Myra (Community Analytics, Baltimore)
Opinion leaders play an important role in influencing peopleโs beliefs, actions and behaviors. Although a number of methods have been proposed for identifying influentials using secondary sources of information, the use of primary sources, such as surveys, is still favored in many domains. In this work we present a new surveying method which combines secondary data with partial knowledge from primary sources to guide the information gathering process. We apply our proposed active surveying method to the problem of identifying key opinion leaders in the medical field, and show how we are able to accurately identify the opinion leaders while minimizing the amount of primary data required, which results in significant cost reduction in data acquisition without sacrificing its integrity.
Classification of Emerging Extreme Event Tracks in Multivariate Spatio-Temporal Physical Systems Using Dynamic Network Structures: Application to Hurricane Track Prediction
Sencan, Huseyin (North Carolina State University) | Chen, Zhengzhang (North Carolina State University) | Hendrix, William (Northwestern University) | Pansombut, Tatdow (North Carolina State University) | Semazzi, Frederick (North Carolina State University) | Choudhary, Alok (North Carolina State University) | Kumar, Vipin (University of Minnesota) | Melechko, Anatoli V. (North Carolina State University) | Samatova, Nagiza F. (Oak Ridge National Laboratory)
Understanding extreme events, such as hurricanes or forest fires, is of paramount importance because of their adverse impacts on human beings. Such events often propagate in space and time. Predicting-even a few days in advance-what locations will get affected by the event tracks could benefit our society in many ways. Arguably, simulations from โfirst principles,โ where underlying physics-based models are described by a system of equations, provide least reliable predictions for variables characterizing the dynamics of these extreme events. Data-driven model building has been recently emerging as a complementary approach that could learn the relationships between historically observed or simulated multiple, spatio-temporal ancillary variables and the dynamic behavior of extreme events of interest. While promising, the methodology for predictive learning from such complex data is still in its infancy. In this paper, we propose a dynamic networks-based methodology for in-advance prediction of the dynamic tracks of emerging extreme events. By associating a network model of the system with the known tracks, our method is capable of learning the recurrent network motifs that could be used as discriminatory signatures for the event's behavioral class. When applied to classifying the behavior of the hurricane tracks at their early formation stages in Western Africa region, our method is able to predict whether hurricane tracks will hit the land of the North Atlantic region at least 10-15 days lead lag time in advance with more than 90%ย accuracy using 10-fold cross-validation. To the best of our knowledge, no comparable methodology exists for solving this problem using data-driven models.
Discovering Deformable Motifs in Continuous Time Series Data
Saria, Suchi (Stanford University) | Duchi, Andrew (Stanford University) | Koller, Daphne (Stanford University)
Continuous time series data often comprise or contain repeated motifs โ patterns that have similar shape, and yet exhibit nontrivial variability. Identifying these motifs, even in the presence of variation, is an important subtask in both unsupervised knowledge discovery and constructing useful features for discriminative tasks. This paper addresses this task using a probabilistic framework that models generation of data as switching between a random walk state and states that generate motifs. A motif is generated from a continuous shape template that can undergo non-linear transformations such as temporal warping and additive noise. We propose an unsupervised algorithm that simultaneously discovers both the set of canonical shape templates and a template-specific model of variability manifested in the data. Experimental results on three real-world data sets demonstrate that our model is able to recover templates in data where repeated instances show large variability. The recovered templates provide higher classification accuracy and coverage when compared to those from alternatives such as random projection based methods and simpler generative models that do not model variability. Moreover, in analyzing physiological signals from infants in the ICU, we discover both known signatures as well as novel physiomarkers.
Domain Adaptation with Ensemble of Feature Groups
Samdani, Rajhans Yih (University of Illinois at Urbana-Champaign) | Yih, Wen-tau (Microsoft Research)
We present a novel approach for domain adaptation based on feature grouping and re-weighting. Our algorithm operates by creating an ensemble of multiple classifiers, where each classifier is trained on one particular feature group. Faced with the distribution change involved in domain change, different feature groups exhibit different cross-domain prediction abilities. Herein, ensemble models provide us the flexibility of tuning the weights of corresponding classifiers in order to adapt to the new domain. Our approach is supported by a solid theoretical analysis based on the expressiveness of ensemble classifiers, which allows trading-off errors across source and target domains. Moreover, experimental results on sentiment classification and spam detection show that our approach not only outperforms the baseline method, but is also superior to other state-of-the-art methods.
Biclustering-Driven Ensemble of Bayesian Belief Network Classifiers for Underdetermined Problems
Pansombut, Tatdow (North Carolina State University, Oak Ridge National Laboratory) | Hendrix, William (North Carolina State University, Oak Ridge National Laboratory) | Gao, Zekai J. (Zhejiang University) | Harrison, Brent E. (North Carolina State University, Oak Ridge National Laboratory) | Samatova, Nagiza F. (North Carolina State University, Oak Ridge National Laboratory)
In this paper, we present BENCH (BiclusteringdrivenENsemble of Classifiers), an algorithm toconstruct an ensemble of classifiers through concurrentfeature and data point selection guided byunsupervised knowledge obtained from biclustering.BENCH is designed for underdeterminedproblems. In our experiments, we use Bayesian BeliefNetwork (BBN) classifiers as base classifiers inthe ensemble; however, BENCH can be applied toother classification models as well. We show thatBENCH is able to increase prediction accuracy ofa single classifier and traditional ensemble of classifiersby up to 15% on three microarray datasetsusing various weighting schemes for combining individualpredictions in the ensemble.
Positive Unlabeled Learning for Time Series Classification
Nguyen, Minh Nhut (Institute for Infocomm Research, Singapore) | Li, Xiao-Li (Institute for Infocomm Research, Singapore) | Ng, See-Kiong (Institute for Infocomm Research, Singapore)
In many real-world applications of the time series classification problem, not only could the negative training instances be missing, the number of positive instances available for learning may also be rather limited. This has motivated the development of new classification algorithms that can learn from a small set P of labeled seed positive instances augmented with a set U of unlabeled instances (i.e. PU learning algorithms). However, existing PU learning algorithms for time series classification have less than satisfactory performance as they are unable to identify the class boundary between positive and negative instances accurately. In this paper, we propose a novel PU learning algorithm LCLC (Learning from Common Local Clusters) for time series classification. LCLC is designed to effectively identify the ground truthsโ positive and negative boundaries, resulting in more accurate classifiers than those constructed using existing methods. We have applied LCLC to classify time series data from different application domains; the experimental results demonstrate that LCLC outperforms existing methods significantly.
Combining Supervised and Unsupervised Models Via Unconstrained Probabilistic Embedding
Ma, Xudong (Chinese Academy of Sciences) | Luo, Ping (Hewlett Packard Labs China) | Zhuang, Fuzhen (Chinese Academy of Sciences) | He, Qing (Chinese Academy of Sciences) | Shi, Zhongzhi (Chinese Academy of Sciences) | Shen, Zhiyong (Hewlett Packard Labs China)
Ensemble learning with output from multiple supervised and unsupervised models aims to improvethe classification accuracy of supervised model ensembleby jointly considering the grouping results from unsupervised models. In this paper we cast this ensemble task as an unconstrained probabilistic embedding problem. Specifically, we assume both objects and classes/clusters have latent coordinates without constraints in a D -dimensional Euclidean space, and consider the mapping from the embedded space into the space of results from supervised and unsupervised models as a probabilistic generative process. The prediction of an objectis then determined by the distances between the objectand the classes in the embedded space. A solution of this embedding can be obtained using the quasi-Newton method, resulting in the objects and classes/clusters with high co-occurrence weights being embedded close. We demonstrate the benefits of this unconstrained embedding method by three real applications.
Ball Ranking Machines for Content-Based Multimedia Retrieval
Luo, Dijun (The University of Texas at Arlington) | Huang, Heng (The University of Texas at Arlington)
In this paper, we propose the new Ball Ranking Machines (BRMs) to address the supervised ranking problems. In previous work, supervised ranking methods have been successfully applied in various information retrieval tasks. Among these methodologies, the Ranking Support Vector Machines (Rank SVMs) are well investigated. However, one major fact limiting their applications is that Ranking SVMs need optimize a margin-based objective function over all possible document pairs within all queries on the training set. In consequence, Ranking SVMs need select a large number of support vectors among a huge number of support vector candidates. This paper introduces a new model of of Ranking SVMs and develops an efficient approximation algorithm, which decreases the training time and generates much fewer support vectors. Empirical studies on synthetic data and content-based image/video retrieval data show that our method is comparable to Ranking SVMs in accuracy, but use much fewer ranking support vectors and significantly less training time.