Data Mining
An Investigation of Sensitivity on Bagging Predictors: An Empirical Approach
Liang, Guohua (University of Technology, Sydney)
As growing numbers of real-world applications involve imbalanced class distributions or unequal costs for misclassification errors in different classes, learning from imbalanced class distributions is considered one of the most challenging issues in data mining research. This study empirically investigates the sensitivity of bagging predictors with respect to 12 algorithms and 9 levels of class distribution on 14 imbalanced data sets, using statistical and graphical methods to address the important issue of understanding the effect of varying levels of class distribution on bagging predictors. The experimental results demonstrate that bagging NB and MLP are insensitive to various levels of imbalanced class distribution.
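To make the term concrete, the following is a minimal sketch of bagging (bootstrap resampling plus majority voting) on a small imbalanced data set. The one-dimensional threshold stump, the function names, and the toy data are illustrative assumptions; the stump is not one of the 12 algorithms evaluated in the study.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a 1-D threshold classifier (hypothetical base learner):
    predict 1 if x > threshold, else 0. Picks the threshold with fewest errors."""
    best = None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def bagging_fit(data, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data])
            for _ in range(n_estimators)]

def bagging_predict(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy imbalanced data: 18 negatives and 2 positives (a 9:1 class ratio).
data = [(i * 0.1, 0) for i in range(18)] + [(5.0, 1), (5.1, 1)]
model = bagging_fit(data)
```

Changing the ratio of negatives to positives in `data` and re-running the fit is the kind of variation whose effect on the ensemble the study measures.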
Rule Ensemble Learning Using Hierarchical Kernels in Structured Output Spaces
Nair, Naveen (IITB-Monash Research Academy, IIT Bombay, Monash University.) | Saha, Amrita (IIT Bombay) | Ramakrishnan, Ganesh (IIT Bombay) | Krishnaswamy, Shonali (Institute for Infocomm Research (I2R), Singapore)
The goal in Rule Ensemble Learning (REL) is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input features. It has been shown that rule ensembles for classification can be learnt optimally and efficiently using hierarchical kernel learning approaches that explore the exponentially large space of conjunctions by exploiting its hierarchical structure. The regularizer employed penalizes large features and thereby selects a small set of short features. In this paper, we generalize the rule ensemble learning using hierarchical kernels (RELHKL) framework to multi-class structured output spaces. We build on the StructSVM model for sequence prediction problems and employ a ρ-norm hierarchical regularizer for observation features and a conventional 2-norm regularizer for state transition features. The exponentially large feature space is searched using an active set algorithm, and the exponentially large set of constraints is handled using a cutting plane algorithm. The approach can be easily extended to other structured output problems. We perform experiments on activity recognition datasets, which are prone to noise, sparseness and skewness. We demonstrate that our approach outperforms other approaches.
Preface
Jannach, Dietmar (TU Dortmund)
The technical program of this workshop consists of presentations of recent, high-quality research contributions, which were selected by the workshop's international program committee in a peer review process. Five long papers and three short papers were accepted for presentation. The papers address a variety of topics in the context of personalization and recommender systems, such as new techniques for group recommendation; user modeling and recommendation on the social web; automated content analysis for personalization and recommendation; and mobile advertising.
Capturing Browsing Interests of Users into Web Usage Profiles
Kabir, Shaily (Concordia University) | Mudur, Sudhir P. (Concordia University) | Shiri, Nematollaah (Concordia University)
We present a new weighted session similarity measure to capture the browsing interests of users in web usage profiles discovered from web log data. We base our similarity measure on the reasonable assumption that when users spend longer times on pages or revisit pages in the same session, then very likely, such pages are of greater interest to the user. The proposed similarity measure combines structural similarity with session-wise page significance. The latter, representing the degree of user interest, is computed using frequency and duration of a page access. Web usage profiles are generated using this similarity measure by applying a fuzzy clustering algorithm to web log data. For evaluating the effectiveness of the proposed measure, we adapt two model-based collaborative filtering algorithms for recommending pages. Experimental results show considerable improvement in overall performance of recommender systems as compared to use of other existing similarity measures.
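As a rough illustration of the idea that page significance is derived from the frequency and duration of page accesses, the sketch below weights each page in a session and compares two sessions by the cosine of their weight vectors. The equal 0.5/0.5 weighting, the cosine comparison, and all function names are assumptions made for illustration; the abstract does not specify the authors' formula, and the structural-similarity component is omitted.

```python
from collections import defaultdict
from math import sqrt

def page_significance(session):
    """session: list of (page, seconds) visits.
    Returns {page: weight}, combining normalized visit count and dwell time.
    The 0.5/0.5 mix is an illustrative assumption."""
    if not session:
        return {}
    freq = defaultdict(int)
    dur = defaultdict(float)
    for page, seconds in session:
        freq[page] += 1
        dur[page] += seconds
    max_freq = max(freq.values())
    max_dur = max(dur.values()) or 1.0  # avoid division by zero
    return {p: 0.5 * freq[p] / max_freq + 0.5 * dur[p] / max_dur for p in freq}

def session_similarity(s1, s2):
    """Cosine similarity between the page-significance vectors of two sessions."""
    w1, w2 = page_significance(s1), page_significance(s2)
    dot = sum(w1[p] * w2[p] for p in w1.keys() & w2.keys())
    n1 = sqrt(sum(v * v for v in w1.values()))
    n2 = sqrt(sum(v * v for v in w2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Page "a" is revisited, so it receives the highest weight in this session.
session = [("a", 10), ("b", 30), ("a", 20)]
weights = page_significance(session)
```

A similarity of this kind could then serve as the proximity function for a clustering step over sessions.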
Incorporating Computational Sustainability into AI Education through a Freely-Available, Collectively-Composed Supplementary Lab Text
Fisher, Douglas H. (Vanderbilt University) | Dilkina, Bistra (Cornell University) | Eaton, Eric (Bryn Mawr College) | Gomes, Carla (Cornell University)
We introduce a laboratory text on environmental and societal sustainability applications that can be a supplemental resource for any undergraduate AI course. The lab text, entitled Artificial Intelligence for Computational Sustainability: A Lab Companion, is brand new and incomplete; freely available through Wikibooks; and open to community additions of projects, assignments, and explanatory material on AI for sustainability. The project adds to existing educational efforts of the computational sustainability community, encouraging the flow of knowledge from research to education and public outreach. Besides summarizing the laboratory book, this paper touches on its implications for integration of research and education, for communicating science to the public, and other broader impacts.
Ultrametric Model of Mind, II: Application to Text Content Analysis
In a companion paper, Murtagh (2012), we discussed how Matte Blanco's work linked the unrepressed unconscious (in the human) to symmetric logic and thought processes. We showed how ultrametric topology provides a most useful representational and computational framework for this. Now we look at the extent to which we can find ultrametricity in text. We use coherent and meaningful collections of nearly 1000 texts to show how we can measure inherent ultrametricity. On the basis of our findings we hypothesize that inherent ultrametricity is a basis for further exploring unconscious thought processes.
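One simple way to quantify how ultrametric a set of pairwise distances is: count the fraction of triangles whose two largest sides are equal, which is what the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)) requires. The sketch below is a simplified proxy under that assumption; the function name and tolerance are hypothetical, and it is not necessarily the measure used in the paper.

```python
from itertools import combinations

def ultrametric_fraction(dist, tol=1e-9):
    """dist: symmetric matrix (list of lists) of pairwise distances.
    Returns the fraction of triples whose two largest sides are equal
    within tol, i.e. triples that satisfy the strong triangle inequality."""
    n = len(dist)
    triples = list(combinations(range(n), 3))
    if not triples:
        return 1.0  # vacuously ultrametric with fewer than three points
    ok = 0
    for i, j, k in triples:
        a, b, c = sorted((dist[i][j], dist[j][k], dist[i][k]))
        if c - b <= tol:  # isosceles with small base
            ok += 1
    return ok / len(triples)

# Distances read off a tree: points 0 and 1 merge at height 1, then with 2 at height 2.
tree_like = [[0, 1, 2],
             [1, 0, 2],
             [2, 2, 0]]
# Three points on a line at positions 0, 1 and 3.
line_like = [[0, 1, 3],
             [1, 0, 2],
             [3, 2, 0]]
```

Here `tree_like` is fully ultrametric while `line_like` is not, so the two matrices sit at opposite ends of the scale.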
Comparative Study for Inference of Hidden Classes in Stochastic Block Models
Zhang, Pan, Krzakala, Florent, Reichardt, Jörg, Zdeborová, Lenka
Inference of hidden classes in the stochastic block model is a classical problem with important applications. The most commonly used methods for this problem involve naïve mean field approaches or heuristic spectral methods. Recently, belief propagation was proposed for this problem. In this contribution we perform a comparative study between the three methods on synthetically created networks. We show that belief propagation shows much better performance when compared to naïve mean field and spectral approaches. This applies to accuracy, computational efficiency and the tendency to overfit the data.
An Integrated, Conditional Model of Information Extraction and Coreference with Applications to Citation Matching
Wellner, Ben, McCallum, Andrew, Peng, Fuchun, Hay, Michael
Although information extraction and coreference resolution appear together in many applications, most current systems perform them as independent steps. This paper describes an approach to integrated inference for extraction and coreference based on conditionally-trained undirected graphical models. We discuss the advantages of conditional probability training, and of a coreference model structure based on graph partitioning. On a data set of research paper citations, we show significant reduction in error by using extraction uncertainty to improve coreference citation matching accuracy, and using coreference to improve the accuracy of the extracted fields.
Forecasting electricity consumption by aggregating specialized experts
Devaine, Marie, Gaillard, Pierre, Goude, Yannig, Stoltz, Gilles
We consider the setting of sequential prediction of arbitrary sequences based on specialized experts. We first provide a review of the relevant literature and present two theoretical contributions: a general analysis of the specialist aggregation rule of Freund et al. (1997) and an adaptation of fixed-share rules of Herbster and Warmuth (1998) in this setting. We then apply these rules to the sequential short-term (one-day-ahead) forecasting of electricity consumption; to do so, we consider two data sets, a Slovakian one and a French one, respectively concerned with hourly and half-hourly predictions. We follow a general methodology to perform the stated empirical studies and detail in particular tuning issues of the learning parameters. The introduced aggregation rules demonstrate an improved accuracy on the data sets at hand; the improvements lie in a reduced mean squared error but also in a more robust behavior with respect to large occasional errors.
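The setting can be illustrated with a minimal sketch of exponentially weighted aggregation over specialized experts, where only the experts active in a round contribute to the forecast and have their weights updated. The square loss, the learning rate `eta`, the function name, and the data layout are assumptions for illustration; this is one common form of such a rule, not necessarily the exact variant analyzed in the paper.

```python
from math import exp

def aggregate_specialists(predictions, outcomes, eta=0.1):
    """predictions: one dict per round, {expert_name: forecast}, listing only
    the experts active in that round (at least one per round is assumed).
    outcomes: observed values, one per round.
    Returns the aggregated forecast for each round."""
    cum_gain = {}  # cumulative (aggregate loss - expert loss) over active rounds
    forecasts = []
    for active, y in zip(predictions, outcomes):
        # Weights depend only on past performance during rounds of activity.
        weights = {j: exp(eta * cum_gain.get(j, 0.0)) for j in active}
        total = sum(weights.values())
        y_hat = sum(weights[j] * f for j, f in active.items()) / total
        forecasts.append(y_hat)
        agg_loss = (y_hat - y) ** 2
        for j, f in active.items():
            cum_gain[j] = cum_gain.get(j, 0.0) + agg_loss - (f - y) ** 2
    return forecasts

# Expert "a" is accurate in round 1, so it receives more weight in round 2.
rounds = [{"a": 1.0, "b": 3.0}, {"a": 1.0, "b": 3.0}, {"b": 3.0}]
forecasts = aggregate_specialists(rounds, [1.0, 1.0, 3.0])
```

In the last round only expert "b" is active, so the aggregated forecast is simply that expert's forecast.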
Relational Data Mining Through Extraction of Representative Exemplars
Blanchard, Frédéric, Herbin, Michel
With the growing interest in Network Analysis, Relational Data Mining is becoming a prominent domain of Data Mining. This paper addresses the problem of extracting representative elements from a relational dataset. After defining the notion of degree of representativeness, computed using the Borda aggregation procedure, we present the extraction of exemplars, which are the representative elements of the dataset. We use these concepts to build a network on the dataset. We present the main properties of these notions and propose two typical applications of our framework. The first application consists of summarizing and structuring a set of binary images, and the second of mining co-authorship relations in a research team.
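A minimal sketch of how a Borda-style aggregation could yield a degree of representativeness: each element acts as a voter that ranks all other elements by dissimilarity, and an element's score is the sum of the Borda points it collects. The dissimilarity-matrix input, the point scheme, and the function name are illustrative assumptions rather than the authors' exact definition.

```python
def borda_representativeness(dissim):
    """dissim: square matrix (list of lists) of pairwise dissimilarities.
    Each element ranks the others from closest to farthest; the closest
    receives n - 2 points and the farthest 0. Ties are broken by index.
    Returns the total Borda score of every element."""
    n = len(dissim)
    scores = [0] * n
    for voter in range(n):
        ranking = sorted((j for j in range(n) if j != voter),
                         key=lambda j: dissim[voter][j])
        for rank, j in enumerate(ranking):
            scores[j] += (n - 2) - rank
    return scores

# Four points on a line; the point at position 10 is an outlier.
positions = [0, 1, 3, 10]
dissim = [[abs(p - q) for q in positions] for p in positions]
scores = borda_representativeness(dissim)
exemplar = max(range(len(scores)), key=scores.__getitem__)
```

Under this scheme the element at position 1 collects the most points and would be picked as the exemplar, while the outlier collects none.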