Chan, Jeffrey
Reconsidering Mutual Information Based Feature Selection: A Statistical Significance View
Vinh, Nguyen Xuan (The University of Melbourne) | Chan, Jeffrey (The University of Melbourne) | Bailey, James (The University of Melbourne)
Mutual information (MI) based approaches are a popular feature selection paradigm. Although the stated goal of MI-based feature selection is to identify a subset of features that share the highest mutual information with the class variable, most current MI-based techniques are greedy methods that make use of low dimensional MI quantities. The reason for using low dimensional approximation has been mostly attributed to the difficulty associated with estimating the high dimensional MI from limited samples. In this paper, we argue a different viewpoint that, given a very large amount of data, the high dimensional MI objective is still problematic to be employed as a meaningful optimization criterion, due to its overfitting nature: the MI almost always increases as more features are added, thus leading to a trivial solution which includes all features. We propose a novel approach to the MI-based feature selection problem, in which the overfitting phenomenon is controlled rigourously by means of a statistical test. We develop local and global optimization algorithms for this new feature selection model, and demonstrate its effectiveness in the applications of explaining variables and objects.
Reconsidering Mutual Information Based Feature Selection: A Statistical Significance View
Vinh, Nguyen Xuan (The University of Melbourne) | Chan, Jeffrey (The University of Melbourne) | Bailey, James (The University of Melbourne)
Mutual information (MI) based approaches are a popular feature selection paradigm. Although the stated goal of MI-based feature selection is to identify a subset of features that share the highest mutual information with the class variable, most current MI-based techniques are greedy methods that make use of low dimensional MI quantities. The reason for using low dimensional approximation has been mostly attributed to the difficulty associated with estimating the high dimensional MI from limited samples. In this paper, we argue a different viewpoint that, given a very large amount of data, the high dimensional MI objective is still problematic to be employed as a meaningful optimization criterion, due to its overfitting nature: the MI almost always increases as more features are added, thus leading to a trivial solution which includes all features. We propose a novel approach to the MI-based feature selection problem, in which the overfitting phenomenon is controlled rigourously by means of a statistical test. We develop local and global optimization algorithms for this new feature selection model, and demonstrate its effectiveness in the applications of explaining variables and objects.
Reconsidering Mutual Information Based Feature Selection: A Statistical Significance View
Vinh, Nguyen Xuan (The University of Melbourne) | Chan, Jeffrey (The University of Melbourne) | Bailey, James (The University of Melbourne)
Mutual information (MI) based approaches are a popular feature selection paradigm. Although the stated goal of MI-based feature selection is to identify a subset of features that share the highest mutual information with the class variable, most current MI-based techniques are greedy methods that make use of low dimensional MI quantities. The reason for using low dimensional approximation has been mostly attributed to the difficulty associated with estimating the high dimensional MI from limited samples. In this paper, we argue a different viewpoint that, given a very large amount of data, the high dimensional MI objective is still problematic to be employed as a meaningful optimization criterion, due to its overfitting nature: the MI almost always increases as more features are added, thus leading to a trivial solution which includes all features. We propose a novel approach to the MI-based feature selection problem, in which the overfitting phenomenon is controlled rigourously by means of a statistical test. We develop local and global optimization algorithms for this new feature selection model, and demonstrate its effectiveness in the applications of explaining variables and objects.
Mixed Membership Models for Exploring User Roles in Online Fora
White, Arthur J. (University College Dublin) | Chan, Jeffrey (University of Melbourne) | Hayes, Conor (National University Ireland Galway) | Murphy, Brendan (University College Dublin)
Discussion boards are a form of social media which allow users to discuss topics and exchange information in a complex manner, in a number of different settings. As the popularity of such message boards has increased, communities of users have emerged, and several prominent types of social role have been identified, such as Question Answerer, Celebrity, Discussion Person and Topic Initiator. Recent studies have noted the structural similarity of the egocentric network of users assigned the same role by qualitative criteria. In this paper a methodology is developed with which to cluster together users with similar ego-centric network structures. This is achieved using a mixed membership formulation which allows for the fact that different groups of users may have characteristics in common. The method is then applied to data taken from boards.ie, a medium sized message boards website. Prominent clusters of users are identified and discussed, and illustrative examples of user behaviour provided. The type of interaction, both locally and globally, taking place within forums is examined.
Reconstruction of Threaded Conversations in Online Discussion Forums
Aumayr, Erik (National University of Ireland, Galway) | Chan, Jeffrey (National University of Ireland, Galway) | Hayes, Conor (National University of Ireland, Galway)
Online discussion boards, or Internet forums, are a significant part of the Internet. People use Internet forums to post questions, provide advice and participate in discussions. These online conversations are represented as threads, and the conversation trees within these threads are important in understanding the behaviour of online users. Unfortunately, the reply structures of these threads are generally not publicly accessible or not maintained. Hence, in this paper, we introduce an efficient and simple approach to reconstruct the reply structure in threaded conversations. We contrast its accuracy against three baseline algorithms, and show that our algorithm can accurately recreate the in and out degree distributions of forum reply graphs built from the reconstructed reply structures.