Greene, Derek

Stability of Topic Modeling via Matrix Factorization

Machine Learning

Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. This corresponds to the concept of "instability" which has previously been studied in the context of $k$-means clustering. In many applications of topic modeling, this problem of instability is not considered and topic models are treated as being definitive, even though the results may change considerably if the initialization process is altered. In this paper we demonstrate the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization for topic modeling, we propose the use of ensemble learning strategies. Based on experiments performed on annotated text corpora, we show that a K-Fold ensemble strategy, combining both ensembles and structured initialization, can significantly reduce instability, while simultaneously yielding more accurate topic models.
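The notion of instability described above can be made concrete with a small sketch. The abstract does not define its stability measures, so the following is only an illustrative assumption: each run of a topic model is represented by the top terms of its topics, two runs are compared by the mean Jaccard similarity under the best one-to-one matching of topics, and stability is the average over all pairs of runs. The function names `jaccard`, `run_agreement`, and `stability` are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_agreement(run_a, run_b):
    """Mean Jaccard similarity under the best one-to-one topic matching.

    Each run is a list of topics; each topic is a list of its top terms.
    The matching is found by brute force over permutations, so this is
    only practical for a small number of topics.
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must have the same number of topics")
    k = len(run_a)
    if k == 0:
        return 1.0
    best = 0.0
    for perm in permutations(range(k)):
        total = sum(jaccard(run_a[i], run_b[j]) for i, j in enumerate(perm))
        best = max(best, total / k)
    return best


def stability(runs):
    """Average pairwise agreement over all pairs of runs (assumed measure)."""
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(run_agreement(runs[i], runs[j]) for i, j in pairs) / len(pairs)
```

A score of 1.0 means every run produced the same topics up to reordering; lower values indicate that changing the initialization changed the topics.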

Aggregating Content and Network Information to Curate Twitter User Lists

Artificial Intelligence

Twitter introduced user lists in late 2009, allowing users to be grouped according to meaningful topics or themes. Lists have since been adopted by media outlets as a means of organising content around news stories. Thus, the curation of these lists is important: they should contain the key information gatekeepers and present a balanced perspective on a story. Here we address this list curation process from a recommender systems perspective. We propose a variety of criteria for generating user list recommendations, based on content analysis, network analysis, and the "crowdsourcing" of existing user lists. We demonstrate that these types of criteria are often only successful for datasets with certain characteristics. To resolve this issue, we propose the aggregation of these different "views" of a news story on Twitter to produce more accurate user recommendations to support the curation process.
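The abstract does not state how the different views are aggregated. As one plausible illustration only, the sketch below combines ranked recommendation lists with a Borda count, a standard rank aggregation scheme that is swapped in here as an assumption. The function name `borda_aggregate` and the example user names are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings, top_n=None):
    """Combine several ranked lists of candidate users with a Borda count.

    A user at 0-based position p in a list of length n earns n - p points
    from that list; a user missing from a list earns nothing from it.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, user in enumerate(ranking):
            scores[user] += n - position
    ordered = sorted(scores, key=lambda user: (-scores[user], user))
    return ordered if top_n is None else ordered[:top_n]


# Hypothetical recommendations from three views of the same news story.
content_view = ["alice", "bob", "carol"]
network_view = ["bob", "dave", "alice"]
list_view = ["bob", "alice", "erin"]
combined = borda_aggregate([content_view, network_view, list_view])
```

A user ranked well by several views rises above one ranked highly by a single view, which is the intuition behind combining criteria that each work only on some datasets.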

An Eigenvalue-Based Measure for Word-Sense Disambiguation

AAAI Conferences

Current approaches for word-sense disambiguation (WSD) try to relate the senses of the target words by optimizing a score for each sense in the context of all other words' senses. However, by scoring each sense separately, they often fail to optimize the relations between the resulting senses. We address this problem by proposing a HITS-inspired method that attempts to optimize the score for the entire sense combination rather than one word at a time. We also exploit word-sense disambiguation via topic models when retrieving senses from heterogeneous sense inventories. Although this entails the relaxation of several assumptions behind current WSD algorithms, we show that our proposed method E-WSD achieves better results than current state-of-the-art approaches, without the need for additional background knowledge.
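For readers unfamiliar with HITS, the sketch below shows the standard hub and authority iteration on a small weighted graph. It is not the E-WSD method, whose details the abstract does not give; it only illustrates the eigenvector computation that a HITS-inspired score builds on. The function name `hits` and the example graph are assumptions.

```python
import math


def hits(adjacency, iterations=50):
    """Hub and authority scores for a directed graph via power iteration.

    adjacency[i][j] is the non-negative weight of the edge i -> j.
    Both score vectors are rescaled to unit Euclidean length each round.
    """
    n = len(adjacency)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority of j: weighted sum of hub scores of nodes pointing to j.
        auths = [sum(adjacency[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(a * a for a in auths)) or 1.0
        auths = [a / norm for a in auths]
        # Hub of i: weighted sum of authority scores of nodes i points to.
        hubs = [sum(adjacency[i][j] * auths[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(h * h for h in hubs)) or 1.0
        hubs = [h / norm for h in hubs]
    return hubs, auths


# Hypothetical graph: node 0 links to nodes 1 and 2, node 1 links to node 2.
graph = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
hubs, auths = hits(graph)
```

With adjacency matrix $A$, the authority scores converge to the principal eigenvector of $A^T A$ and the hub scores to that of $A A^T$, so the scores reflect the structure of the whole graph rather than each node in isolation.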

Identifying Representative Textual Sources in Blog Networks

AAAI Conferences

We apply methods from social network analysis and visualization to facilitate a study of the Irish blogosphere from a cultural studies perspective. We focus on solving the practical issues that arise when the goal is to perform textual analysis of the corpus produced by a network of bloggers. Previous studies into blogging networks have noted difficulties arising when trying to identify the extent and boundaries of these networks. As a response to calls for increasingly data-led approaches in media and cultural studies, we discuss a variety of social network analysis methods that can be used to identify which blogs can be seen as members of a posited “Irish blogging network” (and hence sources of textual material). We identify hub blogs, communities of sites corresponding to different topics, and representative bloggers within these communities. Based on this study, we propose a set of guidelines for researchers who wish to map out blogging networks.

An Analysis of Current Trends in CBR Research Using Multi-View Clustering

AI Magazine

The European Conference on Case-Based Reasoning (CBR) in 2008 marked 15 years of international and European CBR conferences where almost seven hundred research papers were published. In this report we review the research themes covered in these papers and identify the topics that are active at the moment. The main mechanism for this analysis is a clustering of the research papers based on both co-citation links and text similarity. It is interesting to note that the core set of papers has attracted citations from almost three thousand papers outside the conference collection, so it is clear that the CBR conferences are a sub-part of a much larger whole. It is remarkable that the research themes revealed by this analysis do not map directly to the sub-topics of CBR that might appear in a textbook. Instead, they reflect the applications-oriented focus of CBR research, and cover the promising application areas and research challenges that are faced.