 Media


Visual Programming of Plan Dynamics Using Constraints and Landmarks

AAAI Conferences

In recent years, there has been considerable interest in the use of planning techniques in the area of new media. Many traditional planning notions no longer apply in the context of these applications. In particular, it can be difficult to answer the important question of what constitutes a good plan for the domain, but there is an emerging consensus that plan dynamics play an important role. As a consequence, it is important to support representation of such aspects. Our solution is to introduce a meta-level of representation that is an abstraction of the domain with respect to both time and causality, and to develop a visual representation of this in the form of a narrative arc. This visual representation can then be used in a visual programming approach to the exploration and specification of plan dynamics. In this paper, we outline this approach to meta-level representation using constraints, along with the visual programming interface we have developed. We illustrate the approach with examples of visual programming in the development of an interactive entertainment system based on Shakespeare's play "The Merchant of Venice".


Learning Parameters of the K-Means Algorithm From Subjective Human Annotation

AAAI Conferences

The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the papers are scanned and high resolution OCR software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary and a large number of the articles are labeled "editorial" without further categorization. To provide a more refined grouping of articles, unsupervised machine learning algorithms (such as K-Means) are being investigated. The K-Means algorithm requires tuning of parameters such as the number of clusters and the seeding mechanism to ensure that the search is not prone to being caught in a local minimum. We designed a pilot study to observe whether humans are adept at finding sub-categories. The subjective labels provided by humans are used as a guide to compare performance of the automated clustering techniques. In addition, seeds provided by annotators are carefully incorporated into a semi-supervised K-Means algorithm (Seeded K-Means); empirical results indicate that this helps to improve performance and provides an intuitive sub-categorization of the articles labeled "editorial" by the OCR engine.
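As a rough illustration of the seeding idea, the sketch below initialises each centroid from a group of annotator-provided seed points and then runs ordinary K-Means iterations. It is a minimal sketch under stated assumptions, not the study's implementation: the function name `seeded_kmeans`, the Euclidean distance, and the choice to use seeds only for initialisation are all assumptions made for illustration.

```python
import math


def seeded_kmeans(points, seed_groups, max_iter=100):
    """Cluster `points` (tuples of floats) starting from seed groups.

    `seed_groups` is a list with one list of seed points per cluster
    (hypothetical input format). Seeds only set the initial centroids.
    Returns (centroids, assignments).
    """
    def mean(vectors):
        n = len(vectors)
        return tuple(sum(v[i] for v in vectors) / n
                     for i in range(len(vectors[0])))

    # One centroid per seed group, so the number of clusters is fixed
    # by the number of groups the annotators supplied.
    centroids = [mean(group) for group in seed_groups]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = mean(members)
    return centroids, assignments
```

Because the seeds fix both the number of clusters and the starting centroids, the result no longer depends on a random initialisation.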


Bias in Hard News Articles from Fox News and MSNBC: An Empirical Assessment Using the Gramulator

AAAI Conferences

Hard news articles, just like op-ed articles, can reflect a media organization's bias. This study assesses bias in the hard news articles published by Fox News and MSNBC. Indicative linguistic features identified by the Gramulator reveal biases in corpora from the two networks.


A Contrastive Corpus Analysis of Modern Art Criticism and Photography Criticism

AAAI Conferences

In this study, we analyze two corpora of art critiques: one on the subject of photography and the other on the subject of modern art. We use two computational tools, the Gramulator and GPAT, to analyze both sets of texts. The Gramulator was used to show the indicative linguistic features that make photography criticism a distinct genre from modern art criticism. Results suggest that lexical features, structural formats, and genre consistency differed significantly between the two corpora. The findings provide information for teachers, students, publishers, and curriculum developers for creating more effective writing and teaching materials. This includes material for English for Specific Purposes (ESP) in the form of textbooks, workbooks and other external learning material.


Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization

AAAI Conferences

This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier, by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top ranked methods, with the added value that it does not require a demanding human expert workload to train.
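The post-processing step can be pictured with a small sketch. The abstract does not give the rule language, so the rule format, the function name `apply_rules`, and the matching semantics below (a negative term removes a category, a positive term adds it) are hypothetical; the base classifier's output is simply passed in as a set of categories, and "relevant" terms are left out because their meaning is not specified.

```python
def apply_rules(text, predicted, rules):
    """Adjust base-classifier categories using hand-written term rules.

    `predicted` is the set of categories from the base model (for
    example a k-Nearest Neighbor classifier). `rules` maps a category
    to lists of "positive" and "negative" (possibly multiword) terms.
    All of this structure is an illustrative assumption.
    """
    lowered = text.lower()
    final = set(predicted)
    for category, rule in rules.items():
        if any(term in lowered for term in rule.get("negative", ())):
            # A negative term vetoes the category: filters a false positive.
            final.discard(category)
        elif any(term in lowered for term in rule.get("positive", ())):
            # A positive term forces the category: recovers a false negative.
            final.add(category)
    return final


rules = {"sports": {"positive": ["world cup"], "negative": ["stock market"]}}
recovered = apply_rules("The World Cup final drew record crowds", set(), rules)
filtered = apply_rules("Stock market rallies after cup sponsor deal", {"sports"}, rules)
```

In this sketch a noisy category can be tuned by editing its term lists, without retraining the base model.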


Exploring Interaction Between Images and Texts for Web Image Categorization

AAAI Conferences

With the rapid development of technologies for fast access to the Internet and the popularization of digital cameras, an enormous number of digital images are posted and shared online every day. Web images are usually organized by topics of events and are often assigned appropriate topic-related text descriptions. Given a set of images along with corresponding texts, a challenging problem is how to utilize the available information to perform image retrieval tasks, such as image classification and image clustering. Previous works on image categorization focus on either adopting text or image features, or simply combining these two types of information together. In this paper, we propose two novel approaches (Dynamic Weighting and Region-based Semantic Concept Integration) to categorize the images under the "supervision" of topic-related text descriptions. In addition, we provide a comparative experimental investigation on utilizing text and image information to tackle image classification. Empirical experiments on a manually collected image dataset (consisting of images related to the events after disasters) demonstrate the efficacy of our proposed classification methods.


Notes on a New Philosophy of Empirical Science

arXiv.org Machine Learning

This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a standard benchmark database to a small size, taking into account the length of the compressor itself. This methodology therefore includes an Occam principle as well as a solution to the problem of demarcation. Because of the fundamental difficulty of lossless compression, this type of research must be empirical in nature: compression can only be achieved by discovering and characterizing empirical regularities in the data. Because of this, the philosophy provides a way to reformulate fields such as computer vision and computational linguistics as empirical sciences: the former by attempting to compress databases of natural images, the latter by attempting to compress large text databases. The book argues that the rigor and objectivity of the compression principle should set the stage for systematic progress in these fields. The argument is especially strong in the context of computer vision, which is plagued by chronic problems of evaluation. The book also considers the field of machine learning. Here the traditional approach requires that the models proposed to solve learning problems be extremely simple, in order to avoid overfitting. However, the world may contain intrinsically complex phenomena, which would require complex models to understand. The compression philosophy can justify complex models because of the large quantity of data being modeled (if the target database is 100 Gb, it is easy to justify a 10 Mb model). The complex models and abstractions learned on the basis of the raw data (images, language, etc.) can then be reused to solve any specific learning problem, such as face recognition or machine translation.
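The scoring rule described here, compressed size plus the length of the compressor, can be sketched in a few lines. This is only an illustration under assumptions: `total_cost` is a hypothetical helper, the compressor length is passed in as a plain byte count, and `zlib` stands in for a theory-specific compressor.

```python
import zlib


def total_cost(data, compress, compressor_size):
    """Score a compressor: compressed bytes plus the compressor's own size.

    `compress` maps bytes to bytes. `compressor_size` is the length in
    bytes of the program or model implementing it (assumed to be known).
    Lower is better.
    """
    return len(compress(data)) + compressor_size


benchmark = b"abc" * 1000  # toy stand-in for a benchmark database

# A "theory" that finds no regularity stores the data verbatim.
identity_cost = total_cost(benchmark, lambda d: d, 0)

# A general-purpose compressor exploits the repetition.
zlib_cost = total_cost(benchmark, zlib.compress, 0)
```

Charging for the compressor's size is what prevents a "theory" from simply memorising the benchmark: a model that embeds the data gains nothing, because its own length is added to the score.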


Business Listing Classification Using Case Based Reasoning and Joint Probability

AAAI Conferences

One challenge of building and maintaining large-scale data management systems is managing data fusion from multiple data sources. Often, different data sources represent the same data element in slightly different ways. These differences may represent an error in the data or a disagreement between sources on the correct value that best represents the data point. When the quantity of data managed and fused becomes sufficiently large, manual review becomes impossible, and automated systems must be built to manage data fusion. Some of the traditional solutions use simple voting theory, Dempster-Shafer theory, fuzzy matching and incremental learning. This paper presents a novel approach to data fusion in the domain of business listings. The task at hand, business listing categorization, suffers from conflicting and incomplete data from disparate data sources. Given the need for a high degree of accuracy in this task, we use a combination of case-based reasoning, joint probability, and domain-specific rules to improve data accuracy above other methods.


Emerging Topic Detection for Business Intelligence Via Predictive Analysis of 'Meme' Dynamics

AAAI Conferences

Detecting and characterizing emerging topics of discussion and consumer trends through analysis of Internet data is of great interest to businesses. This paper considers the problem of monitoring the Web to spot emerging memes – distinctive phrases which act as “tracers” for topics – as a means of early detection of new topics and trends. We present a novel methodology for predicting which memes will propagate widely, appearing in hundreds or thousands of blog posts, and which will not, thereby enabling discovery of significant topics. We begin by identifying measurables which should be predictive of meme success. Interestingly, these metrics are not those traditionally used for such prediction but instead are subtle measures of meme dynamics. These metrics form the basis for learning a classifier which predicts, for a given meme, whether or not it will propagate widely. The utility of the prediction methodology is demonstrated through analysis of a sample of 200 memes which emerged online during the second half of 2008.


Refining Recency Search Results with User Click Feedback

arXiv.org Artificial Intelligence

Traditional machine-learned ranking systems for web search are often trained to capture stationary relevance of documents to queries, which has limited ability to track non-stationary user intention in a timely manner. In recency search, for instance, the relevance of documents to a query on breaking news often changes significantly over time, requiring effective adaptation to user intention. In this paper, we focus on recency search and study a number of algorithms to improve ranking results by leveraging user click feedback. Our contributions are three-fold. First, we use real search sessions collected in a random exploration bucket for reliable offline evaluation of these algorithms, which provides an unbiased comparison across algorithms without online bucket tests. Second, we propose a re-ranking approach to improve search results for recency queries using user clicks. Third, our empirical comparison of a dozen algorithms on real-life search data suggests the importance of a few algorithmic choices in these applications, including generalization across different query-document pairs, specialization to popular queries, and real-time adaptation to user clicks.