Media
Automatic Generation of Social Tags for Music Recommendation
Eck, Douglas, Lamere, Paul, Bertin-Mahieux, Thierry, Green, Stephen
Social tags are user-generated keywords associated with some resource on the Web. In the case of music, social tags have become an important component of "Web 2.0" recommender systems, allowing users to generate playlists based on use-dependent terms such as "chill" or "jogging" that have been applied to particular songs. In this paper, we propose a method for predicting these social tags directly from MP3 files. Using a set of boosted classifiers, we map audio features onto social tags collected from the Web. The resulting automatic tags (or "autotags") furnish information about music that is otherwise untagged or poorly tagged, allowing for insertion of previously unheard music into a social recommender. This avoids the "cold-start problem" common in such systems. Autotags can also be used to smooth the tag space from which similarities and recommendations are made by providing a set of comparable baseline tags for all tracks in a recommender system.
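As a rough illustration of what "a set of boosted classifiers" mapping audio features onto a tag can look like, here is a minimal sketch of AdaBoost with decision stumps for a single tag. This is a generic stand-in, not the authors' method: the abstract does not say which boosting variant or which features were used, and the two-dimensional feature vectors and the function names (`train_stump`, `adaboost`, `predict`) are hypothetical.

```python
import math

def stump_output(x, feature, threshold, sign):
    """A decision stump: vote `sign` if the feature reaches the threshold, else -sign."""
    return sign if x[feature] >= threshold else -sign

def train_stump(X, y, weights):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for sign in (1, -1):
                error = sum(
                    w for w, x, label in zip(weights, X, y)
                    if stump_output(x, feature, threshold, sign) != label
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best

def adaboost(X, y, rounds=5):
    """Train one binary tag classifier; labels are +1 (tag applies) or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, sign = train_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, feature, threshold, sign))
        # Increase the weight of misclassified examples, then renormalise.
        weights = [
            w * math.exp(-alpha * label * stump_output(x, feature, threshold, sign))
            for w, x, label in zip(weights, X, y)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * stump_output(x, f, t, s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical two-dimensional "audio features" and labels for one tag.
X = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]]
y = [-1, -1, 1, 1]
model = adaboost(X, y)
```

In a full autotagging setup one such classifier would presumably be trained per tag, with the weighted vote serving as a per-tag score for untagged tracks.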
Supervised Topic Models
McAuliffe, Jon D., Blei, David M.
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.
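For orientation, the generative process usually associated with sLDA can be written as follows for the Gaussian-response case. This is a sketch of one special case only; the abstract states that the model accommodates a variety of response types, and the symbols here (K topics, N words per document, Dirichlet parameter alpha, topic distributions beta, regression parameters eta and sigma^2) are notation assumed for illustration.

```latex
\begin{align*}
\theta \mid \alpha &\sim \mathrm{Dirichlet}(\alpha) \\
z_n \mid \theta &\sim \mathrm{Multinomial}(\theta), \qquad n = 1, \dots, N \\
w_n \mid z_n, \beta_{1:K} &\sim \mathrm{Multinomial}(\beta_{z_n}) \\
y \mid z_{1:N}, \eta, \sigma^2 &\sim \mathcal{N}\!\left(\eta^{\top} \bar{z},\ \sigma^2\right),
\qquad \bar{z} = \frac{1}{N} \sum_{n=1}^{N} z_n
\end{align*}
```

The response y depends on the empirical topic frequencies of the document, which is what distinguishes this from fitting unsupervised LDA first and running a separate regression afterwards.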
The Information Ecology of Social Media and Online Communities
Finin, Tim (University of Maryland, Baltimore County) | Joshi, Anupam (University of Maryland, Baltimore County) | Kolari, Pranam (Yahoo! Applied Research) | Java, Akshay (University of Maryland, Baltimore County) | Kale, Anubhav (Microsoft) | Karandikar, Amit (Microsoft)
Citizens, both young and old, are also discovering how social media technology can improve their lives and give them more voice in the world. We begin by describing an overarching task of [...] Pursuing this task uncovers a number of problems that must be addressed, three of which we describe in [...] The first is recognizing spam in the form of spam blogs (splogs) and [...] The second is developing more effective techniques to recognize the social structure of blog communities. [...] the abstract model for the underlying blog network structure and how it evolves. [...] feeds, and semistructured metadata in the form of extensible markup language (XML) and resource description [...] they provide more useful, trustworthy, and reliable. It differs, however, in ways that affect how it should be modeled, analyzed, and exploited. [...] model for the general web is as a directed graph of web pages with undifferentiated links between pages. [...] has a much richer network structure in that there are more types of nodes [...] For example, the people who contribute to blogs and au- [...] Figure 2 shows a hypothetical blog graph and its corresponding flow of information in the influence graph. Studies on influence in social networks and collaboration graphs have typically focused on the task of identifying key individuals who play an important role in propagating information. This is similar to finding authoritative pages on the web.
From Data to the p-Adic or Ultrametric Model
We model anomaly and change in data by embedding the data in an ultrametric space. Taking our initial data as cross-tabulation counts (or other input data formats), Correspondence Analysis allows us to endow the information space with a Euclidean metric. We then model anomaly or change by an induced ultrametric. The induced ultrametric that we are particularly interested in takes a sequential (e.g. temporal) ordering of the data into account. We apply this work to the flow of narrative expressed in the film script of the Casablanca movie; and to the evolution between 1988 and 2004 of the Colombian social conflict and violence.
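To make the term concrete: a distance d is an ultrametric when it satisfies the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)) for all points x, y, z. The sketch below checks that property on a small distance matrix. It only illustrates the definition; it is not the induced, sequence-constrained ultrametric described in the abstract, and the function name `is_ultrametric` and the example matrices are made up for illustration.

```python
from itertools import permutations

def is_ultrametric(dist, tol=1e-12):
    """Check the strong triangle inequality on a square distance matrix.

    dist is a list of lists with dist[i][j] the distance between points i and j.
    """
    n = len(dist)
    for i, j, k in permutations(range(n), 3):
        if dist[i][k] > max(dist[i][j], dist[j][k]) + tol:
            return False
    return True

# Distances read off a two-level hierarchy: points 0 and 1 merge at height 1,
# and point 2 joins them at height 3.
tree_distances = [
    [0, 1, 3],
    [1, 0, 3],
    [3, 3, 0],
]

# Euclidean distances between the points 0, 1 and 2 on a line.
line_distances = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
]
```

The first matrix passes the check and the second fails it, since on the line d(0, 2) = 2 exceeds max(d(0, 1), d(1, 2)) = 1. Every triangle in an ultrametric space is isosceles with a small base, which is why hierarchies induce ultrametrics.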
The Correspondence Analysis Platform for Uncovering Deep Structure in Data and Information
We study two aspects of information semantics: (i) the collection of all relationships, (ii) tracking and spotting anomaly and change. The first is implemented by endowing all relevant information spaces with a Euclidean metric in a common projected space. The second is modelled by an induced ultrametric. A very general way to achieve a Euclidean embedding of different information spaces based on cross-tabulation counts (and from other input data formats) is provided by Correspondence Analysis. From there, the induced ultrametric that we are particularly interested in takes a sequential (e.g. temporal) ordering of the data into account. We employ such a perspective to look at narrative, "the flow of thought and the flow of language" (Chafe). In application to policy decision making, we show how we can focus analysis in a small number of dimensions.
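As background on how Correspondence Analysis yields a Euclidean embedding from cross-tabulation counts: rows are compared through the chi-squared distance between their profiles, and the factor projection turns this into an ordinary Euclidean distance. The notation below is assumed for illustration (n_ij are the counts, n their grand total, f_ij = n_ij / n, with row and column margins written using a dot subscript).

```latex
d_{\chi^2}^{2}(i, i') \;=\; \sum_{j} \frac{1}{f_{\cdot j}}
\left( \frac{f_{ij}}{f_{i \cdot}} - \frac{f_{i'j}}{f_{i' \cdot}} \right)^{2},
\qquad
f_{i \cdot} = \sum_{j} f_{ij}, \quad f_{\cdot j} = \sum_{i} f_{ij}
```

Each row is first normalised to a profile, so two rows with the same proportions but different totals are at distance zero; the weighting by the inverse column margin keeps frequent columns from dominating.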
The Structure of Narrative: the Case of Film Scripts
Murtagh, Fionn, Ganz, Adam, McKie, Stewart
We analyze the style and structure of story narrative using the case of film scripts. The practical importance of this is noted, especially the need to have support tools for television movie writing. We use the Casablanca film script, and scripts from six episodes of CSI (Crime Scene Investigation). For analysis of style and structure, we quantify various central perspectives discussed in McKee's book, "Story: Substance, Structure, Style, and the Principles of Screenwriting". Film scripts offer a useful point of departure for exploration of the analysis of more general narratives. Our methodology, using Correspondence Analysis, and hierarchical clustering, is innovative in a range of areas that we discuss. In particular this work is groundbreaking in taking the qualitative analysis of McKee and grounding this analysis in a quantitative and algorithmic framework.
Knowledge Technologies
Several technologies are emerging that provide new ways to capture, store, present and use knowledge. This book is the first to provide a comprehensive introduction to five of the most important of these technologies: Knowledge Engineering, Knowledge Based Engineering, Knowledge Webs, Ontologies and Semantic Webs. For each of these, answers are given to a number of key questions (What is it? How does it operate? How is a system developed? What can it be used for? What tools are available? What are the main issues?). The book is aimed at students, researchers and practitioners interested in Knowledge Management, Artificial Intelligence, Design Engineering and Web Technologies. During the 1990s, Nick worked at the University of Nottingham on the application of AI techniques to knowledge management and on various knowledge acquisition projects to develop expert systems for military applications. In 1999, he joined Epistemics where he worked on numerous knowledge projects and helped establish knowledge management programmes at large organisations in the engineering, technology and legal sectors. He is author of the book "Knowledge Acquisition in Practice", which describes a step-by-step procedure for acquiring and implementing expertise. He maintains strong links with leading research organisations working on knowledge technologies, such as knowledge-based engineering, ontologies and semantic technologies.
Differential Entropic Clustering of Multivariate Gaussians
Davis, Jason V., Dhillon, Inderjit S.
Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application.
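For reference, the standard relative entropy (KL divergence) between two d-dimensional Gaussians has the closed form below. Reading the first bracket as a LogDet (Burg) matrix divergence between the covariances and the second term as a Mahalanobis distance between the means is one plausible way to see the connection to Bregman matrix divergences mentioned above; the abstract itself does not spell out the formula, so treat this as a sketch.

```latex
D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2)\bigr)
= \tfrac{1}{2} \Bigl[ \operatorname{tr}\bigl(\Sigma_1 \Sigma_2^{-1}\bigr)
  - \log \det\bigl(\Sigma_1 \Sigma_2^{-1}\bigr) - d \Bigr]
+ \tfrac{1}{2} (\mu_1 - \mu_2)^{\top} \Sigma_2^{-1} (\mu_1 - \mu_2)
```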
Isotonic Conditional Random Fields and Local Sentiment Flow
We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is well suited to handle this problem. Using the Möbius transform, we express the model as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization.
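For readers unfamiliar with isotonic regression, which the abstract cites as its inspiration, the sketch below shows the classic pool-adjacent-violators procedure: it finds the non-decreasing sequence closest in squared error to the input. This is plain isotonic regression only, not the isotonic conditional random field of the paper, and the function name `isotonic_fit` is hypothetical.

```python
def isotonic_fit(values):
    """Least-squares non-decreasing fit via pool adjacent violators."""
    blocks = []  # each block is [sum of values, number of values]
    for value in values:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

smoothed = isotonic_fit([1, 3, 2, 4])
```

Here the out-of-order pair 3, 2 is pooled into its mean, giving 1.0, 2.5, 2.5, 4.0. The monotonicity constraint is the same idea as requiring that a more positive word never pushes a predicted ordinal sentiment label lower.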