
Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors

AAAI Conferences

In this paper, we extend existing work on latent attribute inference by leveraging the principle of homophily: we evaluate the inference accuracy gained by augmenting the user features with features derived from the Twitter profiles and postings of her friends. We consider three attributes which have varying degrees of assortativity: gender, age, and political affiliation. Our approach yields a significant and robust increase in accuracy for both age and political affiliation, indicating that our approach boosts performance for attributes with moderate to high assortativity. Furthermore, different neighborhood subsets yielded optimal performance for different attributes, suggesting that different subsamples of the user's neighborhood characterize different aspects of the user herself. Finally, inferences using only the features of a user's neighbors outperformed those based on the user's features alone. This suggests that the neighborhood context alone carries substantial information about the user.


The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City

AAAI Conferences

Studying the social dynamics of a city on a large scale has traditionally been a challenging endeavor, requiring long hours of observation and interviews, usually resulting in only a partial depiction of reality. At the same time, the boundaries of municipal organizational units, such as neighborhoods and districts, are largely statically defined by the city government and do not always reflect the character of life in these areas. To address both difficulties, we introduce a clustering model and research methodology for studying the structure and composition of a city based on the social media its residents generate. We use data from approximately 18 million check-ins collected from users of a location-based online social network. The resulting clusters, which we call Livehoods, are representations of the dynamic urban areas that comprise the city. We take an interdisciplinary approach to validating these clusters, interviewing 27 residents of Pittsburgh, PA, to see how their perceptions of the city project onto our findings there. Our results provide strong support for the discovered clusters, showing how Livehoods reveal the distinctly characterized areas of the city and the forces that shape them.


Modeling Polarizing Topics: When Do Different Political Communities Respond Differently to the Same News?

AAAI Conferences

Political discourse in the United States is becoming increasingly polarized. This polarization frequently causes different communities to react very differently to the same news events. Political blogs, as a form of social media, provide a unique insight into this phenomenon. We present a multitarget, semisupervised latent variable model, MCR-LDA, to model this process by jointly analyzing political blog posts and their comment sections from different political communities to predict the degree of polarization that news topics cause. Inspecting the model after inference reveals the topics and the degree to which each triggers polarization. In this approach, community responses to news topics are observed using sentiment polarity and comment volume, which serves as a proxy for the level of interest in the topic. In this context, we also present computational methods to assign sentiment polarity to the comments, which serve as targets for latent variable models that predict the polarity based on the topics in the blog content. Our results show that the joint modeling of communities with different political beliefs using MCR-LDA does not sacrifice accuracy in sentiment polarity prediction when compared to approaches that are tailored to specific communities, and additionally provides a view of the polarization in responses from the different communities.


Vector-valued Reproducing Kernel Banach Spaces with Applications to Multi-task Learning

arXiv.org Machine Learning

The purpose of this paper is to establish the notion of vector-valued reproducing kernel Banach spaces and demonstrate its applications to multi-task machine learning. Built on the theory of scalar-valued reproducing kernel Hilbert spaces (RKHS) [3], kernel methods have been proven successful in single task machine learning [10, 14, 29, 30, 33]. Multi-task learning where the unknown target function to be learned from finite sample data is vector-valued appears more often in practice. References [13, 25] proposed the development of kernel methods for learning multiple related tasks simultaneously. The mathematical foundation used there was the theory of vector-valued RKHS [5, 27].


The Future of Search and Discovery in Big Data Analytics: Ultrametric Information Spaces

arXiv.org Machine Learning

Under the heading of "Addressing the big data challenge", the European 7th Framework Programme sees the issue thus (see INFSO, 2012): "Recent industry reports detail how data volumes are growing at a faster rate than our ability to interpret and exploit them for innovative ICT applications, for decision support, planning, monitoring, control and interaction. This includes unstructured data types such as video, audio, images and free text as well as structured data types such as database records, sensor readings and 3D. While each of these types requires some specific form of processing and analytics, many of the general principles for managing and storing them at extreme scales are common across all of them." Analytics tool capability is called for, to address these burgeoning issues in the data intensive industries, to support "effective policy making and implementation" of public bodies resulting in "significant annual savings from Big Data applications", and also to exploit open, linked data - "foster the reuse of public sector information and strengthen other open data activities linked to commercial exploitation." The "big data" marketplace is stated to be potentially worth approximately USD 600 billion. To address the challenges of search and discovery in massive and complex data sets and data flows, it is our contention in this work that we must move to an appropriate topology - to an appropriate framework such that computation is greatly facilitated. Our work is all about empowering those who are involved in data analytics, through clustering and related algorithms, to face these new challenges. Scalability and interactivity are two of the performance issues that follow directly from clustering algorithms, for search, retrieval and discovery, that are of linear computational complexity or better (logarithmic, or constant).


Generalized Fisher Score for Feature Selection

arXiv.org Machine Learning

Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to its score under the Fisher criterion, which leads to a suboptimal subset of features. In this paper, we present a generalized Fisher score to jointly select features. It aims at finding a subset of features that maximizes the lower bound of the traditional Fisher score. The resulting feature selection problem is a mixed integer program, which can be reformulated as a quadratically constrained linear program (QCLP). It is solved by a cutting plane algorithm, in each iteration of which a multiple kernel learning problem is solved alternately by multivariate ridge regression and projected gradient descent. Experiments on benchmark data sets indicate that the proposed method outperforms Fisher score as well as many other state-of-the-art feature selection methods.
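
For orientation, the following is a minimal sketch of the traditional per-feature Fisher score that the abstract takes as its baseline, not the generalized joint method the paper proposes. It assumes the common textbook form (between-class scatter divided by within-class scatter, computed one feature at a time); the function name `fisher_scores` and the toy data are illustrative only.

```python
from collections import defaultdict


def fisher_scores(rows, labels):
    """Return one classic Fisher score per feature column.

    Assumed form: sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k,
    evaluated independently for every feature.
    """
    n_features = len(rows[0])
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        overall_mean = sum(column) / len(column)
        between = 0.0  # between-class scatter for feature j
        within = 0.0   # within-class scatter for feature j
        for members in by_class.values():
            values = [row[j] for row in members]
            n_k = len(values)
            mean_k = sum(values) / n_k
            var_k = sum((v - mean_k) ** 2 for v in values) / n_k
            between += n_k * (mean_k - overall_mean) ** 2
            within += n_k * var_k
        # Small constant avoids division by zero for constant features.
        scores.append(between / (within + 1e-12))
    return scores


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [[0.0, 5.0], [0.2, 1.0], [1.0, 4.0], [1.2, 2.0]]
labels = [0, 0, 1, 1]
scores = fisher_scores(rows, labels)
print(scores)
```

Because every feature is scored in isolation, two redundant features would both rank highly, which is the weakness of independent selection that the abstract points to.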


Learning Determinantal Point Processes

arXiv.org Artificial Intelligence

Determinantal point processes (DPPs), which arise in random matrix theory and quantum physics, are natural models for subset selection problems where diversity is preferred. Among many remarkable properties, DPPs offer tractable algorithms for exact inference, including computing marginal probabilities and sampling; however, an important open question has been how to learn a DPP from labeled training data. In this paper we propose a natural feature-based parameterization of conditional DPPs, and show how it leads to a convex and efficient learning formulation. We analyze the relationship between our model and binary Markov random fields with repulsive potentials, which are qualitatively similar but computationally intractable. Finally, we apply our approach to the task of extractive summarization, where the goal is to choose a small subset of sentences conveying the most important information from a set of documents. In this task there is a fundamental tradeoff between sentences that are highly relevant to the collection as a whole, and sentences that are diverse and not repetitive. Our parameterization allows us to naturally balance these two characteristics. We evaluate our system on data from the DUC 2003/04 multi-document summarization task, achieving state-of-the-art results.
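
To make the diversity preference concrete, here is a small sketch of how a basic (unconditional) DPP assigns probabilities to subsets. It assumes the standard L-ensemble form, P(Y) = det(L_Y) / det(L + I), and does not reproduce the feature-based conditional parameterization or the learning procedure described in the abstract. The kernel values and the helper names `det` and `subset_probability` are illustrative assumptions.

```python
from itertools import combinations


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting.

    Returns 1.0 for the empty matrix, as the DPP formula requires.
    """
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def subset_probability(L, subset):
    """P(Y = subset) for an L-ensemble DPP with kernel L."""
    n = len(L)
    sub = [[L[i][j] for j in subset] for i in subset]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(sub) / det(L_plus_I)


# Hypothetical kernel: items 0 and 1 are near-duplicates, item 2 is distinct.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

p_similar_pair = subset_probability(L, (0, 1))
p_diverse_pair = subset_probability(L, (0, 2))
total = sum(subset_probability(L, s)
            for k in range(len(L) + 1)
            for s in combinations(range(len(L)), k))
print(p_similar_pair, p_diverse_pair, total)
```

The pair of near-duplicate items receives a much lower probability than the diverse pair, and the probabilities over all subsets sum to one.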


Citizen Science: Contributions to Astronomy Research

arXiv.org Artificial Intelligence

In particular, the Zooniverse projects have demonstrated that research projects can significantly benefit from large numbers of participants, especially in cases where human cognitive abilities can supplement automated data analysis. Initial results have shown that for observatories collecting large, sometimes complicated and also survey type datasets, Zooniverse methodology produces robust results as well as serendipitous discoveries. Specifically, citizen scientists have contributed to the results from the large SDSS sky survey, the concentrated transient/planet finding studies from the NASA Kepler mission, characterization of lunar craters and features from the Lunar Reconnaissance Observatory, and the galaxy morphology studies from HST Treasury programs, to name a few. Selection of projects is critical if we are not to waste the time of volunteers or to fail to meet the goal of providing authentic engagement with research. Basic data analysis tasks should, where possible, be automated rather than thoughtlessly passed to citizen scientists.


Paradoxes of Multiple Elections: An Approximation Approach

AAAI Conferences

When agents need to make decisions on multiple issues, applying common voting rules becomes computationally hard due to the exponentially large number of alternatives. One computationally efficient solution is to vote on the issues sequentially. In this paper, we investigate how well the winner under the sequential voting process approximates the winners under some common voting rules that admit natural scoring functions that can serve as a basis for approximation results. We focus on multi-issue domains where each issue is binary and the agents' preferences are O-legal, separable, represented by LP-trees, or lexicographic. We show some generalized paradoxes of multiple elections: Sequential voting does not approximate many common voting rules well even when the preferences are O-legal or separable. However, these paradoxes are much alleviated or even completely avoided when the preferences are lexicographic or represented by LP-trees. Our results thus draw a border for conditions under which sequential voting rules, which have extremely low computational and communicational cost, are good approximations of some common voting rules w.r.t. their corresponding scoring functions.
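
As a rough illustration of the kind of paradox the abstract refers to, the sketch below runs issue-by-issue majority voting on three binary issues. It assumes separable preferences, so each voter's vote on an issue does not depend on the outcomes of other issues and a ballot is simply the voter's top-ranked combination. The three ballots are a well-known textbook-style example rather than data from the paper, and `sequential_majority` is a hypothetical helper name.

```python
def sequential_majority(ballots):
    """Decide each binary issue in turn by simple majority.

    Each ballot is a tuple of 0/1 votes, one per issue.
    Ties (possible only with an even number of voters) resolve to 0.
    """
    n_issues = len(ballots[0])
    outcome = []
    for issue in range(n_issues):
        yes_votes = sum(ballot[issue] for ballot in ballots)
        outcome.append(1 if 2 * yes_votes > len(ballots) else 0)
    return tuple(outcome)


# Three voters, three issues; each tuple is a voter's favorite combination.
ballots = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
winner = sequential_majority(ballots)
print(winner, winner in ballots)
```

Every issue passes with a two-to-one majority, yet the winning combination is nobody's favorite, which shows how cheap per-issue voting can diverge from what a rule applied to whole combinations would select.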


Lecture in Remembrance of John McCarthy

AAAI Conferences

This talk analyzes McCarthy's myriad contributions to artificial intelligence and knowledge representation through the set of drosophilae that he proposed, [...] McCarthy's strengths as both theoretician and engineer, and explore how these drosophilae shaped his research. John McCarthy, famous for his role in the development of time-sharing, for inventing the computer [...] Since 2010, she has served as principal investigator of the Evaluation and Knowledge Infrastructure Team for DARPA's Machine Reading Program.