Goto

Collaborating Authors

 Information Technology


Multi-Task Metric Learning on Network Data

arXiv.org Machine Learning

Multi-task learning (MTL) improves prediction performance in different contexts by learning models jointly on multiple different, but related tasks. Network data, which are a priori data with a rich relational structure, provide an important context for applying MTL. In particular, the explicit relational structure implies that network data is not i.i.d. data. Network data also often comes with significant metadata (i.e., attributes) associated with each entity (node). Moreover, due to the diversity and variation in network data (e.g., multi-relational links or multi-category entities), various tasks can be performed and often a rich correlation exists between them. Learning algorithms should exploit all of these additional sources of information for better performance. In this work we take a metric-learning point of view for the MTL problem in the network context. Our approach builds on structure preserving metric learning (SPML). In particular SPML learns a Mahalanobis distance metric for node attributes using network structure as supervision, so that the learned distance function encodes the structure and can be used to predict link patterns from attributes. SPML is described for single-task learning on single network. Herein, we propose a multi-task version of SPML, abbreviated as MT-SPML, which is able to learn across multiple related tasks on multiple networks via shared intermediate parametrization. MT-SPML learns a specific metric for each task and a common metric for all tasks. The task correlation is carried through the common metric and the individual metrics encode task specific information. When combined together, they are structure-preserving with respect to individual tasks. MT-SPML works on general networks, thus is suitable for a wide variety of problems. In experiments, we challenge MT-SPML on two real-word problems, where MT-SPML achieves significant improvement.


Model-Parallel Inference for Big Topic Models

arXiv.org Machine Learning

In real world industrial applications of topic modeling, the ability to capture gigantic conceptual space by learning an ultra-high dimensional topical representation, i.e., the so-called "big model", is becoming the next desideratum after enthusiasms on "big data", especially for fine-grained downstream tasks such as online advertising, where good performances are usually achieved by regression-based predictors built on millions if not billions of input features. The conventional data-parallel approach for training gigantic topic models turns out to be rather inefficient in utilizing the power of parallelism, due to the heavy dependency on a centralized image of "model". Big model size also poses another challenge on the storage, where available model size is bounded by the smallest RAM of nodes. To address these issues, we explore another type of parallelism, namely model-parallelism, which enables training of disjoint blocks of a big topic model in parallel. By integrating data-parallelism with model-parallelism, we show that dependencies between distributed elements can be handled seamlessly, achieving not only faster convergence but also an ability to tackle significantly bigger model size. We describe an architecture for model-parallel inference of LDA, and present a variant of collapsed Gibbs sampling algorithm tailored for it. Experimental results demonstrate the ability of this system to handle topic modeling with unprecedented amount of 200 billion model variables only on a low-end cluster with very limited computational resources and bandwidth.


Wikipedia-Based Distributional Semantics for Entity Relatedness

AAAI Conferences

Wikipedia provides an enormous amount of background knowledge to reason about the semantic relatedness between two entities. We propose Wikipedia-based Distributional Semantics for Entity Relatedness (DiSER), which represents the semantics of an entity by its distribution in the high dimensional concept space derived from Wikipedia. DiSER measures the semantic relatedness between two entities by quantifying the distance between the corresponding high-dimensional vectors. DiSER builds the model by taking the annotated entities only, therefore it improves over existing approaches, which do not distinguish between an entity and its surface form. We evaluate the approach on a benchmark that contains the relative entity relatedness scores for 420 entity pairs. Our approach improves the accuracy by 12% on state of the art methods for computing entity relatedness. We also show an evaluation of DiSER in the Entity Disambiguation task on a dataset of 50 sentences with highly ambiguous entity mentions. It shows an improvement of 10% in precision over the best performing methods. In order to provide the resource that can be used to find out all the related entities for a given entity, a graph is constructed, where the nodes represent Wikipedia entities and the relatedness scores are reflected by the edges. Wikipedia contains more than 4.1 millions entities, which required efficient computation of the relatedness scores between the corresponding 17 trillions of entity-pairs.


Robotic and Virtual Companions for Isolated Older Adults

AAAI Conferences

The agent is "always on," i.e. it is continuously available and aware (using a camera and infrared motion sensor) when the user is in its presence and can initiate interaction with the user, rather than requiring the user login to begin interaction. We expect that the agent will help reduce the user's isolation not just by always being around but also by specific activities that connect the user with friends, family and the local community. Our goal is for the agent to be a natural, humanlike presence that "resides" in the user's apartment. Beginning in the late summer of 2014, we will be placing our agents with users for a monthlong evaluation study. Figure 1: Virtual agent interface -- "Karen" Three issues of our project directly concern the topics of this workshop are: (1) the embodiment of the agent, (2) the engagement behaviors that are associated with being "always measures we will be using are questionnaires that assess the on," and (3) AI tools for support intelligent behavior.


Instance-Privacy Preserving Crowdsourcing

AAAI Conferences

Crowdsourcing is a technique to outsource tasks to a number of workers. Although crowdsourcing has many advantages, it gives rise to the risk that sensitive information may be leaked, which has limited the spread of its popularity. Task instances (data workers receive to process tasks) often contain sensitive information, which can be extracted by workers. For example, in an audio transcription task, an audio file corresponds to an instance, and the content of the audio (e.g., the abstract of a meeting) can be sensitive information. In this paper, we propose a quantitative analysis framework for the instance privacy problem. The proposed framework supplies us performance measures of instance privacy preserving protocols. As a case study, we apply the proposed framework to an instance clipping protocol and analyze the properties of the protocol. The protocol preserves privacy by clipping instances to limit the amount of information workers obtain. The results show that the protocol can balance task performance and instance privacy preservation. They also show that the proposed measure is consistent with standard measures, which validates the proposed measure.


Groupsourcing: Problem Solving, Social Learning and Knowledge Discovery on Social Networks

AAAI Conferences

Increasingly social networks are being used for citizen science, where members of the public contribute knowledge to scientific endeavours. Tasks can be presented and solved using human computation, termed groupsourcing, with users benefiting from community tuition and experts gaining knowledge from the crowd. This paper gives details of a prototype that utilises groupsourcing to solve image classification tasks, to support social learning and to facilitate knowledge discovery in the domain of marine biology.


Groupsourcing: Distributed Problem Solving Using Social Networks

AAAI Conferences

Crowdsourcing and citizen science have established themselves in the mainstream of research methodology in recent years, employing a variety of methods to solve problems using human computation. An approach described here, termed "groupsourcing", uses social networks to present problems and collect solutions. This paper details a method for archiving social network messages and investigates messages containing an image classification task in the domain of marine biology. In comparison to other methods, groupsourcing offers a high accuracy, data-driven and low cost approach.


Adapting Collaborative Filtering to Personalized Audio Production

AAAI Conferences

Recommending media objects to users typically requires users to rate existing media objects so as to understand their preferences. The number of ratings required to produce good suggestions can be reduced through collaborative filtering. Collaborative filtering is more difficult when prior users have not rated the same set of media objects as the current user or each other. In this work, we describe an approach to applying prior user data in a way that does not require users to rate the same media objects and that does not require imputation (estimation) of prior user ratings of objects they have not rated. This approach is applied to the problem of finding good equalizer settings for music audio and is shown to greatly reduce the number of ratings the current user must make to find a good equalization setting.


Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds

arXiv.org Machine Learning

In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex. Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for $(\epsilon,0)$- and $(\epsilon,\delta)$-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contributions of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.


Linearized and Single-Pass Belief Propagation

arXiv.org Artificial Intelligence

How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. One main problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops. This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closed-form solution via intuitive matrix equations and, thus, comes with convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multi-class settings. Plus, it allows a compact implementation in SQL. The paper also introduces Single-pass Belief Propagation (SBP), a "localized" version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard