Yahoo! Research


HARP: Hierarchical Representation Learning for Networks

AAAI Conferences

We present HARP, a novel method for learning low-dimensional embeddings of a graph's nodes which preserves higher-order structural features. Our proposed method achieves this by compressing the input graph prior to embedding it, effectively avoiding troublesome embedding configurations (i.e., local minima) which can pose problems for non-convex optimization. HARP works by finding a smaller graph which approximates the global structure of its input. This simplified graph is used to learn a set of initial representations, which serve as good initializations for learning representations in the original, detailed graph. We inductively extend this idea by decomposing a graph into a series of levels, and then embed the hierarchy of graphs from the coarsest one to the original graph. HARP is a general meta-strategy to improve all of the state-of-the-art neural algorithms for embedding graphs, including DeepWalk, LINE, and Node2vec. Indeed, we demonstrate that applying HARP's hierarchical paradigm yields improved implementations for all three of these methods, as evaluated on classification tasks on real-world graphs such as DBLP, BlogCatalog, and CiteSeer, where we achieve a performance gain over the original implementations of up to 14% Macro F1.
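
Below is a minimal, hedged sketch of the coarsen-then-embed idea described above. It is not the authors' implementation: the edge-collapsing heuristic, the toy power-iteration embedding (standing in for DeepWalk/LINE/Node2vec), and all dimensions are illustrative assumptions.

# Sketch of a HARP-style hierarchy: coarsen, embed coarsest, prolong downward.
# Assumptions: networkx graphs, a placeholder embedding instead of DeepWalk/LINE/Node2vec.
import networkx as nx
import numpy as np

def coarsen(G):
    """Collapse a maximal matching of edges into super-nodes (simplified stand-in)."""
    mapping, used = {}, set()
    for u, v in G.edges():
        if u not in used and v not in used:
            mapping[u] = mapping[v] = u          # merge v into u
            used.update((u, v))
    for n in G.nodes():
        mapping.setdefault(n, n)                 # unmatched nodes map to themselves
    H = nx.Graph()
    H.add_nodes_from(set(mapping.values()))
    H.add_edges_from((mapping[u], mapping[v]) for u, v in G.edges()
                     if mapping[u] != mapping[v])
    return H, mapping

def toy_embed(G, dim, init=None):
    """Placeholder embedding: a few power-iteration smoothing steps over the adjacency."""
    nodes = list(G.nodes())
    idx = {n: i for i, n in enumerate(nodes)}
    A = nx.to_numpy_array(G, nodelist=nodes)
    X = np.random.randn(len(nodes), dim) if init is None else init
    for _ in range(10):
        X = A @ X
        X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-9
    return {n: X[idx[n]] for n in nodes}

def harp_style_embed(G, dim=16, levels=3):
    # 1) Build a hierarchy of progressively coarser graphs.
    hierarchy, maps = [G], []
    for _ in range(levels):
        H, m = coarsen(hierarchy[-1])
        hierarchy.append(H)
        maps.append(m)
    # 2) Embed the coarsest graph, then prolong the embeddings level by level
    #    as initializations for the next finer graph.
    emb = toy_embed(hierarchy[-1], dim)
    for G_fine, m in zip(reversed(hierarchy[:-1]), reversed(maps)):
        init = np.array([emb[m[n]] for n in G_fine.nodes()])
        emb = toy_embed(G_fine, dim, init=init)
    return emb

emb = harp_style_embed(nx.karate_club_graph())   # toy end-to-end run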


RIPML: A Restricted Isometry Property-Based Approach to Multilabel Learning

AAAI Conferences

The multilabel learning problem with a large number of labels, features, and data points has generated tremendous interest recently. A recurring theme of these problems is that only a few labels are active in any given data point as compared to the total number of labels. However, only a small number of existing works take direct advantage of this inherent extreme sparsity in the label space. By virtue of the Restricted Isometry Property (RIP), satisfied by many random ensembles, we propose a novel procedure for multilabel learning known as RIPML. During the training phase of RIPML, labels are projected onto a random low-dimensional subspace, followed by solving a least-squares problem in this subspace. Inference is done by a k-nearest neighbor (kNN) based approach. We demonstrate the effectiveness of RIPML by conducting extensive simulations and comparing results with state-of-the-art linear dimensionality-reduction-based approaches.
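
The following is a small illustrative sketch of the recipe the abstract describes: project labels onto a random low-dimensional subspace, solve a least-squares problem there, and infer via kNN. The Gaussian projection, plain least-squares solver, and all dimensions are assumptions, not the authors' exact procedure.

# Toy RIPML-style pipeline: random label projection + least squares + kNN inference.
import numpy as np

rng = np.random.default_rng(0)
n, d, L, m, k = 500, 50, 200, 20, 5           # samples, features, labels, projected dim, kNN size

X = rng.normal(size=(n, d))                    # feature matrix
Y = (rng.random((n, L)) < 0.02).astype(float)  # sparse binary label matrix

# Training: project labels to a random low-dimensional subspace (RIP-style ensemble),
# then solve a least-squares problem from features to projected labels.
Phi = rng.normal(size=(m, L)) / np.sqrt(m)     # random projection matrix
Z = Y @ Phi.T                                  # projected labels, shape (n, m)
W, *_ = np.linalg.lstsq(X, Z, rcond=None)      # regression weights, shape (d, m)

# Inference: map a query into the subspace, then vote over the k nearest
# training points' original labels.
x_query = rng.normal(size=(1, d))
z_pred = x_query @ W                           # predicted point in the subspace
dists = np.linalg.norm(Z - z_pred, axis=1)
neighbors = np.argsort(dists)[:k]
scores = Y[neighbors].mean(axis=0)             # label scores from the kNN vote
predicted_labels = np.flatnonzero(scores >= 0.5)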


Leveraging Browsing Patterns for Topic Discovery and Photostream Recommendation

AAAI Conferences

In photo-sharing websites and in social networks, photographs are most often browsed as a sequence: users who view a photo are likely to click on those that follow. The sequences of photos (which we call photostreams), as opposed to individual images, can therefore be considered to be very important content units in their own right. In spite of their importance, those sequences have received little attention even though they are at the core of how people consume image content. In this paper, we focus on photostreams. First, we perform an analysis of a large dataset of user logs containing over 100 million pageviews, examining navigation patterns between photostreams. Based on observations from the analysis, we build a stream transition graph to analyze common stream topic transitions (e.g., users often view "train" photostreams followed by "firetruck" photostreams). We then implement two stream recommendation algorithms, based on collaborative filtering and on photo tags, and report the results of a user study involving 40 participants. Our analysis yields interesting insights into how people navigate between photostreams, while the results of the user study provide useful feedback for evaluating the performance and characteristics of the recommendation algorithms.
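
As a small illustration of the stream transition graph mentioned above, the sketch below counts topic-to-topic transitions in toy browsing sessions; the log format and topic labels are assumptions, not the paper's data.

# Build a tiny topic-transition count graph from per-user photostream viewing sessions.
from collections import defaultdict

# (user, ordered list of topics of the photostreams they viewed) -- illustrative data
sessions = [
    ("u1", ["train", "train", "firetruck", "airplane"]),
    ("u2", ["sunset", "beach", "sunset"]),
    ("u3", ["train", "firetruck"]),
]

transitions = defaultdict(int)
for _, topics in sessions:
    for src, dst in zip(topics, topics[1:]):
        if src != dst:                          # keep only cross-topic moves
            transitions[(src, dst)] += 1

# Most common topic transitions, e.g. ("train", "firetruck") -> 2
for (src, dst), count in sorted(transitions.items(), key=lambda kv: -kv[1]):
    print(f"{src} -> {dst}: {count}")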


Fairness and Welfare Through Redistribution When Utility Is Transferable

AAAI Conferences

We join the goals of two large and related fields of research in group decision-making that have historically had little contact: fair division and efficient mechanism design with monetary payments. To do this we adopt the standard mechanism design paradigm where utility is assumed to be quasilinear and thus transferable across agents. We generalize the traditional binary criteria of envy-freeness, proportionality, and efficiency (welfare) to measures of degree that range between 0 and 1. We demonstrate that in canonical fair division settings, under any allocatively efficient mechanism, the worst-case welfare rate is 0 and the disproportionality rate is 1; in other words, the worst-case results are as bad as possible. This strongly motivates an average-case analysis. We then set as our goal the identification of a mechanism that achieves high welfare, low envy, and low disproportionality in expectation across a spectrum of fair division settings. We establish that the VCG mechanism is not a satisfactory candidate, but the redistribution mechanism of [Bailey, 1997; Cavallo, 2006] is.
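
For readers unfamiliar with the redistribution mechanism cited above, the sketch below illustrates the Bailey/Cavallo idea in the simplest single-item setting: run VCG, then rebate to each agent 1/n of the VCG revenue that would arise if that agent were absent. The code and values are illustrative, not the paper's construction.

# Single-item VCG plus Bailey/Cavallo rebates (a hedged, illustrative sketch).
def vcg_single_item(values):
    """Return (winner index, payment = second-highest value)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    winner = order[0]
    payment = values[order[1]] if len(values) > 1 else 0.0
    return winner, payment

def bailey_cavallo_rebates(values):
    """Each agent receives 1/n of the VCG revenue in the economy without them."""
    n = len(values)
    rebates = []
    for i in range(n):
        others = values[:i] + values[i + 1:]
        _, revenue_without_i = vcg_single_item(others)
        rebates.append(revenue_without_i / n)
    return rebates

values = [10.0, 8.0, 5.0, 1.0]                 # arbitrary example valuations
winner, payment = vcg_single_item(values)      # winner pays 8.0
rebates = bailey_cavallo_rebates(values)       # most of that 8.0 flows back to agents
print(winner, payment, rebates, payment - sum(rebates))  # small residual surplus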


Efficient Crowdsourcing With Stochastic Production

AAAI Conferences

A principal seeks production of a good within a limited time-frame with a hard deadline, after which any good procured has no value. There is inherent uncertainty in the production process, which in light of the deadline may warrant simultaneous production of multiple goods by multiple producers, despite there being no marginal value for extra goods beyond the maximum-quality good produced. This motivates a crowdsourcing model of procurement. We address efficient execution of such procurement from a social planner's perspective, taking into account and optimally balancing the value to the principal with the costs to producers (modeled as effort expenditure) while, crucially, contending with self-interest on the part of all players. A solution to this problem involves both an algorithmic aspect that determines an optimal effort level for each producer given the principal's value, and an incentive mechanism that achieves equilibrium implementation of the socially optimal policy despite the principal privately observing his value, producers privately observing their skill levels and effort expenditure, and all acting only to maximize their own individual welfare. In contrast to popular "winner-take-all" contests, the efficient mechanism we propose involves a payment to every producer that expends non-zero effort in the efficient policy.


Using Crowdsourcing to Improve Profanity Detection

AAAI Conferences

Profanity detection is often thought to be an easy task. However, past work has shown that current, list-based systems perform poorly. They fail to adapt to evolving profane slang and to identify profane terms that have been disguised or only partially censored (e.g., @ss, f$#%) or intentionally or unintentionally misspelled (e.g., biatch, shiiiit). For these reasons, they are easy to circumvent and have very poor recall. Second, they are a one-size-fits-all solution, assuming that the definition, use, and perception of what is profane or inappropriate hold across all contexts. In this article, we present work that attempts to move beyond list-based profanity detection systems by identifying the context in which profanity occurs. The proposed system uses a set of comments from a social news site labeled by Amazon Mechanical Turk workers for the presence of profanity. This system far surpasses the performance of list-based profanity detection techniques. The use of crowdsourcing in this task suggests an opportunity to build profanity detection systems tailored to sites and communities.
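
To make the contrast with list-based detection concrete, here is a hedged toy sketch (not the authors' system): a word-list matcher next to a character n-gram classifier trained on crowdsourced labels. The tiny in-line dataset, feature choice, and model are placeholders.

# Compare a naive word-list matcher with a classifier trained on labeled comments.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

comments = ["you are an @ss", "great article, thanks", "what a load of sh1t",
            "totally agree with this", "nice analysis", "this is biatch-level dumb"]
labels = [1, 0, 1, 0, 0, 1]                  # 1 = profane (e.g., labels from MTurk workers)

# Baseline: exact word-list lookup, which misses disguised spellings.
PROFANITY_LIST = {"ass", "shit"}
list_based = [int(any(tok in PROFANITY_LIST for tok in c.lower().split()))
              for c in comments]             # misses "@ss", "sh1t", "biatch-level"

# Learned model: character n-grams are more tolerant of obfuscated spellings.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(comments)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(vectorizer.transform(["f$#% this", "have a nice day"])))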


Who Does What on the Web: A Large-Scale Study of Browsing Behavior

AAAI Conferences

As the Web has become integrated into daily life, understanding how individuals spend their time online impacts domains ranging from public policy to marketing. It is difficult, however, to measure even simple aspects of browsing behavior via conventional methods---including surveys and site-level analytics---due to limitations of scale and scope. In part addressing these limitations, large-scale Web panel data are a relatively novel means for investigating patterns of Internet usage. In one of the largest studies of browsing behavior to date, we pair Web histories for 250,000 anonymized individuals with user-level demographics---including age, sex, race, education, and income---to investigate three topics. First, we examine how behavior changes as individuals spend more time online, showing that the heaviest users devote nearly twice as much of their time to social media relative to typical individuals. Second, we revisit the digital divide, finding that the frequency with which individuals turn to the Web for research, news, and healthcare is strongly related to educational background, but not as closely tied to gender and ethnicity. Finally, we demonstrate that browsing histories are a strong signal for inferring user attributes, including ethnicity and household income, a result that may be leveraged to improve ad targeting.


Social Media Is NOT that Bad! The Lexical Quality of Social Media

AAAI Conferences

There is a strong correlation between spelling errors and web text content quality. Using our lexical quality measure, based on a small corpus of spelling errors, we present an estimation of the lexical quality of the main Social Media sites. This paper presents an updated and complete analysis of the lexical quality of Social Media written in English and Spanish, including how lexical quality changes over time.
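
As a rough illustration of a spelling-error-based measure, the toy sketch below scores text by the fraction of tokens found in a small corpus of known misspellings; this is an assumption inspired by the abstract, not the authors' actual lexical quality measure.

# Toy lexical-quality proxy: share of tokens that are known misspellings.
import re

KNOWN_MISSPELLINGS = {"teh", "recieve", "definately", "alot"}   # tiny illustrative corpus

def lexical_quality(text: str) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 1.0
    errors = sum(tok in KNOWN_MISSPELLINGS for tok in tokens)
    return 1.0 - errors / len(tokens)        # 1.0 = no known misspellings

print(lexical_quality("I definately think teh article is great"))  # ~0.71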


A Kernel-Based Iterative Combinatorial Auction

AAAI Conferences

This paper describes an iterative combinatorial auction for single-minded bidders that offers modularity in the choice of price structure, drawing on ideas from kernel methods and the primal-dual paradigm of auction design. In our implementation, the auction is able to automatically detect, as the rounds progress, whether price expressiveness must be increased to clear the market. The auction also features a configurable step size which can be tuned to trade off between monotonicity in prices and the number of bidding rounds, with no impact on efficiency. An empirical evaluation against a state-of-the-art ascending-price auction demonstrates the performance gains that can be obtained in efficiency, revenue, and rounds to convergence through various configurations of our design.


Event Summarization Using Tweets

AAAI Conferences

Twitter has become exceedingly popular, with hundreds of millions of tweets being posted every day on a wide variety of topics. This has helped make real-time search applications possible, with leading search engines routinely displaying relevant tweets in response to user queries. Recent research has shown that a considerable fraction of these tweets are about "events," and the detection of novel events in the tweet stream has attracted a lot of research interest. However, very little research has focused on properly displaying this real-time information about events. For instance, the leading search engines simply display all tweets matching the queries in reverse chronological order. In this paper, we argue that for some highly structured and recurring events, such as sports, it is better to use more sophisticated techniques to summarize the relevant tweets. We formalize the problem of summarizing event tweets and give a solution based on learning the underlying hidden state representation of the event via Hidden Markov Models. In addition, through extensive experiments on real-world data we show that our model significantly outperforms some intuitive and competitive baselines.
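
The sketch below illustrates the general approach the abstract describes: model an event's tweet stream with an HMM over time windows, then pick a representative tweet per decoded segment. The windowed volume feature, the number of hidden states, and the use of hmmlearn's GaussianHMM are assumptions, not the paper's model.

# HMM over per-minute tweet volume, then one representative tweet per state segment.
import numpy as np
from hmmlearn import hmm

# (minute offset, tweet text) for a toy sports event
tweets = [(0, "kickoff!"), (1, "slow start"), (2, "midfield battle"),
          (10, "GOAL!!!"), (10, "what a strike"), (11, "replay of the goal"),
          (30, "halftime"), (31, "grabbing a snack"),
          (55, "red card!"), (55, "down to ten men"), (56, "huge decision")]

# Feature per one-minute window: tweet volume (a crude proxy for sub-events).
minutes = np.arange(0, max(t for t, _ in tweets) + 1)
counts = np.array([[sum(t == m for t, _ in tweets)] for m in minutes], dtype=float)

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50,
                        random_state=0)
model.fit(counts)
states = model.predict(counts)               # decoded hidden state per minute

# Summarize: take the first tweet from each contiguous run of a hidden state
# (a crude heuristic standing in for the paper's summarization step).
summary, prev_state = [], None
for minute, state in zip(minutes, states):
    if state != prev_state:
        first = next((txt for t, txt in tweets if t >= minute), None)
        if first:
            summary.append(first)
        prev_state = state
print(summary)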