AITopics | Information Technology

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, by using a general inequality $\mathcal{S}^2\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often $\mathcal{R}\geq \frac{\mathcal{S}}{z-\mathcal{S}}$ holds where $z$ is only slightly larger than 2 (e.g., $z\leq 2.1$). Our restricted worst case analysis by assuming $\frac{\mathcal{S}}{z-\mathcal{S}}\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$ shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.

artificial intelligence, minhash, text processing, (19 more...)

arXiv.org Machine Learning

1407.4416

Country:

Europe > Spain (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.36)

Add feedback

Collaborative Filtering Ensemble for Personalized Name Recommendation

Coma-Puig, Bernat, Diaz-Aviles, Ernesto, Nejdl, Wolfgang

arXiv.org Artificial IntelligenceJul-16-2014

Out of thousands of names to choose from, picking the right one for your child is a daunting task. In this work, our objective is to help parents making an informed decision while choosing a name for their baby. We follow a recommender system approach and combine, in an ensemble, the individual rankings produced by simple collaborative filtering algorithms in order to produce a personalized list of names that meets the individual parents' taste. Our experiments were conducted using real-world data collected from the query logs of 'nameling' (nameling.net), an online portal for searching and exploring names, which corresponds to the dataset released in the context of the ECML PKDD Discover Challenge 2013. Our approach is intuitive, easy to implement, and features fast training and prediction steps.

artificial intelligence, ensemble, social media, (15 more...)

arXiv.org Artificial Intelligence

1407.4832

Country:

Europe (0.46)
North America > United States > Virginia (0.14)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)

Add feedback

Modeling and Predicting Popularity Dynamics via Reinforced Poisson Processes

Shen, Huawei (Chinese Academy of Sciences) | Wang, Dashun (IBM Thomas J. Watson Research Center) | Song, Chaoming (University of Miami) | Barabási, Albert-László (Northeastern University)

AAAI ConferencesJul-14-2014

Indeed, to the best of our knowledge, we lack forgotten over time (Wu and Humberman 2007). For example, a probabilistic framework to model and predict the popularity videos on YouTube or stories on Digg gain their popularity dynamics of individual items. The reason behind this is by striving for views or votes (Szabo and Huberman partly illustrated in Figure 1, suggesting that the dynamical 2010); papers increase their visibility by competing for citations processes governing individual items appear too noisy to be from new papers (Ren et al. 2010; Wang, Song, and amenable to quantification. Barabási 2013); tweets or Hashtags in Twitter become more In this paper, we model the stochastic popularity dynamics popular as being retweeted (Hong, Dan, and Davison 2011) using reinforced Poisson processes, capturing simultaneously and so do webpages as being attached by incoming hyperlinks three key ingredients: fitness of an item, characterizing (Ratkiewicz et al. 2010). An ability to predict the popularity its inherent competitiveness against other items; a general of individual items within a dynamically evolving system temporal relaxation function, corresponding to the aging not only probes our understanding of complex systems, in the ability to attract new attentions; and a reinforcement but also has important implications in a wide range of domains, mechanism, documenting the well-known "rich-get-richer" from marketing and traffic control to policy making phenomenon. The benefit of the proposed model is threefold: and risk management. Despite recent advances of empirical (1) It models the arrival process of individual attentions methods, we lack a general modeling framework to predict directly in contrast to relying on aggregated popularity the popularity of individual items within a complex evolving time series; (2) As a generative probabilistic model, it can be system.

bayesian inference, popularity dynamic, survey article, (18 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country: North America > United States > Massachusetts (0.14)

Industry:

Information Technology (0.48)
Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.69)

Add feedback

On Dataless Hierarchical Text Classification

Song, Yangqiu (University of Illinois at Urbana-Champaign) | Roth, Dan (University of Illinois at Urbana-Champaign)

AAAI ConferencesJul-14-2014

In this paper, we systematically study the problem of dataless hierarchical text classification. Unlike standard text classification schemes that rely on supervised training, dataless classification depends on understanding the labels of the sought after categories and requires no labeled data. Given a collection of text documents and a set of labels, we show that understanding the labels can be used to accurately categorize the documents. This is done by embedding both labels and documents in a semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. We show that this scheme can be used to support accurate multiclass classification without any supervision. We study several semantic representations and show how to improve the classification using bootstrapping. Our results show that bootstrapped dataless classification is competitive with supervised classification with thousands of labeled examples.

artificial intelligence, classification, text processing, (20 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.86)

Industry:

Government (0.68)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)

Add feedback

Learning Deep Representations for Graph Clustering

Tian, Fei (University of Science and Technology of China) | Gao, Bin (Microsoft Research) | Cui, Qing (Tsinghua University) | Chen, Enhong (University of Science and Technology of China) | Liu, Tie-Yan (Microsoft Research)

AAAI ConferencesJul-14-2014

Recently deep learning has been successfully adopted in many applications such as speech recognition and image classification. In this work, we explore the possibility of employing deep learning in graph clustering. We propose a simple method, which first learns a nonlinear embedding of the original graph by stacked autoencoder, and then runs $k$-means algorithm on the embedding to obtain the clustering result. We show that this simple method has solid theoretical foundation, due to the similarity between autoencoder and spectral clustering in terms of what they actually optimize. Then, we demonstrate that the proposed method is more efficient and flexible than spectral clustering. First, the computational complexity of autoencoder is much lower than spectral clustering: the former can be linear to the number of nodes in a sparse graph while the latter is super quadratic due to eigenvalue decomposition. Second, when additional sparsity constraint is imposed, we can simply employ the sparse autoencoder developed in the literature of deep learning; however, it is non-straightforward to implement a sparse spectral method. The experimental results on various graph datasets show that the proposed method significantly outperforms conventional spectral clustering which clearly indicates the effectiveness of deep learning in graph clustering.

deep learning, neural network, spectral, (20 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country: Asia (0.14)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Low-Rank Tensor Learning with Discriminant Analysis for Action Classification and Image Recovery

Jia, Chengcheng (Northeastern University) | Zhong, Guoqiang (Ocean University of China) | Fu, Yun (Northeastern University)

AAAI ConferencesJul-14-2014

Tensor completion is an important topic in the area of image processing and computer vision research, which is generally built on extraction of the intrinsic structure of the tensor data. Drawing on this fact, action classification, relying heavily on the extracted features of high-dimensional tensors, may indeed benefit from tensor completion techniques. In this paper, we propose a low-rank tensor completion method for action classification, as well as image recovery. Since there may exist distortion and corruption in the tensor representations of video sequences, we project the tensors into a subspace, which contains the invariant structure of the tensors. In order to integrate useful supervisory information for classification, we adopt a discriminant analysis criterion to learn the projection matrices. The resulting multi-variate optimization problem can be effectively solved using the augmented Lagrange multiplier (ALM) algorithm. Experiments demonstrate that our method results with better accuracy compared with some other state-of-the-art low-rank tensor representation learning approaches on the MSR Hand Gesture 3D database and the MSR Action 3D database. By denoising the Multi-PIE face database, our experimental setup testifies the proposed method can also be employed to recover images.

artificial intelligence, optimization problem, tensor, (16 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country:

North America > United States (0.28)
Europe (0.28)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.34)

Add feedback

Lifetime Lexical Variation in Social Media

Liao, Lizi (Beijing Institute of Technology) | Jiang, Jing (Singapore Management University) | Ding, Ying (Singapore Management University) | Huang, Heyan (Beijing Institute of Technology) | Lim, Ee-Peng (Singapore Management University)

AAAI ConferencesJul-14-2014

As the rapid growth of online social media attracts a large number of Internet users, the large volume of content generated by these users also provides us with an opportunity to study the lexical variation of people of different ages. In this paper, we present a latent variable model that jointly models the lexical content of tweets and Twitter users’ ages. Our model inherently assumes that a topic has not only a word distribution but also an age distribution. We propose a Gibbs-EM algorithm to perform inference on our model. Empirical evaluation shows that our model can learn meaningful age-specific topics such as “school” for teenagers and “health” for older people. Our model can also be used for age prediction and performs better than a number of baseline methods.

social media, survey article, tweet, (19 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country: Asia > China (0.14)

Industry:

Information Technology > Services (1.00)
Health & Medicine (0.69)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)

Add feedback

Combining Heterogenous Social and Geographical Information for Event Recommendation

AAAI ConferencesJul-14-2014

With the rapid growth of event-based social networks (EBSNs) like Meetup, the demand for event recommendation becomes increasingly urgent. In EBSNs, event recommendation plays a central role in recommending the most relevant events to users who are likely to participate in. Different from traditional recommendation problems, event recommendation encounters three new types of information, i.e., heterogenous online+offline social relationships, geographical features of events and implicit rating data from users. Yet combining the three types of data for offline event recommendation has not been considered. Therefore, we present a Bayesian latent factor model that can unify these data for event recommendation. Experimental results on real-world data sets show the performance of our method.

artificial intelligence, recommendation, social media, (16 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country: North America > United States (0.47)

Industry: Information Technology (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Communications > Social Media (0.91)

Add feedback

Mechanism Design for Mobile Geo-Location Advertising

Gatti, Nicola (Politecnico di Milano) | Rocco, Marco (Politecnico di Milano) | Ceppi, Sofia (Microsoft Research) | Gerding, Enrico H. (University of Southampton)

AAAI ConferencesJul-14-2014

Mobile geo-location advertising, where mobile ads are targeted based on a user’s location, has been identified as a key growth factor for the mobile market. As with online advertising, a crucial ingredient for their success is the development of effective economic mechanisms. An important difference is that mobile ads are shown sequentially over time and information about the user can be learned based on their movements. Furthermore, ads need to be shown selectively to prevent ad fatigue. To this end, we introduce, for the first time, a user model and suitable economic mechanisms which take these factors into account. Specifically, we design two truthful mechanisms which produce an advertisement plan based on the user’s movements. One mechanism is allocatively efficient, but requires exponential compute time in the worst case. The other requires polynomial time, but is not allocatively efficient. Finally, we experimentally evaluate the trade off between compute time and efficiency of our mechanisms.

algorithm, artificial intelligence, optimization problem, (18 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country: Europe > Italy (0.14)

Industry:

Marketing (1.00)
Information Technology > Services (0.66)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Model-Based Reasoning (0.41)

Add feedback