Genre
Buried Utility Pipeline Mapping Based on Multiple Spatial Data Sources: A Bayesian Data Fusion Approach
Chen, Huanhuan (University of Leeds) | Cohn, Anthony G. (University of Leeds)
Statutory records of underground utility apparatus (such as pipes andcables) are notoriously inaccurate, so street surveys are usually undertakenbefore road excavation takes place to minimize the extent and duration ofexcavation and for health and safety reasons. This involves the use ofsensors such as Ground Penetrating Radar (GPR). The GPR scans are thenmanually interpreted and combined with the expectations from the utilityrecords and other data such as surveyed manholes. The task is complex owingto the difficulty in interpreting the sensor data, and the spatialcomplexity and extent of under street assets. We explore the application ofAI techniques, in particular Bayesian data fusion (BDF), to automaticallygenerate maps of buried apparatus. Hypotheses about the spatial location anddirection of buried assets are extracted by identifying hyperbolae in theGPR scans. The spatial location of surveyed manholes provides further inputto the algorithm, as well as the prior expectations from the statutoryrecords. These three data sources are used to produce the most probable mapof the buried assets. Experimental results on real and simulated data setsare presented.
Recommender Systems from "Words of Few Mouths"
Zhang, Richong (University of Ottawa) | Tran, Thomas (University of Ottawa) | Mao, Yongyi (University of Ottawa)
This paper identifies a widely existing phenomenon in web data, which we call the "words of few mouths" phenomenon. This phenomenon, in the context of online reviews, refers to the case that a large fraction of the reviews are each voted only by very few users. We discuss the challenges of "words of few mouths" in the development of recommender systems based on users' opinions and advocate probabilistic methodologies to handle such challenges. We develop a probabilistic model and correspondingly a logistic regression based learning algorithm for review helpfulness prediction. Our experimental results indicate that the proposed model outperforms the current state-of-the-art algorithms not only in the presence of the "words of few mouths" phenomenon, but also in the absence of such phenomena.
Efficient Searching Top-k Semantic Similar Words
Yang, Zhenglu (The University of Tokyo) | Kitsuregawa, Masaru (The University of Tokyo)
Measuring the semantic meaning between words is an important issue because it is the basis for many applications, such as word sense disambiguation, document summarization, and so forth. Although it has been explored for several decades, most of the studies focus on improving the effectiveness of the problem, i.e., precision and recall. In this paper, we propose to address the efficiency issue, that given a collection of words, how to efficiently discover the top-k most semantic similar words to the query. This issue is very important for real applications yet the existing state-of-the-art strategies cannot satisfy users with reasonable performance. Efficient strategies on searching top-k semantic similar words are proposed. We provide an extensive comparative experimental evaluation demonstrating the advantages of the introduced strategies over the state-of-the-art approaches.
Predicting Epidemic Tendency through Search Behavior Analysis
Xu, Danqing (Tsinghua University) | Liu, Yiqun (Tsinghua University) | Zhang, Min (Tsinghua University) | Ma, Shaoping (Tsinghua University) | Cui, Anqi (Tsinghua University) | Ru, Liyun (Tsinghua University)
The possibility that influenza activity can be generally detected through search log analysis has been explored in recent years. However, previous studies have mainly focused on influenza, and little attention has been paid to other epidemics. With an analysis of web user behavior data, we consider the problem of predicting the tendency of hand-foot -and-mouth disease ย (HFMD), whose out-break in 2010 resulted in a great panic in China. In addi-tion to search queries, we consider usersโ interactions with search engines. Given the collected search logs, we cluster HFMD-related search queries, medical pages and news reports into the following sets: epidemic-related queries (ERQs), epidemic-related pages (ERPs) and ep-idemic-related news (ERNs). Furthermore, we count their own frequencies as different features, and we conduct a regression analysis with current HFMD occurrences. The experimental results show that these features exhibit good performances on both accuracy and timeliness.
Line Orthogonality in Adjacency Eigenspace with Application to Community Partition
Wu, Leting (University of North Carolina at Charlotte) | Ying, Xiaowei (University of North Carolina at Charlotte) | Wu, Xintao (University of North Carolina at Charlotte) | Zhou, Zhi-Hua (Nanjing University)
Different from Laplacian or normal matrix, the properties of the adjacency eigenspace received much less attention. Recent work showed that nodes projected into the adjacency eigenspace exhibit an orthogonal line pattern and nodes from the same community locate along the same line. In this paper, we conduct theoretical studies based on graph perturbation to demonstrate why this line orthogonality property holds in the adjacency eigenspace and why it generally disappears in the Laplacian and normal eigenspaces. Using the orthogonality property in the adjacency eigenspace, we present a graph partition algorithm, AdjCluster, which first projects node coordinates to the unit sphere and then applies the classic k-means to find clusters. Empirical evaluations on synthetic data and real-world social networks validate our theoretical findings and show the effectiveness of our graph partition algorithm.
A Wikipedia Based Semantic Graph Model for Topic Tracking in Blogosphere
Tang, Jintao (National University of Defense Technology) | Wang, Ting (National University of Defense Technology) | Lu, Qin (Hong Kong Polytechnic University) | Wang, Ji (National Laboratory for Parallel and Distributed Processing) | Li, Wenjie (Hong Kong Polytechnic University)
There are two key issues for information diffusion in blogosphere: (1) blog posts are usually short, noisy and contain multiple themes, (2) information diffusion through blogosphere is primarily driven by the โword-of-mouthโ effect, thus making topics evolve very fast. This paper presents a novel topic tracking approach to deal with these issues by modeling a topic as a semantic graph in which the semantic relatedness between terms are learned from Wikipedia. For a given topic/post, the named entities, Wikipedia concepts, and the semantic relatedness are extracted to generate the graph model. Noises are filtered out through a graph clustering algorithm. To handle topic evolution, the topic model is enriched by using Wikipedia as background knowledge. Furthermore, graph edit distance is used to measure the similarity between a topic and its posts. The proposed method is tested using real-world blog data. Experimental results show the advantage of the proposed method on tracking topics in short, noisy text.
Transfer Learning to Predict Missing Ratings Via Heterogeneous User Feedbacks
Pan, Weike (Hong Kong University of Science and Technology) | Liu, Nathan N. (Hong Kong University of Science and Technology) | Xiang, Evan W. (Hong Kong University of Science and Technology) | Yang, Qiang (Hong Kong University of Science and Technology)
Data sparsity due to missing ratings is a major challenge for collaborative filtering (CF) techniques in recommender systems. This is especially true for CF domains where the ratings are expressed numerically. We observe that, while we may lack the information in numerical ratings, we may have more data in the form of binary ratings. ย This is especially true when users can easily express themselves with their likes and dislikes for certain items. ย In this paper, we explore how to use the binary preference data expressed in the form of like/dislike to help reduce the impact of data sparsity of more expressive numerical ratings. ย We do this by transferring the rating knowledge from some auxiliary data source in binary form (that is, likes or dislikes), to a target numerical rating matrix. Our solution is to model both numerical ratings and like/dislike in a principled way, using a novel framework of Transfer by Collective Factorization (TCF). In particular, we construct the shared latent space collectively and learn the data-dependent effect separately. A major advantage of the TCF approach over previous collective matrix factorization (or bi-factorization) methods is that we are able to capture the data-dependent effect when sharing the data-independent knowledge, so as to increase the overall quality of knowledge transfer. Experimental results demonstrate the effectiveness of TCF at various sparsity levels as compared to several state-of-the-art methods.
Finding the Hidden Gems: Recommending Untagged Music
Horsburgh, Ben (Robert Gordon University) | Craw, Susan (Robert Gordon University) | Massie, Stewart (Robert Gordon University) | Boswell, Robin (Robert Gordon University)
We have developed a novel hybrid representation for Music Information Retrieval. Our representation is built by incorporating audio content into the tag space in a tag-track matrix, and then learning hybrid concepts using latent semantic analysis. We apply this representation to the task of music recommendation, using similarity-based retrieval from a query music track. We also develop a new approach to evaluating music recommender systems, which is based upon the relationship of users liking tracks. We are interested in measuring the recommendation quality, and the rate at which cold-start tracks are recommended. Our hybrid representation is able to outperform a tag-only representation, in terms of both recommendation quality and the rate that cold-start tracks are included as recommendations.
Fast Algorithm for Affinity Propagation
Fujiwara, Yasuhiro (Nippon Telegraph and Telephone Corporation) | Irie, Go (Nippon Telegraph and Telephone Corporation) | Kitahara, Tomoe (Nihon University)
Affinity Propagation is a state-of-the-art clustering method recently proposed by Frey and Dueck. It has been successfully applied to broad areas of computer science research because it has much better clustering performance than traditional clustering methods such as k -means. In order to obtain high quality sets of clusters, the original Affinity Propagation algorithm iteratively exchanges real-valued messages between all pairs of data points until convergence. However, this algorithm does not scale for large datasets because it requires quadratic CPU time in the number of data points to compute the messages. This paper proposes an efficient Affinity Propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of our approach is (1) to prune unnecessary message exchanges in the iterations and (2) to compute the convergence values of pruned messages after the iterations to determine clusters. Experimental evaluations on several different datasets demonstrate the effectiveness of our algorithm.
Leveraging Unlabeled Data to Scale Blocking for Record Linkage
Cao, Yunbo (Microsoft Research Asia) | Chen, Zhiyuan (Dalian University of Technology) | Zhu, Jiamin (Shanghai Jiao Tong University) | Yue, Pei (Microsoft Corporation) | Lin, Chin-Yew (Microsoft Research Asia) | Yu, Yong (Shanghai Jiao Tong University)
Record linkage is the process of matching records between two (or multiple) data sets that represent the same real-world entity. An exhaustive record linkage process involves computing the similarities between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and only comparing records within the same block. To be adaptive from domain to domain, one category of blocking technique formalizes 'construction of blocking scheme' as a machine learning problem. In the process of learning the best blocking scheme, previous learning-based techniques utilize only a set of labeled data. However, since the set of labeled data is usually not large enough to well characterize the unseen (unlabeled) data, the resultant blocking scheme may poorly perform on the unseen data by generating too many candidate matches. To address that, in this paper, we propose to utilize unlabeled data (in addition to labeled data) for learning blocking schemes. Our experimental results show that using unlabeled data in learning can remarkably reduce the number of candidate matches while keeping the same level of coverage for true matches.