Statistical Learning
Cross-Community Influence in Discussion Fora
Belák, Václav (National University of Ireland, Galway) | Lam, Samantha (National University of Ireland, Galway) | Hayes, Conor (National University of Ireland, Galway)
Online discussion fora have become an important cultural and business asset in the context of many services provided by both non-profit organizations and enterprises. In order to keep and eventually increase the value these systems deliver to their users, it is often necessary to moderate or even manage their dynamics. One way to do this efficiently is to focus primarily on the most influential actors in the system. However, identifying such users becomes increasingly hard in systems with a large, continuously growing user base. We show that analysis and explanation of influence on the cross-community level is a promising way to provide a coarse-grained picture of a potentially very large system, and that it may enable stakeholders to find groups through which the system can be efficiently influenced, or help them to identify and avoid activity considered malicious. To achieve this, we present a novel framework for cross-community influence analysis, which is evaluated on 10 years of data from the largest Irish online discussion system, Boards.ie.
MAV Stabilization using Machine Learning and Onboard Sensors
Yosinski, Jason, Bills, Cooper
Past automation work with miniature aerial vehicles (MAVs) at Cornell has produced interesting results [1] and presented additional challenges. During past projects, results have often been limited not by insufficiencies in planning algorithms, but by navigation errors stemming from inadequate control in the face of realistic, breezy operating environments. In many cases the MAVs will simply drift off the desired path (Figure 1). Thus, this project focuses on refining the basic motion of the same platform, and in particular, minimizing its drift. Our work focuses on reduction of low-frequency drift in GPS-denied environments. Similar work has been done to stabilize a quadrotor, some using neural networks [4] and some using adaptive-fuzzy control methods [5]. Though this research has produced promising results, these methods were demonstrated only in simulation, not via live testing.
Figure 1: Desired path vs. actual path due to drift.
Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences
Recht, Benjamin, Re, Christopher
Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress at theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least mean squares optimization, we formulate a noncommutative arithmetic-geometric mean inequality that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling. We demonstrate that this inequality holds for many classes of random matrices and for some pathological examples as well. We provide a deterministic worst-case bound on the discrepancy between the two sampling models, and explore some of the impediments to proving this inequality in full generality. We detail the consequences of this inequality for stochastic gradient descent and the randomized Kaczmarz algorithm for solving linear systems.
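To make the two sampling models concrete, the following is a minimal sketch (not the authors' code) of a Kaczmarz iteration on a small consistent linear system, run once with rows drawn with replacement and once with a fresh shuffle per epoch (without replacement). The function names, the uniform row sampling, the problem sizes, and the seeds are all illustrative assumptions; the sketch only shows how the two schemes differ, and makes no claim about which one converges faster.

```python
import random


def kaczmarz_step(x, row, target):
    """Project x onto the hyperplane {z : row . z = target}."""
    residual = target - sum(r * xi for r, xi in zip(row, x))
    norm_sq = sum(r * r for r in row)
    return [xi + residual / norm_sq * r for xi, r in zip(x, row)]


def run_kaczmarz(A, b, epochs, with_replacement, rng):
    """Run `epochs` passes of m row projections each, starting from zero."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(epochs):
        if with_replacement:
            # Each of the m picks is an independent uniform draw (assumption:
            # uniform rather than norm-weighted sampling, for simplicity).
            order = [rng.randrange(m) for _ in range(m)]
        else:
            # A random permutation: every row is used exactly once per epoch.
            order = rng.sample(range(m), m)
        for i in order:
            x = kaczmarz_step(x, A[i], b[i])
    return x


def error(x, x_true):
    """Euclidean distance between the iterate and the true solution."""
    return sum((a - c) ** 2 for a, c in zip(x, x_true)) ** 0.5


def demo(seed=0, m=20, n=5, epochs=200):
    """Build a random consistent system and return both final errors."""
    rng = random.Random(seed)
    A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
    x_with = run_kaczmarz(A, b, epochs, True, random.Random(seed + 1))
    x_without = run_kaczmarz(A, b, epochs, False, random.Random(seed + 2))
    return error(x_with, x_true), error(x_without, x_true)


if __name__ == "__main__":
    err_with, err_without = demo()
    print("with replacement:   ", err_with)
    print("without replacement:", err_without)
```

Reducing `epochs` in `demo` and averaging over several seeds is one way to compare the two schemes empirically on a given problem.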
Comparing SVM and Naive Bayes classifiers for text categorization with Wikitology as knowledge enrichment
Hassan, Sundus, Rafi, Muhammad, Shaikh, Muhammad Shahid
The activity of labeling documents according to their content is known as text categorization. Many experiments have been carried out to enhance text categorization by adding background knowledge to the document using knowledge repositories like WordNet, Open Directory Project (ODP), Wikipedia and Wikitology. In our previous work, we carried out intensive experiments by extracting knowledge from Wikitology and evaluating the experiment on Support Vector Machine with 10-fold cross-validation. The results clearly indicate Wikitology is far better than other knowledge bases. In this paper we compare Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers under text enrichment through Wikitology. We validated results with 10-fold cross-validation and showed that NB gives an improvement of +28.78%, whereas SVM gives an improvement of +6.36%, when compared with baseline results. The Naïve Bayes classifier is the better choice when enrichment through an external knowledge base is used.
Vector-valued Reproducing Kernel Banach Spaces with Applications to Multi-task Learning
The purpose of this paper is to establish the notion of vector-valued reproducing kernel Banach spaces and demonstrate its applications to multi-task machine learning. Built on the theory of scalar-valued reproducing kernel Hilbert spaces (RKHS) [3], kernel methods have been proven successful in single task machine learning [10, 14, 29, 30, 33]. Multi-task learning where the unknown target function to be learned from finite sample data is vector-valued appears more often in practice. References [13, 25] proposed the development of kernel methods for learning multiple related tasks simultaneously. The mathematical foundation used there was the theory of vector-valued RKHS [5, 27].
The Future of Search and Discovery in Big Data Analytics: Ultrametric Information Spaces
Murtagh, Fionn, Contreras, Pedro
Under the heading of "Addressing the big data challenge", the European 7th Framework Programme sees the issue thus (see INFSO, 2012): "Recent industry reports detail how data volumes are growing at a faster rate than our ability to interpret and exploit them for innovative ICT applications, for decision support, planning, monitoring, control and interaction. This includes unstructured data types such as video, audio, images and free text as well as structured data types such as database records, sensor readings and 3D. While each of these types requires some specific form of processing and analytics, many of the general principles for managing and storing them at extreme scales are common across all of them." Analytics tool capability is called for, to address these burgeoning issues in the data intensive industries, to support "effective policy making and implementation" of public bodies resulting in "significant annual savings from Big Data applications", and also to exploit open, linked data - "foster the reuse of public sector information and strengthen other open data activities linked to commercial exploitation." The "big data" marketplace is stated to be potentially worth approximately USD 600 billion. To address the challenges of search and discovery in massive and complex data sets and data flows, it is our contention in this work that we must move to an appropriate topology - to an appropriate framework such that computation is greatly facilitated. Our work is all about empowering those who are involved in data analytics, through clustering and related algorithms, to face these new challenges. Scalability and interactivity are two of the performance issues that follow directly from clustering algorithms, for search, retrieval and discovery, that are of linear computational complexity or better (logarithmic, or constant).
Bregman divergence as general framework to estimate unnormalized statistical models
Gutmann, Michael, Hirayama, Jun-ichiro
We show that the Bregman divergence provides a rich framework to estimate unnormalized statistical models for continuous or discrete random variables, that is, models which do not integrate or sum to one, respectively. We prove that recent estimation methods such as noise-contrastive estimation, ratio matching, and score matching belong to the proposed framework, and explain their interconnection based on supervised learning. Further, we discuss the role of boosting in unsupervised learning.
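For readers unfamiliar with the term, here is a minimal sketch of the textbook finite-dimensional Bregman divergence, D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>, for a convex function phi. It is not the estimation framework of the paper, which applies the divergence in a more general setting; the helper names below are hypothetical and serve only to show two well-known special cases.

```python
import math


def bregman(phi, grad_phi, p, q):
    """Bregman divergence phi(p) - phi(q) - <grad phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner


def sq_norm(v):
    """phi(v) = ||v||^2, which yields the squared Euclidean distance."""
    return sum(x * x for x in v)


def sq_norm_grad(v):
    return [2.0 * x for x in v]


def neg_entropy(v):
    """phi(v) = sum v_i log v_i, which yields the (generalized) KL divergence."""
    return sum(x * math.log(x) for x in v)


def neg_entropy_grad(v):
    return [math.log(x) + 1.0 for x in v]


if __name__ == "__main__":
    p, q = [0.2, 0.8], [0.5, 0.5]
    print("squared Euclidean:", bregman(sq_norm, sq_norm_grad, p, q))
    print("KL divergence:    ", bregman(neg_entropy, neg_entropy_grad, p, q))
```

Because `p` and `q` both sum to one in this usage, the second divergence coincides with the ordinary Kullback-Leibler divergence KL(p || q).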
Sparse Topical Coding
Such relaxations allow STC to: 1) directly control the sparsity of inferred representations by using sparsity-inducing regularizers; 2) be seamlessly integrated with a convex error function (e.g., SVM hinge loss) for supervised learning; and 3) be efficiently learned with a simply structured coordinate descent algorithm. Our results demonstrate the advantages of STC and supervised MedSTC in identifying topical meanings of words and improving classification accuracy and time efficiency.
Smoothing Multivariate Performance Measures
Zhang, Xinhua, Saha, Ankan, Vishwanathan, S. V. N.
A Support Vector Method for multivariate performance measures was recently introduced by Joachims (2005). The underlying optimization problem is currently solved using cutting plane methods such as SVM-Perf and BMRM. One can show that these algorithms converge to an $\epsilon$-accurate solution in $O(1/(\lambda\epsilon))$ iterations, where $\lambda$ is the trade-off parameter between the regularizer and the loss function. We present a smoothing strategy for multivariate performance scores, in particular precision/recall break-even point and ROCArea. When combined with Nesterov's accelerated gradient algorithm, our smoothing strategy yields an optimization algorithm which converges to an $\epsilon$-accurate solution in $O(\min\{1/\epsilon, 1/\sqrt{\lambda\epsilon}\})$ iterations. Furthermore, the cost per iteration of our scheme is the same as that of SVM-Perf and BMRM. Empirical evaluation on a number of publicly available datasets shows that our method converges significantly faster than cutting plane methods without sacrificing generalization ability.
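As a rough illustration of how the two iteration bounds quoted above scale, the sketch below evaluates both expressions for a given trade-off parameter and target accuracy. It ignores the constants hidden by the O-notation, so the numbers indicate only orders of magnitude, not actual iteration counts; the function names and the sample values are assumptions made for illustration.

```python
import math


def cutting_plane_bound(lam, accuracy):
    """Order of the cutting plane bound: 1 / (lambda * accuracy)."""
    return 1.0 / (lam * accuracy)


def smoothed_bound(lam, accuracy):
    """Order of the smoothed bound: min{1/accuracy, 1/sqrt(lambda * accuracy)}."""
    return min(1.0 / accuracy, 1.0 / math.sqrt(lam * accuracy))


if __name__ == "__main__":
    # Hypothetical settings: weak regularization and a moderate accuracy target.
    lam, accuracy = 1e-4, 1e-3
    print("cutting plane:", cutting_plane_bound(lam, accuracy))
    print("smoothed:     ", smoothed_bound(lam, accuracy))
```

The gap between the two expressions widens as the trade-off parameter shrinks, since the first bound grows like the inverse of that parameter while the second is capped by the inverse of the accuracy.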
Hierarchical Maximum Margin Learning for Multi-Class Classification
When the number of classes is very large, designing accurate and efficient classifiers for multi-class classification becomes very challenging. Recent research has shown that class structure learning can greatly facilitate multi-class learning. In this paper, we propose a novel method to learn the class structure for multi-class classification problems. The class structure is assumed to be a binary hierarchical tree. To learn such a tree, we propose a maximum separating margin method to determine the child nodes of any internal node. The proposed method ensures that the two class groups represented by any two sibling nodes are maximally separable. In the experiments, we compare the accuracy and efficiency of the proposed method with other multi-class classification methods on real-world large-scale problems. The results show that the proposed method outperforms benchmark methods in terms of accuracy for most datasets and performs comparably with other class structure learning methods in terms of efficiency for all datasets.