Clustering
Markov models for ocular fixation locations in the presence and absence of colour
Kashlak, Adam B., Devane, Eoin, Dietert, Helge, Jackson, Henry
We propose to model the fixation locations of the human eye when observing a still image by a Markovian point process in R^2. Our approach is data driven, using k-means clustering of the fixation locations to identify distinct salient regions of the image, which in turn correspond to the states of our Markov chain. Bayes factors are computed as a model selection criterion to determine the number of clusters. Furthermore, we demonstrate that the behaviour of the human eye differs from this model when colour information is removed from the given image.
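As a rough illustration of how cluster labels can serve as Markov chain states, the sketch below estimates a transition matrix from a sequence of labels. It assumes the k-means step has already assigned each fixation to a cluster; the function name `transition_matrix` and the label sequence are hypothetical and are not taken from the paper.

```python
def transition_matrix(states, n_states):
    """Estimate transition probabilities from consecutive state pairs."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left keeps an all-zero row instead of a guess.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical cluster labels of successive fixations (3 salient regions).
labels = [0, 0, 1, 2, 1, 0, 2, 2, 1]
P = transition_matrix(labels, 3)
for row in P:
    print([round(p, 3) for p in row])
```

Each row of `P` is the empirical distribution of the next salient region given the current one. This is a plain maximum-likelihood count estimate, not necessarily the estimator the authors use.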
The DARPA Twitter Bot Challenge
Subrahmanian, V. S., Azaria, Amos, Durst, Skylar, Kagan, Vadim, Galstyan, Aram, Lerman, Kristina, Zhu, Linhong, Ferrara, Emilio, Flammini, Alessandro, Menczer, Filippo, Stevens, Andrew, Dekhtyar, Alexander, Gao, Shuyang, Hogg, Tad, Kooti, Farshad, Liu, Yan, Varol, Onur, Shiralkar, Prashant, Vydiswaran, Vinod, Mei, Qiaozhu, Hwang, Tim
A number of organizations ranging from terrorist groups such as ISIS to politicians and nation states reportedly conduct explicit campaigns to influence opinion on social media, posing a risk to democratic processes. There is thus a growing need to identify and eliminate "influence bots" - realistic, automated identities that illicitly shape discussion on sites like Twitter and Facebook - before they get too influential. Spurred by such events, DARPA held a 4-week competition in February/March 2015 in which multiple teams supported by the DARPA Social Media in Strategic Communications program competed to identify a set of previously identified "influence bots" serving as ground truth on a specific topic within Twitter. Past work regarding influence bots often has difficulty supporting claims about accuracy, since there is limited ground truth (though some exceptions do exist [3,7]). However, with the exception of [3], no past work has looked specifically at identifying influence bots on a specific topic. This paper describes the DARPA Challenge and the methods used by the three top-ranked teams.
Michael Lane's Homepage
The final homework assignment for CS545 Machine Learning was to implement a K-means clustering algorithm to cluster and classify the OptDigits data. The raw data looks something like the figures to the left. Each instance is a field of 0's in which some 0's have been flipped to 1's so that the image is recognizable (to humans) as a handwritten digit. For the K-means classifier, we ran two experiments. The first experiment used 10 centroids (one per digit); the second used 30 centroids to see whether it could find clusters in which the handwritten digits differed enough to be noticeable.
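One common way to turn K-means into a classifier is to give each centroid the majority label of the training points assigned to it, then classify a new point by its nearest centroid. The sketch below shows that idea on a toy 2-D data set. The labeling rule, the function names, the first-k initialization, and the data are all assumptions for illustration, not details of the assignment.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm, initialized from the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iters):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids


def label_centroids(points, labels, centroids):
    """Give each centroid the majority label of the points assigned to it."""
    votes = [Counter() for _ in centroids]
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


def predict(point, centroids, centroid_labels):
    """Classify a point by the label of its nearest centroid."""
    return centroid_labels[nearest(point, centroids)]


# Toy stand-in for the digit data: two well-separated groups with labels 0 and 1.
train_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
train_labels = [0, 0, 0, 1, 1, 1]

centroids = kmeans(train_points, k=2)
centroid_labels = label_centroids(train_points, train_labels, centroids)
print(predict((9, 9), centroids, centroid_labels))
print(predict((0.5, 0.5), centroids, centroid_labels))
```

With more centroids than classes, as in the 30-centroid experiment, several centroids can share one label, which lets a single class be covered by more than one cluster.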
SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream
Haque, Ahsanul (The University of Texas at Dallas) | Khan, Latifur (The University of Texas at Dallas) | Baron, Michael (The University of Texas at Dallas)
Most approaches to classifying data streams either divide the stream into fixed-size chunks or use gradual forgetting. Due to the evolving nature of data streams, finding a proper chunk size or choosing a forgetting rate without prior knowledge about the time-scale of change is not a trivial task. These approaches hence suffer from a trade-off between performance and sensitivity. Existing dynamic sliding window based approaches address this problem by tracking changes in classifier error rate, but are supervised in nature. We propose an efficient semi-supervised framework in this paper which uses change detection on classifier confidence to detect concept drifts and to determine chunk boundaries dynamically. It also addresses the concept evolution problem by detecting outliers having strong cohesion among themselves. Experimental results on benchmark and synthetic data sets show the effectiveness of the proposed approach.
Creating Images by Learning Image Semantics Using Vector Space Models
Heath, Derrall (Brigham Young University) | Ventura, Dan (Brigham Young University)
When dealing with images and semantics, most computational systems attempt to automatically extract meaning from images. Here we attempt to go the other direction and autonomously create images that communicate concepts. We present an enhanced semantic model that is used to generate novel images that convey meaning. We employ a vector space model and a large corpus to learn vector representations of words and then train the semantic model to predict word vectors that could describe a given image. Once trained, the model autonomously guides the process of rendering images that convey particular concepts. A significant contribution is that, because of the semantic associations encoded in these word vectors, we can also render images that convey concepts on which the model was not explicitly trained. We evaluate the semantic model with an image clustering technique and demonstrate that the model is successful in creating images that communicate semantic relationships.
Intrinsic and Extrinsic Evaluations of Word Embeddings
Zhai, Michael (Emory University) | Tan, Johnny (Emory University) | Choi, Jinho D. (Emory University)
In this paper, we first analyze the semantic composition of word embeddings by cross-referencing their clusters with the manual lexical database, WordNet. We then evaluate a variety of word embedding approaches by comparing their contributions to two NLP tasks. Our experiments show that the word embedding clusters give high correlations to the synonym and hyponym sets in WordNet, and give 0.88% and 0.17% absolute improvements in accuracy to named entity recognition and part-of-speech tagging, respectively.
Structure Aware L1 Graph for Data Clustering
Han, Shuchu (Stony Brook University) | Qin, Hong (Stony Brook University)
In graph-oriented machine learning research, the L1 graph is an efficient way to represent the connections of input data samples. Its construction algorithm is based on a numerical optimization motivated by Compressive Sensing theory. As a result, it is a nonparametric method, which is in high demand. However, information about the data, such as geometric structure and density distribution, is ignored. In this paper, we propose a Structure Aware (SA) L1 graph to improve the data clustering performance by capturing the manifold structure of input data. We use a local dictionary for each datum while calculating its sparse coefficients. The SA-L1 graph not only preserves the locality of data but also captures the geometric structure of data. The experimental results show that our new algorithm has better clustering performance than the L1 graph.
Teaching Big Data Analytics Skills with Intelligent Workflow Systems
Gil, Yolanda (University of Southern California)
We have designed an open and modular course for data science and big data analytics using a workflow paradigm that allows students to easily experience big data through a sophisticated yet easy to use instrument, namely an intelligent workflow system. A key aspect of this work is the use of semantic workflows to capture and reuse end-to-end analytic methods that experts would use to analyze big data, and the use of an intelligent workflow system to elaborate the workflow and manage its execution and resulting datasets. Through exposure to big data analytics in a workflow framework, students will be able to get first-hand experience with a breadth of big data topics, including multi-step data analytic and statistical methods, software reuse and composition, parallel distributed programming, and high-end computing. In addition, students learn about a range of topics in AI, including semantic representations and ontologies, machine learning, natural language processing, and image analysis.
Video Semantic Clustering with Sparse and Incomplete Tags
Wang, Jingya (Queen Mary University of London) | Zhu, Xiatian (Queen Mary University of London) | Gong, Shaogang (Queen Mary University of London)
Clustering tagged videos into semantic groups is important but challenging due to the need for jointly learning correlations between heterogeneous visual and tag data. The task is made more difficult by inherently sparse and incomplete tag labels. In this work, we develop a method for accurately clustering tagged videos based on a novel Hierarchical-Multi-Label Random Forest model capable of correlating structured visual and tag information. Specifically, our model exploits hierarchically structured tags of different abstractness of semantics and multiple tag statistical correlations, thus discovers more accurate semantic correlations among different video data, even with highly sparse/incomplete tags.
Decentralized Robust Subspace Clustering
Liu, Bo (Rutgers, The State University of New Jersey) | Yuan, Xiao-Tong (Nanjing University of Information Science and Technology) | Yu, Yang (Rutgers, The State University of New Jersey) | Liu, Qingshan (Nanjing University of Information Science and Technology) | Metaxas, Dimitris N. (Rutgers, The State University of New Jersey)
We consider the problem of subspace clustering using the SSC (Sparse Subspace Clustering) approach, which has several desirable theoretical properties and has been shown to be effective in various computer vision applications. We develop a large scale distributed framework for the computation of SSC via an alternating direction method of multipliers (ADMM) algorithm. The proposed framework solves SSC in column blocks and only involves parallel multivariate Lasso regression subproblems and sample-wise operations. This appealing property allows us to allocate multiple cores/machines for the processing of individual column blocks. We evaluate our algorithm on a shared-memory architecture. Experimental results on real-world datasets confirm that the proposed block-wise ADMM framework is substantially more efficient than its matrix counterpart used by SSC, without sacrificing accuracy. Moreover, our approach is directly applicable to decentralized neighborhood selection for Gaussian graphical models structure estimation.
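To see why SSC splits into Lasso subproblems, it may help to write out a commonly used noise-tolerant form of the SSC program. The abstract does not state the exact objective the authors solve, so the formulation below, including the trade-off parameter \lambda, is an assumption. Here X is the data matrix with one sample x_i per column and C is the matrix of self-representation coefficients.

```latex
\min_{C \in \mathbb{R}^{n \times n}} \; \|C\|_1 + \frac{\lambda}{2}\,\|X - XC\|_F^2
\quad \text{s.t.} \quad \operatorname{diag}(C) = 0

% Both terms and the constraint are sums over columns, so column i of C solves
\min_{c_i \in \mathbb{R}^{n}} \; \|c_i\|_1 + \frac{\lambda}{2}\,\|x_i - X c_i\|_2^2
\quad \text{s.t.} \quad c_{ii} = 0
```

Because the columns do not interact in this form, any group of columns can be handled as an independent multivariate Lasso problem, which is consistent with the column-block processing described above.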