Liu, Si (Chinese Academy of Sciences) | Liu, Hairong (National University of Singapore) | Latecki, Longin Jan (Temple University) | Yan, Shuicheng (National University of Singapore) | Xu, Changsheng (China-Singapore Institute of Digital Media) | Lu, Hanqing (Chinese Academy of Sciences)
In this paper, we propose a novel method to select the most informative subset of features, which has little redundancy and very strong discriminating power. Our proposed approach automatically determines the optimal number of features and selects the best subset accordingly by maximizing the average pairwise informativeness, and thus has an obvious advantage over traditional filter methods. By relaxing the essential combinatorial optimization problem into a standard quadratic programming problem, the most informative feature subset can be obtained efficiently, and a strategy to dynamically compute the redundancy between feature pairs further accelerates our method by avoiding unnecessary computations of mutual information. As shown by extensive experiments, the proposed method successfully selects the most informative subset of features, and the obtained classification results significantly outperform the state-of-the-art results on most test datasets.
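The general recipe the abstract describes — score each feature pair by relevance minus redundancy (both via mutual information), then maximize the average pairwise informativeness over the probability simplex — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the histogram-based mutual-information estimator, the pair-scoring formula `rel[i] + rel[j] - red`, the replicator-dynamics QP solver, and the support threshold `0.5 / d` are all assumptions made for the sketch.

```python
import numpy as np

def mutual_info(a, b, bins=8):
    """Histogram-based estimate of mutual information I(a; b) in nats."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def select_features(X, y, iters=200):
    """Sketch of QP-based feature selection: maximize x^T W x on the simplex,
    where W[i, j] scores the informativeness of the feature pair (i, j)."""
    d = X.shape[1]
    # Relevance of each feature to the class label.
    rel = np.array([mutual_info(X[:, i], y) for i in range(d)])
    # Pairwise informativeness: joint relevance penalized by redundancy
    # (this particular formula is an assumption for the sketch).
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            red = mutual_info(X[:, i], X[:, j])
            W[i, j] = W[j, i] = rel[i] + rel[j] - red
    # Uniform shift keeps the maximizer on the simplex unchanged
    # (x sums to 1, so x^T (W + c) x = x^T W x + c) while making W non-negative.
    W = W - W.min()
    # Replicator dynamics: a standard iterative scheme for this class of QP.
    x = np.full(d, 1.0 / d)
    for _ in range(iters):
        g = W @ x
        denom = x @ g
        if denom <= 0:
            break
        x = x * g / denom
    # Keep features with non-trivial support (threshold is an assumption).
    return np.where(x > 0.5 / d)[0]
```

Because the solution lives on the simplex, the number of features with non-negligible mass emerges from the optimization itself, which is how the method can determine the subset size automatically rather than taking it as a parameter.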
Caragea, Cornelia (Pennsylvania State University) | Silvescu, Adrian (Naviance Inc.) | Kataria, Saurabh (Pennsylvania State University) | Caragea, Doina (Kansas State University) | Mitra, Prasenjit (Pennsylvania State University)
With the exponential increase in the number of documents available online, e.g., news articles, weblogs, scientific documents, effective and efficient classification methods are required in order to deliver the appropriate information to specific users or groups. The performance of document classifiers critically depends, among other things, on the choice of the feature representation. The commonly used "bag of words" representation can result in a large number of features. Feature abstraction helps reduce a classifier's input size by learning an abstraction hierarchy over the set of words. A cut through the hierarchy specifies a compressed model, where the nodes on the cut represent abstract features. In this paper, we compare feature abstraction with two other methods for dimensionality reduction, i.e., feature selection and Latent Dirichlet Allocation (LDA). Experimental results on two data sets of scientific publications show that classifiers trained using abstract features significantly outperform those trained using features that have the highest average mutual information with the class, and those trained using the topic distribution and topic words output by LDA. Furthermore, we propose an approach to automatic identification of a cut in order to trade off the complexity of classifiers against their performance. Our results demonstrate the feasibility of the proposed approach.
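The core idea — build a hierarchy over words, pick a cut, and let each node on the cut act as one abstract feature whose count is the sum of its member words' counts — can be illustrated with a small sketch. This is not the paper's algorithm: clustering words by their smoothed class-conditional distributions with Ward linkage, and taking the cut as a flat clustering at a fixed number of clusters, are simplifying assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def abstract_features(counts, y, n_abstract=10):
    """Sketch of feature abstraction for document classification.
    counts: (n_docs, vocab) term-count matrix; y: (n_docs,) class labels.
    Returns the abstracted design matrix and each word's cluster label."""
    classes = np.unique(y)
    # Characterize each word by how its occurrences distribute over classes.
    per_class = np.vstack([counts[y == c].sum(axis=0) for c in classes]).T
    per_class = per_class + 1.0  # Laplace smoothing to avoid zero rows
    p = per_class / per_class.sum(axis=1, keepdims=True)
    # Agglomerative hierarchy over words; a "cut" through it is obtained
    # here as a flat clustering into at most n_abstract clusters.
    Z = linkage(p, method="ward")
    labels = fcluster(Z, t=n_abstract, criterion="maxclust")  # labels in 1..k
    # Each abstract feature is the total count of the words in its cluster,
    # so the compressed model preserves every document's total mass.
    X_abs = np.zeros((counts.shape[0], n_abstract))
    for j, lab in enumerate(labels):
        X_abs[:, lab - 1] += counts[:, j]
    return X_abs, labels
```

Moving the cut down the hierarchy yields more, finer-grained abstract features (a more complex classifier); moving it up yields fewer, coarser ones — which is the complexity/performance trade-off the automatic cut-identification approach navigates.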
The Systems Technologies Lab is part of the Imagination Lab within Adobe Research - a team of world-class researchers in machine learning, data mining, econometrics and social networking. Beyond Adobe's traditional strengths in media technologies, this team is focused on research opportunities in areas related to digital marketing and media optimization (e.g., Adobe's Digital Marketing Cloud). This researcher position will focus on discovering innovative approaches to leveraging advanced statistical and econometric modeling techniques to perform marketing mix modeling research on multiple massive datasets. Additionally, this researcher will identify key measures, approaches and methodologies for measuring traditional and digital (online, interactive) marketing campaign success against business objectives.
IoT and data science are intertwined, as sensor data from wearables, transportation or healthcare systems, manufacturing and engineering needs to be collected, refined, aggregated and processed by automated data science systems to deliver insights and value. Here we list a few popular articles related to IoT, published this year on IotCentral.io. A number of IoT articles can also be found on Data Science Central: click here to access the articles listed in the picture below, and many more.
This is a new series, featuring great content from our top contributors. Some of these articles are rather technical in nature, but many are business-oriented and written in plain English. The entire series consists of about 120 articles. We intend to publish a new set every two weeks or so. This is the first edition. To read more articles from the same author, read one of his/her articles and click on his/her profile picture to access the full list. Some of these articles are curated or posted as guest blogs.