AITopics | Accuracy


Accuracy


VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text

Hutto, C. J. (Georgia Institute of Technology) | Gilbert, Eric (Georgia Institute of Technology)

AAAI Conferences, Mar-23-2014

The inherent nature of social media content poses serious challenges to practical applications of sentiment analysis. We present VADER, a simple rule-based model for general sentiment analysis, and compare its effectiveness to eleven typical state-of-practice benchmarks including LIWC, ANEW, the General Inquirer, SentiWordNet, and machine learning oriented techniques relying on Naive Bayes, Maximum Entropy, and Support Vector Machine (SVM) algorithms. Using a combination of qualitative and quantitative methods, we first construct and empirically validate a gold-standard list of lexical features (along with their associated sentiment intensity measures) which are specifically attuned to sentiment in microblog-like contexts. We then combine these lexical features with consideration for five general rules that embody grammatical and syntactical conventions for expressing and emphasizing sentiment intensity. Interestingly, using our parsimonious rule-based model to assess the sentiment of tweets, we find that VADER outperforms individual human raters (F1 Classification Accuracy = 0.96 and 0.84, respectively), and generalizes more favorably across contexts than any of our benchmarks.

artificial intelligence, machine learning, natural language, (5 more...)

AAAI Conferences

Eighth International AAAI Conference on Weblogs and Social Media

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.80)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.80)
(3 more...)
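
To make the VADER abstract above more concrete, here is a minimal sketch of how a valence lexicon can be combined with a few intensity rules. It is not the VADER implementation: the lexicon entries, the booster increments, the three-token negation window, and the exclamation-mark bump are all hypothetical values chosen for illustration, and only three simplified rules are shown rather than the five the paper describes.

```python
import re

# Toy valence lexicon (hypothetical values, not VADER's validated word list).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "awful": -3.0}
# Degree modifiers and their (hypothetical) intensity increments.
BOOSTERS = {"very": 0.5, "extremely": 1.0}
NEGATIONS = {"not", "never", "no"}


def score(text: str) -> float:
    """Sum rule-adjusted valences of the lexicon words found in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        valence = LEXICON[tok]
        sign = 1.0 if valence > 0 else -1.0
        # Rule 1: a booster directly before the word strengthens it.
        prev = tokens[i - 1] if i > 0 else ""
        if prev in BOOSTERS:
            valence += sign * BOOSTERS[prev]
        # Rule 2: a negation within the three preceding tokens flips polarity.
        if any(w in NEGATIONS for w in tokens[max(0, i - 3):i]):
            valence = -valence
        total += valence
    # Rule 3: exclamation marks (capped at three) amplify the overall score.
    if total != 0:
        bump = min(text.count("!"), 3) * 0.25
        total += bump if total > 0 else -bump
    return total
```

In this sketch, `score("very good")` is larger than `score("good")`, and `score("not good")` is negative, which is the kind of behaviour a purely lexicon-based count would miss.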


Text-Based Twitter User Geolocation Prediction

Han, B., Cook, P., Baldwin, T.

Journal of Artificial Intelligence Research, Mar-20-2014

Geographical location is vital to geospatial applications like local search and event detection. In this paper, we investigate and improve on the task of text-based geolocation prediction of Twitter users. Previous studies on this topic have typically assumed that geographical references (e.g., gazetteer terms, dialectal words) in a text are indicative of its author's location. However, these references are often buried in informal, ungrammatical, and multilingual data, and are therefore non-trivial to identify and exploit. We present an integrated geolocation prediction framework and investigate which factors affect prediction accuracy. First, we evaluate a range of feature selection methods to obtain location indicative words. We then evaluate the impact of non-geotagged tweets, language, and user-declared metadata on geolocation prediction. In addition, we evaluate the impact of temporal variance on model generalisation, and discuss how users differ in terms of their geolocatability. We achieve state-of-the-art results for the text-based Twitter user geolocation task, and also provide the most extensive exploration of the task to date. Our findings provide valuable insights into the design of robust, practical text-based geolocation prediction systems.

accuracy, prediction, tweet, (12 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.4200

AI Access Foundation

10869


Country:

Europe > Austria > Vienna (0.14)
Asia > South Korea (0.14)
North America > United States > New York (0.04)
(42 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(3 more...)


The Gaussian Radon Transform and Machine Learning

Holmes, Irina, Sengupta, Ambar

arXiv.org Machine Learning, Mar-12-2014

There has been growing recent interest in probabilistic interpretations of kernel-based methods as well as learning in Banach spaces. The absence of a useful Lebesgue measure on an infinite-dimensional reproducing kernel Hilbert space is a serious obstacle for such stochastic models. We propose an estimation model for the ridge regression problem within the framework of abstract Wiener spaces and show how the support vector machine solution to such problems can be interpreted in terms of the Gaussian Radon transform.

artificial intelligence, hilbert space, machine learning, (12 more...)

arXiv.org Machine Learning

1310.4794

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)


Multi-label ensemble based on variable pairwise constraint projection

Li, Ping, Li, Hong, Wu, Min

arXiv.org Machine Learning, Mar-8-2014

Multi-label classification has attracted an increasing amount of attention in recent years. To this end, many algorithms have been developed to classify multi-label data in an effective manner. However, they usually do not consider the pairwise relations indicated by sample labels, which actually play important roles in multi-label classification. Inspired by this, we naturally extend the traditional pairwise constraints to the multi-label scenario via a flexible thresholding scheme. Moreover, to improve the generalization ability of the classifier, we adopt a boosting-like strategy to construct a multi-label ensemble from a group of base classifiers. To achieve these goals, this paper presents a novel multi-label classification framework named Variable Pairwise Constraint projection for Multi-label Ensemble (VPCME). Specifically, we take advantage of the variable pairwise constraint projection to learn a lower-dimensional data representation, which preserves the correlations between samples and labels. Thereafter, the base classifiers are trained in the new data space. For the boosting-like strategy, we employ both the variable pairwise constraints and the bootstrap steps to diversify the base classifiers. Empirical studies have shown the superiority of the proposed method in comparison with other approaches.

artificial intelligence, classifier, machine learning, (13 more...)

arXiv.org Machine Learning

doi: 10.1016/j.ins.2012.07.066

1403.1944

Country:

Europe (1.00)
North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)


Becoming More Robust to Label Noise with Classifier Diversity

Smith, Michael R., Martinez, Tony

arXiv.org Machine Learning, Mar-7-2014

It is widely known in the machine learning community that class noise can be (and often is) detrimental to inducing a model of the data. Many current approaches use a single, often biased, measurement to determine if an instance is noisy. A biased measure may work well on certain data sets, but it can also be less effective on a broader set of data sets. In this paper, we present noise identification using classifier diversity (NICD) -- a method for deriving a less biased noise measurement and integrating it into the learning process. To lessen the bias of the noise measure, NICD selects a diverse set of classifiers (based on their predictions of novel instances) to determine which instances are noisy. We examine NICD as a technique for filtering, instance weighting, and selecting the base classifiers of a voting ensemble. We compare NICD with several other noise handling techniques that do not consider classifier diversity on a set of 54 data sets and 5 learning algorithms. NICD significantly increases the classification accuracy over the other considered approaches and is effective across a broad set of data sets and learning algorithms.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

1403.1893

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.36)


Machine Learning at Scale

Izrailev, Sergei, Stanley, Jeremy M.

arXiv.org Machine Learning, Feb-25-2014

It takes skill to build a meaningful predictive model even with the abundance of implementations of modern machine learning algorithms and readily available computing resources. Building a model becomes challenging if hundreds of terabytes of data need to be processed to produce the training data set. In a digital advertising technology setting, we are faced with the need to build thousands of such models that predict user behavior and power advertising campaigns in a 24/7 chaotic real-time production environment. As data scientists, we also have to convince other internal departments critical to implementation success, our management, and our customers that our machine learning system works. In this paper, we present the details of the design and implementation of an automated, robust machine learning platform that impacts billions of advertising impressions monthly. This platform enables us to continuously optimize thousands of campaigns over hundreds of millions of users, on multiple continents, against varying performance objectives.

artificial intelligence, machine learning, matrix, (17 more...)

arXiv.org Machine Learning

1402.6076

Country: North America > United States > New York (0.28)

Genre: Research Report > Experimental Study (0.95)

Industry:

Marketing (1.00)
Information Technology > Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)


High Dimensional Semiparametric Scale-Invariant Principal Component Analysis

Han, Fang, Liu, Han

arXiv.org Machine Learning, Feb-18-2014

We propose a new high dimensional semiparametric principal component analysis (PCA) method, named Copula Component Analysis (COCA). The semiparametric model assumes that, after unspecified marginally monotone transformations, the distributions are multivariate Gaussian. COCA improves upon PCA and sparse PCA in three aspects: (i) It is robust to modeling assumptions; (ii) It is robust to outliers and data contamination; (iii) It is scale-invariant and yields more interpretable results. We prove that the COCA estimators obtain fast estimation rates and are feature selection consistent when the dimension is nearly exponentially large relative to the sample size. Careful experiments confirm that COCA outperforms sparse PCA on both synthetic and real-world datasets.

eigenvector, equation, spearman, (13 more...)

arXiv.org Machine Learning

1402.4507

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.67)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Principal Component Analysis (0.61)
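
The scale-invariance claimed in the COCA abstract above can be illustrated with a rank-based correlation estimate. The sketch below assumes the estimator is built from Spearman's rho mapped through 2 sin(pi/6 * rho), a standard transform for Gaussian copula models; treat that choice as an assumption about the method rather than a confirmed detail, and note that the eigen-decomposition step of PCA is not shown. The helper names are hypothetical, and the ranking helper assumes no tied values.

```python
import math


def ranks(values):
    """Return 1-based ranks of `values` (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))


def latent_correlation(a, b):
    """Map Spearman's rho to a Gaussian-copula correlation estimate."""
    return 2.0 * math.sin(math.pi / 6.0 * spearman(a, b))
```

Because only ranks enter the estimate, applying any strictly increasing transformation to a variable (for example `math.exp`) leaves `latent_correlation` unchanged, which is the sense in which such an estimator is scale-invariant.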


An Empirical Evaluation of Ranking Measures With Respect to Robustness to Noise

Berrar, D.

Journal of Artificial Intelligence Research, Feb-17-2014

Ranking measures play an important role in model evaluation and selection. Using both synthetic and real-world data sets, we investigate how different types and levels of noise affect the area under the ROC curve (AUC), the area under the ROC convex hull, the scored AUC, the Kolmogorov-Smirnov statistic, and the H-measure. In our experiments, the AUC was, overall, the most robust among these measures, thereby reinvigorating it as a reliable metric despite its well-known deficiencies. This paper also introduces a novel ranking measure, which is remarkably robust to noise yet conceptually simple.

experiment, noise, threshold, (14 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.4136

AI Access Foundation

10864


Country:

Europe > Austria > Vienna (0.14)
North America > United States > New York (0.04)
North America > United States > California > Orange County > Irvine (0.04)
(3 more...)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Data Science > Data Mining (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.94)
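
As a rough illustration of the kind of experiment the ranking-measures abstract above describes, the following sketch computes the AUC as the fraction of positive/negative pairs ranked correctly (ties count half) and provides a helper that injects label noise. The function names, the noise model (independent label flips), and the seed parameter are assumptions made for this example and are not taken from the paper.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count correctly ordered pairs; a tie contributes one half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def flip_labels(labels, rate, seed=0):
    """Flip each binary label independently with probability `rate`."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < rate else y for y in labels]
```

Comparing `auc(labels, scores)` with `auc(flip_labels(labels, rate), scores)` for increasing `rate` gives a simple view of how the measure degrades under class noise.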


Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

Duan, Hubert Haoyang

arXiv.org Machine Learning, Feb-3-2014

From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on genetic variations at the DNA base pair level, called Single-Nucleotide Polymorphisms (SNPs), collected from the Ontario Heart Genomics Study (OHGS). First, the thesis explains two commonly used supervised learning algorithms, the k-Nearest Neighbour (k-NN) and Random Forest classifiers, and includes a complete proof that the k-NN classifier is universally consistent in any finite dimensional normed vector space. Second, the thesis introduces two dimensionality reduction steps, Random Projections, a known feature extraction technique based on the Johnson-Lindenstrauss lemma, and a new method termed Mass Transportation Distance (MTD) Feature Selection for discrete domains. Then, this thesis compares the performance of Random Projections with the k-NN classifier against MTD Feature Selection and Random Forest, for predicting artery disease based on accuracy, the F-Measure, and area under the Receiver Operating Characteristic (ROC) curve. The comparative results demonstrate that MTD Feature Selection with Random Forest is vastly superior to Random Projections and k-NN. The Random Forest classifier is able to obtain an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS genetic dataset, when 3335 SNPs are selected by MTD Feature Selection for classification. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.

artificial intelligence, classifier, machine learning, (16 more...)

arXiv.org Machine Learning

1402.0459

Country:

North America > United States (0.67)
North America > Canada > Ontario (0.48)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)


Principled Graph Matching Algorithms for Integrating Multiple Data Sources

Zhang, Duo, Rubinstein, Benjamin I. P., Gemmell, Jim

arXiv.org Machine Learning, Feb-2-2014

This paper explores combinatorial optimization for problems of max-weight graph matching on multi-partite graphs, which arise in integrating multiple data sources. Entity resolution (the data integration problem of performing noisy joins on structured data) typically proceeds by first hashing each record into zero or more blocks, scoring pairs of records that are co-blocked for similarity, and then matching pairs of sufficient similarity. In the most common case of matching two sources, it is often desirable for the final matching to be one-to-one (a record may be matched with at most one other); members of the database and statistical record linkage communities accomplish such matchings in the final stage by weighted bipartite graph matching on similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores (that of being nearly deduped) and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is known to be NP-hard. Our two-fold algorithmic contributions approximate multi-partite max-weight matching: our first algorithm borrows optimization techniques common to Bayesian probabilistic inference; our second is a greedy approximation algorithm. In addition to a theoretical guarantee on the latter, we present comparisons on a real-world ER problem from Bing that is significantly larger than those typically found in the literature, on publication data, and on a series of synthetic problems. Our results quantify significant improvements due to exploiting multiple sources, which are made possible by global one-to-one constraints linking otherwise independent matching sub-problems. We also discover that our algorithms are complementary: one being much more robust under noise, and the other being simple to implement and very fast to run.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

1402.0282

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (0.94)
Information Technology (0.68)
Media > Film (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)
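
The one-to-one constraint described in the graph-matching abstract above can be sketched for the simpler two-source (bipartite) case with a generic greedy heuristic: take candidate pairs in order of decreasing similarity and accept a pair only if neither record is already matched. This is an illustrative sketch and not the paper's multi-partite algorithm; the function name, the score dictionary layout, and the threshold parameter are assumptions.

```python
def greedy_matching(scores, threshold=0.0):
    """Greedy one-to-one matching over scored record pairs.

    `scores` maps (left_record, right_record) to a similarity weight.
    Returns accepted (left, right, weight) triples, heaviest first.
    """
    matched_left, matched_right = set(), set()
    result = []
    # Visit candidate pairs from most to least similar.
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (left, right), weight in ordered:
        if weight < threshold:
            break  # every remaining pair is below the threshold
        if left in matched_left or right in matched_right:
            continue  # would violate the one-to-one constraint
        matched_left.add(left)
        matched_right.add(right)
        result.append((left, right, weight))
    return result
```

Greedy selection is fast but not always optimal: with the scores `{("a1", "b1"): 0.9, ("a1", "b2"): 0.8, ("a2", "b1"): 0.85, ("a2", "b2"): 0.1}` it commits to `("a1", "b1")` first, although pairing `a1` with `b2` and `a2` with `b1` has a higher total weight.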
