Stopping Active Learning based on Predicted Change of F Measure for Text Classification
Altschuler, Michael, Bloodgood, Michael
During active learning, an effective stopping method allows users to limit the number of annotations, which reduces cost. In this paper, a new stopping method called Predicted Change of F Measure is introduced that attempts to give users an estimate of how much the model's performance is changing at each iteration. This stopping method can be applied with any base learner and is useful for reducing the data annotation bottleneck encountered when building text classification systems.
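To make the setting concrete, the following is a minimal sketch of a pool-based active learning loop with a stopping check based on an estimated change in F-measure between successive models. The estimate_f_change function is a stand-in proxy (fraction of stop-set predictions that changed), not the Predicted Change of F Measure formula from the paper, and the dataset, base learner, batch size, and threshold are illustrative choices.

    # Minimal sketch, assuming scikit-learn and a synthetic dataset.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    def estimate_f_change(prev_preds, curr_preds):
        # Placeholder proxy: fraction of stop-set predictions that changed between
        # successive models. The paper predicts the change in F-measure itself;
        # this stand-in only illustrates where such an estimate plugs in.
        return np.mean(prev_preds != curr_preds)

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    labeled = list(range(20))                              # initial annotated batch
    pool = [i for i in range(len(y)) if i not in labeled]  # unlabeled pool
    rng = np.random.RandomState(1)
    stop_set = rng.choice(pool, size=500, replace=False)   # unlabeled stop set

    prev_preds, batch_size, threshold = None, 20, 0.005
    for iteration in range(50):
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        curr_preds = model.predict(X[stop_set])
        if prev_preds is not None and estimate_f_change(prev_preds, curr_preds) < threshold:
            print("Stopping at iteration", iteration)
            break
        prev_preds = curr_preds
        # Uncertainty sampling: query the pool examples the model is least sure about.
        probs = model.predict_proba(X[pool])[:, 1]
        queried = [pool[i] for i in np.argsort(np.abs(probs - 0.5))[:batch_size]]
        labeled.extend(queried)                            # would go to annotators
        pool = [i for i in pool if i not in queried]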
The Use of Unlabeled Data versus Labeled Data for Stopping Active Learning for Text Classification
Beatty, Garrett, Kochis, Ethan, Bloodgood, Michael
Annotation of training data is the major bottleneck in the creation of text classification systems. Active learning is a commonly used technique to reduce the amount of training data one needs to label. A crucial aspect of active learning is determining when to stop labeling data. Three potential sources for informing when to stop active learning are an additional labeled set of data, an unlabeled set of data, and the training data that is labeled during the process of active learning. To date, no one has compared and contrasted the advantages and disadvantages of stopping methods based on these three information sources. We find that stopping methods that use unlabeled data are more effective than methods that use labeled data.
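As an illustration of the contrast studied here, the sketch below shows the two kinds of stopping checks side by side: one driven by F-measure on a labeled validation set and one driven by agreement between successive models on an unlabeled stop set. The threshold values and the use of Cohen's Kappa as the agreement measure are illustrative choices, not the exact criteria evaluated in the paper.

    # Minimal sketch, assuming scikit-learn metrics and already-fitted models.
    from sklearn.metrics import cohen_kappa_score, f1_score

    def stop_on_labeled(prev_model, curr_model, X_val, y_val, epsilon=0.005):
        # Labeled-data stopping: stop when F-measure on a labeled validation set
        # has essentially stopped improving between successive models.
        prev_f = f1_score(y_val, prev_model.predict(X_val))
        curr_f = f1_score(y_val, curr_model.predict(X_val))
        return (curr_f - prev_f) < epsilon

    def stop_on_unlabeled(prev_model, curr_model, X_stop, kappa_threshold=0.99):
        # Unlabeled-data stopping: stop when successive models agree almost
        # perfectly (high Cohen's Kappa) on an unlabeled stop set; no labels needed.
        agreement = cohen_kappa_score(prev_model.predict(X_stop),
                                      curr_model.predict(X_stop))
        return agreement >= kappa_threshold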
Impact of Batch Size on Stopping Active Learning for Text Classification
Beatty, Garrett, Kochis, Ethan, Bloodgood, Michael
When using active learning, smaller batch sizes typically yield higher learning efficiency. However, in practice, due to speed and human annotator considerations, the use of larger batch sizes is necessary. While past work has shown that larger batch sizes decrease learning efficiency from a learning curve perspective, it remains an open question how batch size impacts methods for stopping active learning. We find that large batch sizes degrade the performance of a leading stopping method over and above the degradation that results from reduced learning efficiency. We analyze this degradation and find that it can be mitigated by changing the window size parameter that controls how many past iterations of learning are taken into account when making the stopping decision. We find that when using larger batch sizes, stopping methods are more effective when smaller window sizes are used.
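The window size parameter mentioned above can be sketched as follows: agreement values from the most recent iterations are collected and aggregated before the stopping decision is made. The aggregation (a mean over the window) and the example window sizes are illustrative; the point is only that a smaller window reacts faster when each iteration adds a large batch.

    # Minimal sketch of a windowed stopping decision.
    from collections import deque

    def make_windowed_stopper(window_size=3, threshold=0.99):
        history = deque(maxlen=window_size)
        def should_stop(agreement):
            history.append(agreement)
            # Stop only once a full window of recent agreement values is available
            # and their average exceeds the threshold.
            return len(history) == window_size and sum(history) / window_size >= threshold
        return should_stop

    stop_small_batch = make_windowed_stopper(window_size=3)  # e.g., with small batches
    stop_large_batch = make_windowed_stopper(window_size=1)  # smaller window for large batches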
Support Vector Machine Active Learning Algorithms with Query-by-Committee versus Closest-to-Hyperplane Selection
Bloodgood, Michael
This paper investigates and evaluates support vector machine active learning algorithms for use with imbalanced datasets, which commonly arise in many applications such as information extraction. Algorithms based on closest-to-hyperplane selection and query-by-committee selection are combined with methods for addressing imbalance, such as positive amplification based on prevalence statistics from initial random samples. Three algorithms (ClosestPA, QBagPA, and QBoostPA) are presented and carefully evaluated on datasets for text classification and relation extraction. The ClosestPA algorithm is shown to consistently outperform the other two in a variety of ways, and insights are provided as to why this is the case.
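A rough sketch of the two ingredients named in the ClosestPA algorithm, closest-to-hyperplane selection and positive amplification, is given below. The closest_pa function is a hypothetical simplification for illustration (reserving part of each batch for predicted positives), not the exact procedure evaluated in the paper; model stands for any fitted classifier exposing a decision_function, such as an SVM.

    # Rough sketch, assuming `model` exposes decision_function (e.g., an SVM).
    import numpy as np

    def closest_to_hyperplane(model, X_pool, batch_size):
        # Smallest |distance to the hyperplane| marks the most uncertain examples.
        margins = np.abs(model.decision_function(X_pool))
        return np.argsort(margins)[:batch_size]

    def closest_pa(model, X_pool, batch_size, positive_share=0.5):
        # Hypothetical simplification of positive amplification: reserve part of the
        # batch for predicted-positive examples closest to the hyperplane, to counter
        # class imbalance estimated from an initial random sample.
        margins = model.decision_function(X_pool)
        order = np.argsort(np.abs(margins))
        reserved = [i for i in order if margins[i] > 0][:int(batch_size * positive_share)]
        rest = [i for i in order if i not in reserved]
        return np.array((reserved + rest)[:batch_size])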
Using Global Constraints and Reranking to Improve Cognates Detection
Bloodgood, Michael, Strauss, Benjamin
Global constraints and reranking have not been used in cognates detection research to date. We propose methods for using global constraints by performing rescoring of the score matrices produced by state-of-the-art cognates detection systems. Using global constraints to perform rescoring is complementary to state-of-the-art methods for performing cognates detection. It results in significant performance improvements beyond the current state of the art on publicly available datasets with different language pairs and under various conditions, such as different levels of baseline performance and different data size conditions, including more realistic large data size conditions than have been evaluated in the past.
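As one hedged illustration of what a global constraint over a score matrix can look like, the sketch below imposes a soft one-to-one pairing between source and target words via an optimal assignment and boosts the globally consistent pairs. This is only an interpretation for illustration; the paper's rescoring procedure may differ in its constraints and in how rescored values are combined with the base scores.

    # Illustrative sketch, assuming NumPy and SciPy.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def rescore_with_assignment(score_matrix, bonus=1.0):
        # score_matrix[i, j]: base system's cognate score for source word i and
        # target word j. The optimal one-to-one assignment acts as a global constraint.
        rows, cols = linear_sum_assignment(score_matrix, maximize=True)
        rescored = score_matrix.copy()
        rescored[rows, cols] += bonus   # boost the globally consistent pairs
        return rescored

    scores = np.array([[0.9, 0.8, 0.1],
                       [0.7, 0.6, 0.2],
                       [0.1, 0.3, 0.5]])
    print(rescore_with_assignment(scores))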
Filtering Tweets for Social Unrest
Mishler, Alan, Wonus, Kevin, Chambers, Wendy, Bloodgood, Michael
Since the events of the Arab Spring, there has been increased interest in using social media to anticipate social unrest. While efforts have been made toward automated unrest prediction, we focus on filtering the vast volume of tweets to identify those relevant to unrest, which can be provided to downstream users for further analysis. We train a supervised classifier that is able to label Arabic-language tweets as relevant to unrest with high reliability. We examine the relationship between training data size and performance and investigate ways to optimize the model-building process while minimizing cost. We also explore how confidence thresholds can be set to achieve desired levels of performance.
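The confidence-threshold idea can be sketched as follows: on a held-out labeled set, pick the lowest classifier score threshold whose precision meets a desired target, so that recall is kept as high as possible. The target value, toy data, and use of scikit-learn's precision_recall_curve are illustrative assumptions, not the paper's exact setup.

    # Minimal sketch, assuming scikit-learn and toy held-out data.
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def threshold_for_precision(y_true, y_scores, target_precision=0.9):
        precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
        # Thresholds are returned in increasing order; take the lowest one whose
        # precision meets the target so that recall stays as high as possible.
        for p, t in zip(precisions[:-1], thresholds):
            if p >= target_precision:
                return t
        return None  # target precision not attainable on this held-out set

    y_true = np.array([0, 0, 1, 1, 1, 0, 1])
    y_scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.90])
    print(threshold_for_precision(y_true, y_scores, target_precision=0.75))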
Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection
Bloodgood, Michael, Strauss, Benjamin
Many important forms of data are stored digitally in XML format. Errors can occur in the textual content of the data in the fields of the XML documents. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data. There is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data frequently stored in XML format that often have errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data, including uncommon characters, text length, character-based language models, word-based language models, tied-field length ratios, and tied-field transliteration models. Four of the systems detect errors based on expectations automatically inferred from content within elements of a single field type. We call these single-field systems. Two of the systems detect errors based on correspondence expectations automatically inferred from content within elements of multiple related field types. We call these tied-field systems. For each system, we provide an intuitive analysis of the type of error that it is successful at detecting. Finally, we describe two larger-scale evaluations, one using crowdsourcing with Amazon's Mechanical Turk platform and one using the annotations of a domain expert. The evaluations consistently show that the systems are useful for improving the efficiency with which errors in XML electronic dictionaries can be detected.
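As a sketch of one of the single-field signals listed above, the following trains a character bigram model over the values of a field type and scores each entry by its average log-probability; unusually low scores are flagged for review. The smoothing scheme, vocabulary size, and toy field values are arbitrary choices for illustration.

    # Minimal sketch of a single-field, character-based language model signal.
    import math
    from collections import Counter

    def train_char_bigram(texts):
        bigrams, unigrams = Counter(), Counter()
        for text in texts:
            padded = "^" + text + "$"
            unigrams.update(padded[:-1])
            bigrams.update(zip(padded, padded[1:]))
        return bigrams, unigrams

    def avg_log_prob(text, bigrams, unigrams, vocab_size=10000):
        padded = "^" + text + "$"
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            # Add-one smoothing over an assumed character vocabulary size.
            total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        return total / (len(padded) - 1)

    field_values = ["kitab", "kalima", "bayt", "madrasa", "qalam"]  # toy field contents
    model = train_char_bigram(field_values)
    # Entries with unusually low average log-probability get flagged for review.
    for value in ["kitaab", "k1t@b#"]:
        print(value, round(avg_log_prob(value, *model), 3))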
Analysis of Stopping Active Learning based on Stabilizing Predictions
Bloodgood, Michael, Grothendieck, John
Within the natural language processing (NLP) community, active learning has been widely investigated and applied in order to alleviate the annotation bottleneck faced by developers of new NLP systems and technologies. This paper presents the first theoretical analysis of stopping active learning based on stabilizing predictions (SP). The analysis has revealed three elements that are central to the success of the SP method: (1) bounds on Cohen's Kappa agreement between successively trained models impose bounds on differences in F-measure performance of the models; (2) since the stop set does not have to be labeled, it can be made large in practice, helping to guarantee that the results transfer to previously unseen streams of examples at test/application time; and (3) good (low variance) sample estimates of Kappa between successive models can be obtained. Proofs of relationships between the level of Kappa agreement and the difference in performance between consecutive models are presented. Specifically, if the Kappa agreement between two models exceeds a threshold T (where $T>0$), then the difference in F-measure performance between those models is bounded above by $\frac{4(1-T)}{T}$ in all cases. If precision of the positive conjunction of the models is assumed to be $p$, then the bound can be tightened to $\frac{4(1-T)}{(p+1)T}$.
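A minimal sketch of the SP check that this analysis studies is given below: Cohen's Kappa is computed between successive models' predictions on a large unlabeled stop set and compared against a threshold T. The printed bounds follow directly from the expressions stated above; the example values of T and p are illustrative.

    # Minimal sketch, assuming scikit-learn's Cohen's Kappa implementation.
    from sklearn.metrics import cohen_kappa_score

    def sp_should_stop(prev_preds, curr_preds, T=0.99):
        # prev_preds / curr_preds: predictions of consecutive models on the same
        # unlabeled stop set; no labels are needed to compute the agreement.
        return cohen_kappa_score(prev_preds, curr_preds) >= T

    T = 0.99
    print("F-measure difference bound, 4(1-T)/T:", 4 * (1 - T) / T)          # ~0.0404
    p = 0.8  # assumed precision of the positive conjunction of the models
    print("Tightened bound, 4(1-T)/((p+1)T):", 4 * (1 - T) / ((p + 1) * T))  # ~0.0224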
Statistical modality tagging from rule-based annotations and crowdsourcing
Prabhakaran, Vinodkumar, Bloodgood, Michael, Diab, Mona, Dorr, Bonnie, Levin, Lori, Piatko, Christine D., Rambow, Owen, Van Durme, Benjamin
We explore training an automatic modality tagger. Modality is the attitude that a speaker might have toward an event or state. One of the main hurdles for training a linguistic tagger is gathering training data. This is particularly problematic for training a tagger for modality because modality triggers are sparse for the overwhelming majority of sentences. We investigate an approach to automatically training a modality tagger in which we first gather sentences using a simple, high-recall rule-based modality tagger and then provide these sentences to Mechanical Turk annotators for further annotation. We use the resulting set of training data to train a precise modality tagger based on a multi-class SVM that delivers good performance.
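The high-recall rule-based step can be sketched as simple trigger matching: any sentence containing a modality trigger word is pulled for crowdsourced annotation. The trigger list below is a tiny illustrative sample, not the lexicon or rules used in the paper.

    # Tiny sketch of a high-recall trigger matcher (illustrative trigger list only).
    TRIGGERS = {"must", "should", "might", "may", "can", "want", "need", "able"}

    def candidate_sentences(sentences):
        # Favor recall: any sentence containing a trigger token is sent to annotators.
        for sentence in sentences:
            tokens = {tok.lower().strip(".,!?") for tok in sentence.split()}
            if tokens & TRIGGERS:
                yield sentence

    docs = ["She might attend the meeting.", "The meeting starts at noon.",
            "They must file the report."]
    print(list(candidate_sentences(docs)))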
Use of Modality and Negation in Semantically-Informed Syntactic MT
Baker, Kathryn, Bloodgood, Michael, Dorr, Bonnie J., Callison-Burch, Chris, Filardo, Nathaniel W., Piatko, Christine, Levin, Lori, Miller, Scott
This paper describes the resource- and system-building efforts of an eight-week Johns Hopkins University Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically-Informed Machine Translation (SIMT). We describe a new modality/negation (MN) annotation scheme, the creation of a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation) and a holder (an experiencer of modality). We describe how our MN lexicon was semi-automatically produced and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set. We apply our MN annotation scheme to statistical machine translation using a syntactic framework that supports the inclusion of semantic annotations. Syntactic tags enriched with semantic annotations are assigned to parse trees in the target-language training texts through a process of tree grafting. While the focus of our work is modality and negation, the tree grafting procedure is general and supports other types of semantic information. We exploit this capability by including named entities, produced by a pre-existing tagger, in addition to the MN elements produced by the taggers described in this paper. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English test set. This finding supports the hypothesis that both syntactic and semantic information can improve translation quality.
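The tree grafting step can be illustrated roughly as follows: given a target-language parse tree and a trigger word, the label of the smallest constituent dominating the trigger is enriched with a semantic suffix. The toy tree, trigger, and "-MOD" suffix are hypothetical stand-ins, not the SIMT system's actual tagset or grafting rules.

    # Rough sketch, assuming NLTK's Tree class; toy tree and tag only.
    from nltk import Tree

    def graft_semantic_tag(tree, trigger_word, suffix="-MOD"):
        # Relabel the smallest constituent whose leaves contain the trigger word.
        best = None
        for subtree in tree.subtrees():
            if trigger_word in subtree.leaves():
                if best is None or len(subtree.leaves()) < len(best.leaves()):
                    best = subtree
        if best is not None:
            best.set_label(best.label() + suffix)
        return tree

    parse = Tree.fromstring("(S (NP (PRP She)) (VP (MD must) (VP (VB leave))))")
    print(graft_semantic_tag(parse, "must"))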