AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

Machine Learning for Seizure Type Classification: Setting the benchmark

Roy, Subhrajit, Asif, Umar, Tang, Jianbin, Harrer, Stefan

arXiv.org Machine Learning, Feb-3-2019

Accurate classification of seizure types plays a crucial role in the treatment and disease management of epileptic patients. Seizure type affects not only the choice of drugs but also the range of activities a patient can safely engage in. With recent advances towards artificial-intelligence-enabled automatic seizure detection, the next frontier is the automatic classification of seizure types. In this paper, we undertake the first study to explore the application of machine learning algorithms to multi-class seizure type classification. We used the recently released TUH EEG Seizure Corpus and conducted a thorough search space exploration to evaluate the performance of a combination of various pre-processing techniques, machine learning algorithms, and corresponding hyperparameters on this task. We show that our algorithms can reach a weighted F1 score of up to 0.907, thereby setting the first benchmark for scalp-EEG-based multi-class seizure type classification.
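The abstract reports a weighted F1 score without defining it. A minimal sketch follows, assuming "weighted" means the support-weighted average of per-class F1 scores (the convention behind scikit-learn's `average='weighted'` option). The function name and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1  # weight each class by its share of examples
    return score


# Toy labels, made up for illustration only.
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means frequent classes dominate the score, which matters when class sizes are imbalanced.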

classification, seizure, seizure type classification, (9 more...)

arXiv.org Machine Learning

1902.01012

Country:

Oceania > Australia (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report (0.65)

Industry: Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.71)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.73)

The Use of Unlabeled Data versus Labeled Data for Stopping Active Learning for Text Classification

Beatty, Garrett, Kochis, Ethan, Bloodgood, Michael

arXiv.org Machine Learning, Jan-25-2019

Abstract: Annotation of training data is the major bottleneck in the creation of text classification systems. Active learning is a commonly used technique to reduce the amount of training data one needs to label. A crucial aspect of active learning is determining when to stop labeling data. Three potential sources for informing when to stop active learning are an additional labeled set of data, an unlabeled set of data, and the training data that is labeled during the process of active learning. To date, no one has compared and contrasted the advantages and disadvantages of stopping methods based on these three information sources. We find that stopping methods that use unlabeled data are more effective than methods that use labeled data.

I. INTRODUCTION

Active learning has been used to train machine learning models as a way to reduce annotation costs for text and speech processing applications [1], [2], [3], [4], [5]. Active learning has been shown to have a particularly large potential for reducing annotation cost for text classification [6], [7]. Text classification is one of the most important fields in semantic computing and it has been used in many applications [8], [9], [10], [11], [12].
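The excerpt does not describe how an unlabeled set is used to decide when to stop. As one generic illustration (not the authors' method), the sketch below stops once successive models agree on nearly all predictions for a fixed unlabeled "stop set" over several consecutive iterations. The function names, the `threshold` of 0.99, and the `window` of 3 are all hypothetical choices.

```python
def agreement(prev_preds, curr_preds):
    """Fraction of stop-set items on which two successive models agree."""
    if len(prev_preds) != len(curr_preds):
        raise ValueError("prediction lists must cover the same stop set")
    matches = sum(p == c for p, c in zip(prev_preds, curr_preds))
    return matches / len(curr_preds)


def should_stop(history, threshold=0.99, window=3):
    """Decide whether to stop labeling.

    history: one list of predictions per iteration, all on the same
    unlabeled stop set. Stops when each of the last `window` consecutive
    model pairs agrees on at least `threshold` of the items.
    """
    if len(history) < window + 1:
        return False  # not enough iterations to judge stability
    recent = history[-(window + 1):]
    return all(agreement(a, b) >= threshold for a, b in zip(recent, recent[1:]))


# Toy prediction histories, made up for illustration only.
stable = ["pos", "neg", "neg", "pos"]
print(should_stop([stable, stable, stable, stable]))
```

No labels are needed for the stop set, which is the practical appeal of this family of criteria: the set can be large without adding annotation cost.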

active learning, learning, validation, (14 more...)

arXiv.org Machine Learning

1901.09126

Country:

North America > United States > New Jersey > Mercer County > Ewing (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)

Stopping Active Learning based on Predicted Change of F Measure for Text Classification

Altschuler, Michael, Bloodgood, Michael

arXiv.org Machine Learning, Jan-25-2019

Abstract: During active learning, an effective stopping method allows users to limit the number of annotations, which is cost effective. In this paper, a new stopping method called Predicted Change of F Measure is introduced that attempts to provide users with an estimate of how much the model's performance is changing at each iteration. This stopping method can be applied with any base learner. This method is useful for reducing the data annotation bottleneck encountered when building text classification systems.

I. INTRODUCTION

Active learning has been used to train machine learning models as a way to reduce annotation costs for text and speech processing applications [1], [2], [3], [4], [5]. Active learning has been shown to have a particularly large potential for reducing annotation cost for text classification [6], [7]. Text classification is one of the most important fields in semantic computing and it has been used in many applications [8], [9], [10], [11], [12].

A. Active Learning

Active learning is a form of machine learning that gives the model the ability to select the data from which it wants to learn and to choose when to end the process of training. In active learning, the model is first provided a small batch of annotated data to be trained on. Then, in each following iteration, the model selects a small batch and removes this batch from a large unlabeled set of examples.
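The loop described above (train on a small annotated seed, then repeatedly select a batch from the unlabeled pool, remove it, and retrain) can be sketched as follows. This is a minimal sketch under stated assumptions: the selection strategy here is uncertainty sampling, the model is a toy one-dimensional threshold, and a fixed iteration budget stands in for a real stopping method. None of these choices, nor any of the function names, come from the paper.

```python
def active_learning_loop(seed, pool, oracle, train, select, batch_size, max_iterations):
    """Pool-based active learning with a fixed iteration budget.

    seed: initial (example, label) pairs; pool: unlabeled examples.
    oracle, train and select are caller-supplied (hypothetical) callables.
    """
    labeled = list(seed)
    unlabeled = list(pool)
    model = train(labeled)  # initial model from the small annotated batch
    for _ in range(max_iterations):
        if not unlabeled:
            break
        batch = select(model, unlabeled, batch_size)
        for x in batch:
            unlabeled.remove(x)             # take the example out of the pool
            labeled.append((x, oracle(x)))  # ask the annotator for its label
        model = train(labeled)              # retrain on the enlarged labeled set
    return model, labeled, unlabeled


def train_threshold(labeled):
    """Toy 1-D model: threshold halfway between the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def select_most_uncertain(threshold, unlabeled, batch_size):
    """Uncertainty sampling: pick the points closest to the decision threshold."""
    return sorted(unlabeled, key=lambda x: abs(x - threshold))[:batch_size]


# Toy data, made up for illustration; the oracle plays the human annotator.
model, labeled, unlabeled = active_learning_loop(
    seed=[(0.0, 0), (10.0, 1)],
    pool=[1.0, 2.0, 4.0, 4.5, 5.5, 6.0, 8.0, 9.0],
    oracle=lambda x: int(x > 5.0),
    train=train_threshold,
    select=select_most_uncertain,
    batch_size=2,
    max_iterations=2,
)
print(len(labeled), unlabeled)
```

In practice the fixed `max_iterations` budget would be replaced by a stopping criterion evaluated after each retraining step.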

active learning, annotation, proceedings, (14 more...)

arXiv.org Machine Learning

1901.09118

Country:

North America > United States > New Jersey > Mercer County > Ewing (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(10 more...)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.30)

Overlooked No More: Karen Sparck Jones, Who Established the Basis for Search Engines

#artificialintelligence, Jan-3-2019, 19:44:36 GMT

"All words in a natural language are ambiguous; they have multiple senses," she said in an oral history interview for the History Center of the Institute of Electrical and Electronics Engineers. "How do you find out which sense they've got in any particular use?" In 1964, Sparck Jones published "Synonymy and Semantic Classification," which is now seen as a foundational paper in the field of natural language processing. In 1972, she introduced the concept of inverse document frequency, which weights a term by how rare it is across the documents in a collection in order to determine the term's importance; it, too, is a foundation of modern search engines. Sparck Jones began working on early speech recognition systems in the 1980s.

information retrieval, natural language, text classification, (7 more...)
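Inverse document frequency, mentioned in the excerpt above, is commonly written as idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain term t. The sketch below uses that textbook form; other variants (different log bases, smoothing terms) exist, and the function name and toy documents are illustrative.

```python
import math


def inverse_document_frequency(term, documents):
    """idf(t) = log(N / df(t)) for a collection of tokenized documents.

    documents: a list of sets of tokens, one set per document.
    """
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)  # documents containing the term
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(n_docs / df)


# Toy collection, made up for illustration only.
docs = [
    {"the", "cat"},
    {"the", "dog"},
    {"the", "bird", "cat"},
]
for term in ("the", "cat", "dog"):
    print(term, inverse_document_frequency(term, docs))
```

A term that appears in every document gets a weight of zero, while a term found in only one document gets the largest weight, which is why rare terms count for more when ranking search results.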

#artificialintelligence

Genre: Personal > Obituary (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.63)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.59)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.59)

Weakly-Supervised Hierarchical Text Classification

Meng, Yu, Shen, Jiaming, Zhang, Chao, Han, Jiawei

arXiv.org Artificial Intelligence, Dec-28-2018

Hierarchical text classification, which aims to classify text documents into a given hierarchy, is an important task in many real-world applications. Deep neural models have recently gained popularity for text classification due to their expressive power and minimal need for feature engineering. However, applying deep neural networks to hierarchical text classification remains challenging, because they rely heavily on a large amount of training data and cannot easily determine the appropriate levels of documents in the hierarchical setting. In this paper, we propose a weakly-supervised neural method for hierarchical text classification. Our method does not require a large amount of training data but requires only easy-to-provide weak supervision signals such as a few class-related documents or keywords. Our method effectively leverages such weak supervision signals to generate pseudo documents for model pre-training, and then performs self-training on real unlabeled data to iteratively refine the model. During the training process, our model features a hierarchical neural structure, which mimics the given hierarchy and is capable of determining the proper levels for documents with a blocking mechanism. Experiments on three datasets from different domains demonstrate the efficacy of our method compared with a comprehensive set of baselines.

classification, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

1812.11270

Country:

North America > United States > Illinois > Champaign County > Urbana (0.04)
Africa > Middle East > Libya > Benghazi District > Benghazi (0.04)

Genre: Research Report (0.82)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Reproducible evaluation of diffusion MRI features for automatic classification of patients with Alzheimer's disease

Wen, Junhao, Samper-Gonzalez, Jorge, Bottani, Simona, Routier, Alexandre, Burgos, Ninon, Jacquemont, Thomas, Fontanella, Sabrina, Durrleman, Stanley, Epelbaum, Stephane, Bertrand, Anne, Colliot, Olivier

arXiv.org Machine Learning, Dec-28-2018

Diffusion MRI is the modality of choice to study alterations of white matter. In recent years, various works have used diffusion MRI for automatic classification of Alzheimer's disease (AD). However, the performance obtained with different approaches is difficult to compare because of variations in components such as input data, participant selection, image preprocessing, feature extraction, feature selection (FS) and cross-validation (CV) procedure. Moreover, these studies are also difficult to reproduce because these different components are not readily available. In a previous work (Samper-Gonzalez et al. 2018), we proposed an open-source framework for the reproducible evaluation of AD classification from T1-weighted (T1w) MRI and PET data. In the present paper, we extend this framework to diffusion MRI data. The framework comprises: tools to automatically convert ADNI data into the BIDS standard, pipelines for image preprocessing and feature extraction, baseline classifiers and a rigorous CV procedure. We demonstrate the use of the framework by assessing the influence of diffusion tensor imaging (DTI) metrics (fractional anisotropy, FA; mean diffusivity, MD), feature types, imaging modalities (diffusion MRI or T1w MRI), data imbalance and FS bias. First, voxel-wise features generally performed better than regional features. Secondly, FA and MD provided comparable results for voxel-wise features. Thirdly, T1w MRI performed better than diffusion MRI. Fourthly, we demonstrated that using non-nested validation of FS leads to unreliable and over-optimistic results. All the code is publicly available: general-purpose tools have been integrated into the Clinica software (www.clinica.run) and the paper-specific code is available at: https://gitlab.icm-institute.org/aramislab/AD-ML.

alzheimer, classification, diffusion mri, (14 more...)

arXiv.org Machine Learning

1812.11183

Country:

Europe > France > Île-de-France > Paris > Paris (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > San Mateo County > Menlo Park (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (1.00)
Health & Medicine > Health Care Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Machine Learning in Official Statistics

Beck, Martin, Dumpert, Florian, Feuerhake, Joerg

arXiv.org Machine Learning, Dec-13-2018

On 10 October 2017, the development of a Digital Agenda of the Federal Statistical Office of Germany (Destatis) started (Statistisches Bundesamt 2018). One of many topics that were intensively discussed was Machine Learning. At a meeting on 13-15 November 2017, the office and department heads of Destatis evaluated and prioritised 59 measures of the Digital Agenda according to their benefits and costs. A "Proof of Concept Machine Learning" was given high priority and classified as one of four lighthouse projects of the Digital Agenda. The content specification was "Proof of Concept Machine Learning - Set up Proof of Concept for Machine Learning, e.g. in business statistics, to perform automatic categorization and improve analysis potential". The deadline for completion of the project was set for mid-2018.

joerg feuerhake, machine learning, statistics institution project name description, (8 more...)

arXiv.org Machine Learning

1812.10422

Country:

North America > United States (1.00)
Oceania > Australia (0.14)
Europe > Latvia (0.14)
(33 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Overview (1.00)
Research Report > New Finding (0.93)

Industry:

Law (1.00)
Information Technology (1.00)
Government > Voting & Elections (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

Exploiting Coarse-to-Fine Task Transfer for Aspect-level Sentiment Classification

Li, Zheng, Wei, Ying, Zhang, Yu, Zhang, Xiang, Li, Xin, Yang, Qiang

arXiv.org Machine Learning, Nov-16-2018

Aspect-level sentiment classification (ASC) aims at identifying sentiment polarities towards aspects in a sentence, where the aspect can be a general Aspect Category (AC) or a specific Aspect Term (AT). However, because labeling is especially expensive and labor-intensive, existing public corpora at the AT level are all relatively small. Meanwhile, most previous methods rely on complicated structures despite the scarce data, which largely limits the efficacy of the neural models. In this paper, we explore a new direction named coarse-to-fine task transfer, which aims to leverage knowledge learned from a rich-resource source domain of the coarse-grained AC task, which is more easily accessible, to improve learning in a low-resource target domain of the fine-grained AT task. To resolve both the aspect granularity inconsistency and the feature mismatch between domains, we propose a Multi-Granularity Alignment Network (MGAN). In MGAN, a novel Coarse2Fine attention guided by an auxiliary task helps model the AC task at the same fine-grained level as the AT task. To alleviate false alignment of features, a contrastive feature alignment method is adopted to align aspect-specific feature representations semantically. In addition, a large-scale multi-domain dataset for the AC task is provided. Empirically, extensive experiments demonstrate the effectiveness of the MGAN.

machine learning, natural language, text classification, (19 more...)

arXiv.org Machine Learning

1811.10999

Genre: Research Report (0.83)

Industry: Consumer Products & Services > Restaurants (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.72)

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Collins, Edward, Rozanov, Nikolai, Zhang, Bingbing

arXiv.org Artificial Intelligence, Nov-5-2018

Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation, but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We show that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. This measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural network based models on 78 real-world datasets, then used a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code ( https://github.com/Wluper/edm ) and datasets ( http://data.wluper.com ) are publicly available.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

1811.01910

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Compositional coding capsule network with k-means routing for text classification

Ren, Hao, Lu, Hong

arXiv.org Machine Learning, Oct-29-2018

Text classification is a challenging problem which aims to identify the category of texts. Capsule Networks (CapsNets) were recently proposed for image classification. It has been shown that CapsNets have several advantages over Convolutional Neural Networks (CNNs), while their validity in the text domain has been less explored. An effective method named deep compositional code learning has also been proposed lately. This method can greatly reduce the number of word-embedding parameters without any significant sacrifice in performance. In this paper, we introduce the Compositional Coding (CC) mechanism between capsules, and we propose a new routing algorithm, which is based on k-means clustering theory. Experiments conducted on eight challenging text classification datasets show the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.

machine learning, natural language, text classification, (19 more...)

arXiv.org Machine Learning

1810.09177

Country:

Africa > Nigeria (0.15)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > Promising Solution (0.36)

Industry: Banking & Finance (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
