AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)

arXiv.org Machine LearningAug-20-2020

Demographics Should Not Be the Reason of Toxicity: Mitigating Discrimination in Text Classifications with Instance Weighting

Zhang, Guanhua, Bai, Bing, Zhang, Junqi, Bai, Kun, Zhu, Conghui, Zhao, Tiejun

With the recent proliferation of the use of text classifications, researchers have found that there are certain unintended biases in text classification datasets. For example, texts containing some demographic identity-terms (e.g., "gay", "black") are more likely to be abusive in existing abusive language detection datasets. As a result, models trained with these datasets may consider sentences like "She makes me happy to be gay" as abusive simply because of the word "gay." In this paper, we formalize the unintended biases in text classification datasets as a kind of selection bias from the non-discrimination distribution to the discrimination distribution. Based on this formalization, we further propose a model-agnostic debiasing training framework by recovering the non-discrimination distribution using instance weighting, which does not require any extra resources or annotations apart from a pre-defined set of demographic identity-terms. Experiments demonstrate that our method can effectively alleviate the impacts of the unintended biases without significantly hurting models' generalization ability.

dataset, proceedings, unintended bias, (14 more...)

2004.14088

Country: Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Ragesh, Rahul, Sellamanickam, Sundararajan, Iyer, Arun, Bairi, Ram, Lingam, Vijay

HeteGCN: Heterogeneous Graph Convolutional Networks for Text Classification

arXiv.org Machine LearningAug-19-2020

We consider the problem of learning efficient and inductive graph convolutional networks for text classification with a large number of examples and features. Existing state-of-the-art graph embedding based methods such as predictive text embedding (PTE) and TextGCN have shortcomings in terms of predictive performance, scalability and inductive capability. To address these limitations, we propose a heterogeneous graph convolutional network (HeteGCN) modeling approach that unites the best aspects of PTE and TextGCN together. The main idea is to learn feature embeddings and derive document embeddings using a HeteGCN architecture with different graphs used across layers. We simplify TextGCN by dissecting into several HeteGCN models which (a) helps to study the usefulness of individual models and (b) offers flexibility in fusing learned embeddings from different models. In effect, the number of model parameters is reduced significantly, enabling faster training and improving performance in small labeled training set scenario. Our detailed experimental studies demonstrate the efficacy of the proposed approach.

machine learning, natural language, text classification, (17 more...)

2008.12842

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > Barbados > Saint Michael > Bridgetown (0.04)
(5 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

#artificialintelligenceAug-16-2020, 22:11:47 GMT

Text Classification with Hugging Face Transformers in TensorFlow 2 (Without Tears)

The Hugging Face transformers package is an immensely popular Python library providing pretrained models that are extraordinarily useful for a variety of natural language processing (NLP) tasks. It previously supported only PyTorch, but, as of late 2019, TensorFlow 2 is supported as well. While the library can be used for many tasks from Natural Language Inference (NLI) to Question-Answering, text classification remains one of the most popular and practical use cases. The ktrain library is a lightweight wrapper for tf.keras in TensorFlow 2. It is designed to make deep learning and AI more accessible and easier to apply for beginners and domain experts. As of version 0.8, ktrain now includes a simplified interface to Hugging Face transformers for text classification.

library, machine learning, natural language, (14 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

arXiv.org Machine LearningAug-14-2020

Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Clusters for Extreme Multi-label Text Classification

Ye, Hui, Chen, Zhiyu, Wang, Da-Han, Davison, Brian D.

Extreme multi-label text classification (XMTC) is a task for tagging a given text with the most relevant labels from an extremely large label set. We propose a novel deep learning method called APLC-XLNet. Our approach fine-tunes the recently released generalized autoregressive pretrained model (XLNet) to learn a dense representation for the input text. We propose Adaptive Probabilistic Label Clusters (APLC) to approximate the cross entropy loss by exploiting the unbalanced label distribution to form clusters that explicitly reduce the computational time. Our experiments, carried out on five benchmark datasets, show that our approach has achieved new state-of-the-art results on four benchmark datasets. Our source code is available publicly at https://github.com/huiyegit/APLC_XLNet.

large language model, machine learning, natural language, (5 more...)

2007.02439

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.64)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

arXiv.org Machine LearningJul-29-2020

Text-based classification of interviews for mental health -- juxtaposing the state of the art

Wouts, Joppe Valentijn

Currently, the state of the art for classification of psychiatric illness is based on audio-based classification. This thesis aims to design and evaluate a state of the art text classification network on this challenge. The hypothesis is that a well designed text-based approach poses a strong competition against the state-of-the-art audio based approaches. Dutch natural language models are being limited by the scarcity of pre-trained monolingual NLP models, as a result Dutch natural language models have a low capture of long range semantic dependencies over sentences. For this issue, this thesis presents belabBERT, a new Dutch language model extending the RoBERTa[15] architecture. belabBERT is trained on a large Dutch corpus (+32GB) of web crawled texts. After this thesis evaluates the strength of text-based classification, a brief exploration is done, extending the framework to a hybrid text- and audio-based classification. The goal of this hybrid framework is to show the principle of hybridisation with a very basic audio-classification network. The overall goal is to create the foundations for a hybrid psychiatric illness classification, by proving that the new text-based classification is already a strong stand-alone solution.

classification, machine learning, natural language, (21 more...)

2008.01543

Country:

Europe > Netherlands > North Holland > Amsterdam (0.05)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

#artificialintelligenceJul-25-2020, 00:11:06 GMT

Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

In this article, using NLP and Python, I will explain 3 different strategies for text multiclass classification: the old-fashioned Bag-of-Words (with Tf-Idf), the famous Word Embedding (with Word2Vec), and the cutting edge Language models (with BERT). NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied for classifying text data. Text classification is the problem of assigning categories to text data according to its content. There are different techniques to extract information from raw text data and use it to train a classification model.

machine learning, natural language, tf-idf vs word2vec vs bert, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.64)

#artificialintelligenceJun-4-2020, 09:33:53 GMT

Text Classification using Neural Networks

Understanding how chatbots work is important. A fundamental piece of machinery inside a chat-bot is the text classifier. Let's look at the inner workings of an artificial neural network (ANN) for text classification.

chatbot, machine learning, text classification, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.88)

arXiv.org Machine LearningMay-29-2020

SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions

Ye, Mao, Gong, Chengyue, Liu, Qiang

State-of-the-art NLP models can often be fooled by human-unaware transformations such as synonymous word substitution. For security reasons, it is of critical importance to develop models with certified robustness that can provably guarantee that the prediction is can not be altered by any possible synonymous word substitution. In this work, we propose a certified robust method based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions on the input sentences, and leverage the statistical properties of the ensemble to provably certify the robustness. Our method is simple and structure-free in that it only requires the black-box queries of the model outputs, and hence can be applied to any pre-trained models (such as BERT) and any types of models (world-level or subword-level). Our method significantly outperforms recent state-of-the-art methods for certified robustness on both IMDB and Amazon text classification tasks. To the best of our knowledge, we are the first work to achieve certified robustness on large systems such as BERT with practically meaningful certified accuracy.

accuracy, machine learning, natural language, (20 more...)

2005.14424

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.49)

González-Carvajal, Santiago, Garrido-Merchán, Eduardo C.

Comparing BERT against traditional machine learning text classification

arXiv.org Machine LearningMay-26-2020

The BERT model has arisen as a popular state-of-the-art machine learning model in the recent years that is able to cope with multiple NLP tasks such as supervised text classification without human supervision. Its flexibility to cope with any type of corpus delivering great results has make this approach very popular not only in academia but also in the industry. Although, there are lots of different approaches that have been used throughout the years with success. In this work, we first present BERT and include a little review on classical NLP approaches. Then, we empirically test with a suite of experiments dealing different scenarios the behaviour of BERT against the traditional TF-IDF vocabulary fed to machine learning algorithms. Our purpose of this work is to add empirical evidence to support or refuse the use of BERT as a default on NLP tasks. Experiments show the superiority of BERT and its independence of features of the NLP problem such as the language of the text adding empirical evidence to use BERT as a default technique to be used in NLP problems.

machine learning, natural language, text classification, (19 more...)

2005.13012

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.06)
Europe > Spain > Galicia > Madrid (0.05)
Europe > France > Bourgogne-Franche-Comté > Doubs > Besançon (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.74)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)