 Text Classification


Adding Interpretability to Multiclass Text Classification models

#artificialintelligence

Explain Like I am 5. It is one of the basic tenets of learning for me, where I try to distill any concept into a more palatable form. I couldn't reduce it to the freshman level. That means we don't really understand it. So, when I saw the ELI5 library, which aims to interpret machine learning models, I just had to try it out. One of the basic problems we face when explaining our complex machine learning classifiers to the business is interpretability.


Advances in Machine Learning for the Behavioral Sciences

arXiv.org Machine Learning

This is most apparent when auto-encoders are trained, where a network is trained to map the input data upon itself but is forced to project them into a lower-dimensional embedding space on the way (Vincent et al., 2010). In addition to the conventional fully connected layers, there are various special types of network connections. For example, in computer vision, convolutional layers are commonly used, which train multiple sliding windows that move over the image data and process just a part of the image at a time, thereby learning to recognize local features. These layers are subsequently abstracted into more and more complex visual patterns (Krizhevsky et al., 2017). For temporal data, one can use recurrent neural networks, which do not make predictions for individual input vectors, but for a sequence of input vectors. To do so, they allow feeding abstracted information from previous data points forward to the next layers.
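
The sliding-window idea behind convolutional layers can be illustrated with a minimal one-dimensional sketch. It is not taken from the paper: the function name `conv1d_valid`, the toy signal, and the hand-picked kernel are assumptions for illustration, and a real layer would learn many kernels from data. Like most deep learning libraries, the sketch computes a cross-correlation (the kernel is not flipped).

```python
from typing import List


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """Slide `kernel` over `signal` and return one dot product per window.

    Only positions where the kernel fits entirely inside the signal are used,
    so the output has len(signal) - len(kernel) + 1 entries.
    """
    k = len(kernel)
    if k == 0 or k > len(signal):
        return []
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical hand-picked kernel that responds to a rising or falling edge.
edge_kernel = [-1.0, 1.0]
toy_signal = [0.0, 0.0, 1.0, 1.0, 0.0]

print(conv1d_valid(toy_signal, edge_kernel))  # [0.0, 1.0, 0.0, -1.0]
```

Each output value depends only on a small local window of the input, which is why such layers pick up local features such as edges.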


Tensorflow 2.0 Data Transformation for Text Classification

#artificialintelligence

In this article, we will utilize Tensorflow 2.0 and Python to create an end-to-end process for classifying movie reviews. Most Tensorflow tutorials focus on how to design and train a model using a preprocessed dataset. Typically, preprocessing the data is the most time-consuming part of an AI project. This article will walk you through this process. Note: we are not trying to generate a state-of-the-art classification model here.
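
As a rough illustration of what text preprocessing for a classifier usually involves, the sketch below tokenizes reviews, builds a vocabulary, and pads each review to a fixed length. It uses only the Python standard library and is not the article's code or a TensorFlow API; the helper names, the reserved ids `PAD` and `UNK`, and the toy reviews are all assumptions.

```python
import re
from collections import Counter
from typing import Dict, List

PAD, UNK = 0, 1  # reserved ids for padding and unknown words (assumed convention)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts: List[str], max_size: int = 10000) -> Dict[str, int]:
    """Map the most frequent tokens to integer ids, starting after PAD and UNK."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Sort by descending count, then alphabetically, so ids are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i + 2 for i, (tok, _) in enumerate(ranked[: max_size - 2])}


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert text to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


reviews = ["A great movie, great cast!", "Terrible movie."]
vocab = build_vocab(reviews)
print(encode("Great unknown movie", vocab, max_len=5))  # [2, 1, 3, 0, 0]
```

Fixed-length integer sequences like these are the typical input to an embedding layer in a neural text classifier.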


A study of data and label shift in the LIME framework

arXiv.org Machine Learning

LIME is a popular approach for explaining a black-box prediction through an interpretable model that is trained on instances in the vicinity of the predicted instance. To generate these instances, LIME randomly selects a subset of the non-zero features of the predicted instance. After that, the perturbed instances are fed into the black-box model to obtain labels for these, which are then used for training the interpretable model. In this study, we present a systematic evaluation of the interpretable models that are output by LIME on the two use cases that were considered in the original paper introducing the approach: text classification and object detection. The investigation shows that the perturbation and labeling phases result in both data and label shift. In addition, we study the correlation between the shift and the fidelity of the interpretable model and show that in certain cases the shift negatively correlates with the fidelity. Based on these findings, it is argued that there is a need for a new sampling approach that mitigates the shift in the LIME framework.
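
The perturb-then-label loop described above can be sketched for text as follows. This is a simplified stand-in and not the LIME library: instead of fitting a weighted linear surrogate as LIME does, it scores each word by the difference between the mean black-box output when the word is kept and when it is dropped. The function names and the toy black box are assumptions for illustration.

```python
import random
from typing import Callable, List, Tuple


def perturb(n_tokens: int, rng: random.Random) -> List[bool]:
    """Return a mask that keeps a random non-empty subset of token positions."""
    n_keep = rng.randint(1, n_tokens)
    kept = set(rng.sample(range(n_tokens), n_keep))
    return [i in kept for i in range(n_tokens)]


def explain(
    tokens: List[str],
    black_box: Callable[[List[str]], float],
    n_samples: int = 500,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Estimate a per-token importance from black-box labels of perturbed inputs."""
    rng = random.Random(seed)
    n = len(tokens)
    kept_sum, kept_n = [0.0] * n, [0] * n
    dropped_sum, dropped_n = [0.0] * n, [0] * n
    for _ in range(n_samples):
        mask = perturb(n, rng)
        # Label the perturbed instance with the black-box model.
        score = black_box([t for t, keep in zip(tokens, mask) if keep])
        for i, keep in enumerate(mask):
            if keep:
                kept_sum[i] += score
                kept_n[i] += 1
            else:
                dropped_sum[i] += score
                dropped_n[i] += 1
    result = []
    for i, tok in enumerate(tokens):
        if kept_n[i] and dropped_n[i]:
            diff = kept_sum[i] / kept_n[i] - dropped_sum[i] / dropped_n[i]
        else:
            diff = 0.0  # token was never kept or never dropped in the sample
        result.append((tok, diff))
    return result


def toy_black_box(tokens: List[str]) -> float:
    """Hypothetical classifier: positive only if the word 'great' is present."""
    return 1.0 if "great" in tokens else 0.0


for token, importance in explain(["a", "great", "movie", "overall"], toy_black_box):
    print(f"{token}: {importance:+.2f}")
```

Because the perturbed inputs are sampled by dropping words, they can look quite different from real documents, which is the kind of data shift the study examines.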


Text Classification by MonkeyLearn - Simple and customizable text classification with AI Product Hunt

#artificialintelligence

Hello everybody! We're excited to share our **classification feature**; we've been working on it for a while now and iterating on it based on feedback from our customers. These are the highlights: **Active learning**: it minimizes the effort of tagging and training models. That means we had to build an extremely reliable service that can tackle high-volume transactions in real time. I wanted to share the **top 3 most frequent use cases** we've seen so far. We're amazed by how business teams are leveraging our AI technology in their operations without the need for technical skills!


Hierarchical Transformers for Long Document Classification

arXiv.org Machine Learning

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its major limitations - applicability to inputs longer than a few hundred words, such as transcripts of human call conversations. Our method is conceptually simple. We segment the input into smaller chunks and feed each of them into the base model. Then, we propagate each output through a single recurrent layer, or another transformer, followed by a softmax activation. We obtain the final classification decision after the last segment has been consumed. We show that both BERT extensions are quick to fine-tune and converge after as little as 1 epoch of training on a small, domain-specific data set. We successfully apply them in three different tasks involving customer call satisfaction prediction and topic classification, and obtain a significant improvement over the baseline models in two of them.
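
The segment-then-aggregate procedure can be sketched without any model, to show only the control flow. This is not the authors' code: the segment size, the overlap between neighbouring segments, and the placeholder callables `encode_segment` and `step_fn` (standing in for the base model and the recurrent layer) are assumptions for illustration.

```python
from typing import Any, Callable, List


def split_into_segments(
    tokens: List[str], size: int = 200, overlap: int = 50
) -> List[List[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start : start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return segments


def classify_long(
    tokens: List[str],
    encode_segment: Callable[[List[str]], Any],
    step_fn: Callable[[Any, Any], Any],
    initial_state: Any,
    size: int = 200,
    overlap: int = 50,
) -> Any:
    """Feed each segment to a base encoder and fold the outputs sequentially.

    The returned state is the one reached after the last segment is consumed.
    """
    state = initial_state
    for segment in split_into_segments(tokens, size, overlap):
        state = step_fn(state, encode_segment(segment))
    return state


words = [f"w{i}" for i in range(10)]
print(split_into_segments(words, size=4, overlap=1))
```

In a real system `encode_segment` would be the fine-tuned base model and `step_fn` a recurrent or transformer layer followed by a softmax.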


HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

arXiv.org Machine Learning

GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic repository classification problem as keyword-driven hierarchical classification. Specifically, users only need to provide a label hierarchy with keywords to supply as supervision. This setting is flexible, adaptive to the users' needs, accounts for the different granularity of topic labels and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In recognition of these challenges, we propose the HiGitClass framework, comprising three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that HiGitClass is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification.
Introduction. For the computer science field, code repositories are an indispensable part of the knowledge dissemination process, containing valuable details for reproduction. For software engineers, sharing code also promotes the adoption of best practices and accelerates code development. The needs of the scientific community and those of software developers have facilitated the growth of online code collaboration platforms, the most popular of which is GitHub, with over 96 million repositories and 31 million users as of 2018.
With the overwhelming number of repositories hosted on GitHub, there is a natural need to enable search functionality so that users can quickly target repositories of interest. To accommodate this need, GitHub introduced topic labels, which allow users to declare topics for their own repositories.


Learning Only from Relevant Keywords and Unlabeled Documents

arXiv.org Machine Learning

We consider a document classification problem where document labels are absent but only relevant keywords of a target class and unlabeled documents are given. Although heuristic methods based on pseudo-labeling have been considered, theoretical understanding of this problem has still been limited. Moreover, previous methods cannot easily incorporate well-developed techniques in supervised text classification. In this paper, we propose a theoretically guaranteed learning framework that is simple to implement and has flexible choices of models, e.g., linear models or neural networks. We demonstrate how to optimize the area under the receiver operating characteristic curve (AUC) effectively and also discuss how to adjust it to optimize other well-known evaluation metrics such as the accuracy and F1-measure. Finally, we show the effectiveness of our framework using benchmark datasets.
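
For readers unfamiliar with the metric, the AUC mentioned above equals the probability that a randomly chosen positive document is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way from labeled scores for evaluation purposes. It is not the paper's learning framework, which works without document labels, and the function name and toy scores are assumptions.

```python
from typing import Sequence


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of correctly ranked pairs."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(scores_pos) * len(scores_neg))


# Toy classifier scores: 5 of the 6 positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3]))
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a sketch; sorting-based formulations are the usual choice for large evaluation sets.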


Fine-grained Sentiment Classification using BERT

arXiv.org Machine Learning

Sentiment classification is an important process in understanding people's perception towards a product, service, or topic. Many natural language processing models have been proposed to solve the sentiment classification problem. However, most of them have focused on binary sentiment classification. In this paper, we use a promising deep learning model called BERT to solve the fine-grained sentiment classification task. Experiments show that our model outperforms other popular models for this task without sophisticated architecture. We also demonstrate the effectiveness of transfer learning in natural language processing in the process.


Automatic Classification of Sexual Harassment Cases

#artificialintelligence

In our case, the data was provided by Safecity India, a platform launched in 2012 that crowdsources personal stories of sexual harassment and abuse in public spaces [2]. They have collected over 10,000 stories from over 50 cities in India, Kenya, Cameroon, and Nepal. More specifically, they provided us a .csv ... In addition to the focal tasks of this project, and as part of the NLP channel, we decided to automate the category classification based on the sexual harassment case descriptions. Performing this classification task manually is time-consuming, and leaving it entirely in the hands of the victim could produce ambiguity in the discrimination of the categories.